Block floating point

Block floating point (BFP) is a method used to provide an arithmetic approaching floating point while using a fixed-point processor. BFP assigns a group of significands (the non-exponent part of the floating-point number) to a single exponent, rather than assigning each significand its own exponent. Because the exponent is reused, BFP can reduce the hardware space needed to perform the same functions as floating-point algorithms; some operations over multiple values between blocks can also be done with a reduced amount of computation.[1]
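As an illustration of the reduced computation between blocks, the following minimal Python sketch computes a dot product of two blocks with integer multiply-accumulate operations and a single exponent addition. The `Block` class, the `block_dot` function and the sample values are hypothetical names chosen for this example; they are not taken from any particular library or processor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A group of integer significands that share one power-of-two exponent."""
    significands: List[int]
    exponent: int  # each represented value is significand * 2**exponent

    def to_floats(self) -> List[float]:
        return [s * 2.0 ** self.exponent for s in self.significands]


def block_dot(a: Block, b: Block) -> float:
    """Dot product of two blocks.

    The significands are multiplied and accumulated as plain integers;
    the two shared exponents are added once for the whole block.
    """
    if len(a.significands) != len(b.significands):
        raise ValueError("blocks must have the same length")
    acc = sum(x * y for x, y in zip(a.significands, b.significands))
    return acc * 2.0 ** (a.exponent + b.exponent)


a = Block([3, -2, 5, 1], exponent=-2)   # 0.75, -0.5, 1.25, 0.25
b = Block([4, 4, -8, 2], exponent=-3)   # 0.5, 0.5, -1.0, 0.25
print(block_dot(a, b))                  # -1.0625
```

With per-value exponents, each product would need its own exponent addition and alignment before accumulation; here that work is done once per block.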

The common exponent is determined by the datum with the largest amplitude in the block. To find the value of the exponent, the number of leading zeros of that datum must be found (count leading zeros). This gives the number of left shifts needed to normalize the data to the dynamic range of the processor used. Some processors provide dedicated means for this, such as exponent detection and normalization instructions.[2][3]
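The procedure can be sketched in Python as follows. The 16-bit signed word width and the names `WORD_BITS`, `count_leading_zeros` and `normalize_block` are assumptions made for this example; a real DSP would typically use its own exponent-detection instruction instead of the software leading-zero count shown here.

```python
WORD_BITS = 16  # assumed signed word width of a hypothetical fixed-point processor


def count_leading_zeros(value: int, width: int = WORD_BITS) -> int:
    """Number of zero bits above the most significant set bit of a non-negative value."""
    if value < 0 or value >= 1 << width:
        raise ValueError("value does not fit in the given width")
    return width - value.bit_length()


def normalize_block(samples):
    """Return (shifted, exponent) such that samples[i] == shifted[i] * 2**exponent."""
    if not samples:
        raise ValueError("block must not be empty")
    peak = max(abs(s) for s in samples)  # datum with the largest amplitude
    if peak == 0:
        return list(samples), 0
    # One bit is reserved for the sign, so the usable headroom is the
    # leading-zero count minus one (never negative).
    shift = max(count_leading_zeros(peak) - 1, 0)
    return [s << shift for s in samples], -shift


shifted, exponent = normalize_block([100, -37, 512, 3])
print(shifted, exponent)  # [3200, -1184, 16384, 96] -5
```

The largest sample, 512, has six leading zeros in a 16-bit word, so every sample is shifted left by five bits and the block exponent records the compensating factor of 2**-5.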

Block floating-point algorithms were extensively studied by James Hardy Wilkinson.[4][5][6]

BFP can also be emulated in software, though with smaller performance gains.

Microscaling (MX) Formats

Microscaling (MX) formats are a type of Block Floating Point (BFP) data format specifically designed for AI and machine learning workloads. The MX format, endorsed and standardized by major industry players such as AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, represents a significant advancement in narrow precision data formats for AI.[7][8][9]

The MX format uses a single shared scaling factor (exponent) for a block of elements, significantly reducing the memory footprint and computational resources required for AI operations. Each block of k elements shares this common scaling factor, which is stored separately from the individual elements.
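The idea of a shared scaling factor stored apart from the elements can be sketched as follows. This is a simplified illustration, not the conversion procedure defined in the MX specification: it assumes signed integer elements, picks the smallest power-of-two scale that keeps the largest element in range, and uses the hypothetical names `quantize_block` and `dequantize_block`.

```python
import math


def quantize_block(values, elem_bits=8):
    """Return (shared_exponent, elements) for one block of k = len(values) numbers.

    Each value is approximated as element * 2**shared_exponent, where the
    elements are signed integers of elem_bits bits.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0, [0] * len(values)
    # Smallest exponent e such that peak / 2**e <= qmax.
    mantissa, exp = math.frexp(peak / qmax)
    shared_exponent = exp - 1 if mantissa == 0.5 else exp
    scale = 2.0 ** shared_exponent
    # Round to nearest and clamp as a guard against rounding at the boundary.
    elements = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exponent, elements


def dequantize_block(shared_exponent, elements):
    """Reconstruct approximate values from the shared exponent and the elements."""
    return [q * 2.0 ** shared_exponent for q in elements]


block = [0.5, -1.0, 0.25, 3.0]
e, q = quantize_block(block)
print(e, q)                    # -5 [16, -32, 8, 96]
print(dequantize_block(e, q))  # [0.5, -1.0, 0.25, 3.0]
```

Only one exponent is stored for the whole block, so the per-element cost is the element width plus a small amortized share of the scale.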

The initial MX specification introduces several specific formats, including MXFP8, MXFP6, MXFP4, and MXINT8. These formats support various precision levels:

  • MXFP8: 8-bit floating-point with two variants (E5M2 and E4M3).
  • MXFP6: 6-bit floating-point with two variants (E3M2 and E2M3).
  • MXFP4: 4-bit floating-point (E2M1).
  • MXINT8: 8-bit integer.
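The element names above encode their bit layout: an "ExMy" element has one sign bit, x exponent bits and y mantissa bits. The short sketch below derives the element width from the name and estimates the average storage cost per element once the shared scale is included. The block size of 32 and the 8-bit shared scale are assumptions stated here for the calculation, and `element_bits` and `average_bits` are hypothetical helper names.

```python
import re

BLOCK_SIZE = 32  # assumed number of elements k sharing one scale
SCALE_BITS = 8   # assumed width of the shared scale


def element_bits(name: str) -> int:
    """Bits per element for an 'ExMy' float (1 sign + x exponent + y mantissa) or 'INTn'."""
    match = re.fullmatch(r"E(\d+)M(\d+)", name)
    if match:
        return 1 + int(match.group(1)) + int(match.group(2))
    match = re.fullmatch(r"INT(\d+)", name)
    if match:
        return int(match.group(1))
    raise ValueError(f"unrecognized element format: {name}")


def average_bits(name: str, k: int = BLOCK_SIZE, scale_bits: int = SCALE_BITS) -> float:
    """Average storage per element, amortizing the shared scale over k elements."""
    return element_bits(name) + scale_bits / k


for fmt in ["E5M2", "E4M3", "E3M2", "E2M3", "E2M1", "INT8"]:
    print(fmt, element_bits(fmt), average_bits(fmt))
```

Under these assumptions the shared scale adds a quarter of a bit per element, so a 4-bit E2M1 element costs 4.25 bits on average.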

MX formats have been demonstrated to be effective in a variety of AI tasks, including large language models (LLMs), image classification, speech recognition and recommendation systems.[10] For instance, MXFP6 closely matches FP32 for inference tasks after quantization-aware fine-tuning, and MXFP4 can be used for training generative language models with only a minor accuracy penalty.

The MX format has been standardized through the Open Compute Project (OCP) as Microscaling Formats (MX) Specification v1.0.[7] An emulation library has also been published to provide details on the data science approach and select results of MX in action.[11]

Hardware support

The following hardware supports BFP operations:

  • d-Matrix Jayhawk II[12][13]
  • Tenstorrent Grayskull e75 and e150 (BFP8, BFP4 and BFP2)[14]
  • Tenstorrent Wormhole n150 and n300 (BFP8, BFP4 and BFP2)[14]
  • AMD Strix Point APU (branded as Ryzen AI 300 series) supports Block FP16 in its NPU[15][16]
  • AMD Versal AI Edge Series Gen 2 supports MX6 and MX9 data types
  • x86 processors implementing the AVX10.2 extension set support E5M2 and E4M3[17]

References

  1. ^ "Block floating point". BDTI DSP Dictionary. Berkeley Design Technology, Inc. (BDTI). Archived from the original on 2018-07-11. Retrieved 2015-11-01.
  2. ^ Chhabra, Arun; Iyer, Ramesh (December 1999). "TMS320C55x A Block Floating Point Implementation on the TMS320C54x DSP" (PDF) (Application report). Digital Signal Processing Solutions. Texas Instruments. SPRA610. Archived (PDF) from the original on 2018-07-11. Retrieved 2018-07-11.
  3. ^ Elam, David; Iovescu, Cesar (September 2003). "A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP" (PDF) (Application report). TMS320C5000 Software Applications. Texas Instruments. SPRA948. Archived (PDF) from the original on 2018-07-11. Retrieved 2015-11-01.
  4. ^ Wilkinson, James Hardy (1963). Rounding Errors in Algebraic Processes (1 ed.). Englewood Cliffs, NJ, USA: Prentice-Hall, Inc. ISBN 978-0-486-67999-0. MR 0161456.
  5. ^ Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1 ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.
  6. ^ Overton, Michael L. (2001). Numerical Computing with IEEE Floating Point Arithmetic - Including One Theorem, One Rule of Thumb and One Hundred and One Exercises (1 ed.). Society for Industrial and Applied Mathematics (SIAM). ISBN 0-89871-482-6. 9-780898-714821-90000.
  7. ^ a b "Open Compute Project". Open Compute Project. Retrieved 2024-06-03.
  8. ^ Rouhani, Bita Darvish; Zhao, Ritchie; More, Ankit; Hall, Mathew; Khodamoradi, Alireza; Deng, Summer; Choudhary, Dhruv; Cornea, Marius; Dellinger, Eric (2023-10-19). "Microscaling Data Formats for Deep Learning". arXiv:2310.10537 [cs.LG].
  9. ^ Borkar, Rani; D'Sa, Reynold (2023-10-17). "Fostering AI infrastructure advancements through standardization". Microsoft Azure Blog. Retrieved 2024-06-03.
  10. ^ Rouhani, Bita; Zhao, Ritchie; Elango, Venmugil; Shafipour, Rasoul; Hall, Mathew; Mesmakhosroshahi, Maral; More, Ankit; Melnick, Levi; Golub, Maximilian (2023-04-12). "With Shared Microexponents, A Little Shifting Goes a Long Way". arXiv:2302.08007 [cs.LG].
  11. ^ microsoft/microxcaling, Microsoft, 2024-05-29, retrieved 2024-06-03
  12. ^ Clarke, Peter (2023-08-28). "Chiplet-base generative AI platform raises LLM performance". eeNews Europe. Retrieved 2024-04-23.
  13. ^ [SPCL_Bcast] A chiplet based generative inference architecture with block floating point datatypes. Retrieved 2024-04-23 – via www.youtube.com.
  14. ^ a b "Tenstorrent AI Accelerators" (PDF).
  15. ^ Bonshor, Gavin. "AMD Announces The Ryzen AI 300 Series For Mobile: Zen 5 With RDNA 3.5, and XDNA2 NPU With 50 TOPS". www.anandtech.com. Retrieved 2024-06-03.
  16. ^ "AMD Extends AI and High-Performance Leadership in Data Center and PCs with New AMD Instinct, Ryzen and EPYC Processors at Computex 2024". Advanced Micro Devices, Inc. 2024-06-02. Retrieved 2024-06-03.
  17. ^ "Intel Advanced Vector Extensions 10.2 (Intel AVX10.2) Architecture Specification". Intel. 2024-10-16. p. 39. 361050-002US. Retrieved 2024-12-27.
