FP8 data type - all values in a table

Introduction

Floating-point numbers are a great invention. Thanks to dedicating separate bits to the sign, exponent, and mantissa (also called significand), they can represent a wide range of numbers in a limited number of bits - numbers that are positive or negative, very large or very small (close to zero), integer or fractional.

In programming, we typically use double-precision (64b) or single-precision (32b) numbers. These are the data types available in programming languages (like double and float in C/C++) and supported by processors, which can perform calculations on them efficiently. Those of you who deal with graphics programming using graphics APIs like OpenGL, DirectX, or Vulkan may know that some GPUs also support a 16-bit floating-point type, also known as half-float. For example, HLSL (the shader language used with DirectX) has offered the min16float type (with at least 16b of precision) since Windows 8, and an explicit 16b type, float16_t, was added in Shader Model 6.2. See also "Scalar data types" in the HLSL reference documentation.

Such a 16b "half" type obviously has limited precision and range compared to the "single" or "double" versions. I summarized the capabilities and limits of these 3 types in a table in my old "Floating-Point Formats Cheatsheet". Because of these limitations, using a half-float instead of the standard float may not work correctly in all cases. For example, it may be enough to represent the RGB components of a color (even when using HDR), or a normal vector pointing in some direction, but it won't be sufficient to accurately represent a position in 3D space. It is also easy to exceed its maximum range, e.g. when calculating a dot product of two vectors. I planned to write an entire article about the advantages and pitfalls of using half-float numbers in shaders, but I never did.
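
How easily the range runs out can be seen in a minimal NumPy sketch (my illustration here, not code from any shader):

    import numpy as np

    # fp16 has a maximum finite value of 65504, so even squaring 300 overflows:
    x = np.float16(300.0)
    print(x * x)  # inf, because 300 * 300 = 90000 > 65504

    # The same computation carried out in fp32 is fine:
    y = np.float32(x)
    print(y * y)  # 90000.0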

Machine learning and FP8

Now that artificial intelligence (AI) / machine learning (ML) is a popular topic, programmers use low-precision numbers in this domain as well. Similarly to graphics, such workloads involve lots of matrix multiplications and other regular operations where storage size, memory bandwidth, and fast computation play a crucial role. Next to the standard half-float (with 5b of exponent and 10b of mantissa), the "bfloat16" format (named after "brain float") was developed with 8b of exponent and 7b of mantissa, which extends the maximum range at the expense of precision.
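
A useful way to think about bfloat16 is that it keeps the upper 16 bits of a float32 bit pattern. Here is a minimal sketch in plain Python, converting by truncation rather than the proper rounding a real implementation would use:

    import struct

    def float_to_bfloat16_bits(x: float) -> int:
        # Keep the sign, the 8 exponent bits, and the top 7 mantissa bits of float32.
        (bits,) = struct.unpack("<I", struct.pack("<f", x))
        return bits >> 16

    def bfloat16_bits_to_float(b: int) -> float:
        # Widen back to float32 by appending 16 zero bits.
        (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
        return x

    b = float_to_bfloat16_bits(3.14159)
    print(hex(b), bfloat16_bits_to_float(b))  # 0x4049 3.140625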

Machine learning models are sometimes quantized to only 8-bit numbers. These may be integers, but 8-bit floating-point formats have also been designed. There are multiple proposed formats of such numbers. My favorite summary of them is the page "Float stored in 8 bits" in the ONNX documentation. The two main papers that introduce these formats are probably:

  1. "FP8 Formats for Deep Learning" by many authors from NVIDIA, Arm, and Intel.
  2. "8-bit Numerical Formats for Deep Neural Networks" by Badreddine Noune, Phil Jones, Daniel Justus, Dominic Masters, and Carlo Luschi.

They propose formats that have a sign bit plus 4 bits of exponent and 3 bits of mantissa (E4M3), or 5 bits of exponent and only 2 bits of mantissa (E5M2). E4M3 is preferred for storing weights and for inference (the forward pass), while E5M2, sometimes also called bf8, is preferred for storing gradients and for training (the backward pass). For many kinds of machine learning models, the authors tested and showed that these 8b data types can perform much better than int8 of the same storage size, and almost as well as the larger fp16 or bf16.

After learning about these 8b formats, I thought that 256 possible values is few enough that they could all be visualized in a 16x16 table. Thus, I prepared such tables for the 4 FP8 data types described in the ONNX documentation. In each table, the row header is the high nibble (upper 4 bits) of the byte and the column header is the low nibble:
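
Before the tables, a simplified Python sketch of how one such byte can be decoded (an illustration of the idea, not the full fp8_tables.py script linked at the end):

    def decode_fp8(byte, exp_bits, man_bits, bias, fn, uz):
        # Decode one FP8 byte. fn = finite (no infinities);
        # uz = unsigned zero (bit pattern 0x80 means NaN instead of -0).
        sign = -1.0 if byte & 0x80 else 1.0
        exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
        man = byte & ((1 << man_bits) - 1)
        if uz and byte == 0x80:
            return "NaN"
        if not uz and exp == (1 << exp_bits) - 1:
            if fn:
                # E4M3FN: only the all-ones mantissa encodes NaN;
                # the other codes with maximum exponent are normal numbers.
                if man == (1 << man_bits) - 1:
                    return "+NaN" if sign > 0 else "-NaN"
            else:
                # IEEE-style E5M2: maximum exponent encodes INF (mantissa 0) or NaN.
                if man == 0:
                    return "+INF" if sign > 0 else "-INF"
                return "+NaN" if sign > 0 else "-NaN"
        if exp == 0:
            # Subnormal: no implicit leading "1."
            return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
        return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

    FORMATS = {
        "FLOAT8E4M3FN":   dict(exp_bits=4, man_bits=3, bias=7,  fn=True,  uz=False),
        "FLOAT8E4M3FNUZ": dict(exp_bits=4, man_bits=3, bias=8,  fn=True,  uz=True),
        "FLOAT8E5M2":     dict(exp_bits=5, man_bits=2, bias=15, fn=False, uz=False),
        "FLOAT8E5M2FNUZ": dict(exp_bits=5, man_bits=2, bias=16, fn=True,  uz=True),
    }

For example, decode_fp8(0x7E, **FORMATS["FLOAT8E4M3FN"]) returns 448.0, the largest finite E4M3FN value.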

FLOAT8E4M3FN

This format has a sign bit + 4 bits of exponent + 3 bits of mantissa. It supports +0 and -0, +NaN and -NaN, but it doesn't have infinities, hence the "FN" (finite) in its name.

         0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
         0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101  1110  1111
0 0000:  +0  0.001953  0.003906  0.005859  0.007812  0.009766  0.01172  0.01367  0.01562  0.01758  0.01953  0.02148  0.02344  0.02539  0.02734  0.0293
1 0001:  0.03125  0.03516  0.03906  0.04297  0.04688  0.05078  0.05469  0.05859  0.0625  0.07031  0.07812  0.08594  0.09375  0.1016  0.1094  0.1172
2 0010:  0.125  0.1406  0.1562  0.1719  0.1875  0.2031  0.2188  0.2344  0.25  0.2812  0.3125  0.3438  0.375  0.4062  0.4375  0.4688
3 0011:  0.5  0.5625  0.625  0.6875  0.75  0.8125  0.875  0.9375  1  1.125  1.25  1.375  1.5  1.625  1.75  1.875
4 0100:  2  2.25  2.5  2.75  3  3.25  3.5  3.75  4  4.5  5  5.5  6  6.5  7  7.5
5 0101:  8  9  10  11  12  13  14  15  16  18  20  22  24  26  28  30
6 0110:  32  36  40  44  48  52  56  60  64  72  80  88  96  104  112  120
7 0111:  128  144  160  176  192  208  224  240  256  288  320  352  384  416  448  +NaN
8 1000:  -0  -0.001953  -0.003906  -0.005859  -0.007812  -0.009766  -0.01172  -0.01367  -0.01562  -0.01758  -0.01953  -0.02148  -0.02344  -0.02539  -0.02734  -0.0293
9 1001:  -0.03125  -0.03516  -0.03906  -0.04297  -0.04688  -0.05078  -0.05469  -0.05859  -0.0625  -0.07031  -0.07812  -0.08594  -0.09375  -0.1016  -0.1094  -0.1172
A 1010:  -0.125  -0.1406  -0.1562  -0.1719  -0.1875  -0.2031  -0.2188  -0.2344  -0.25  -0.2812  -0.3125  -0.3438  -0.375  -0.4062  -0.4375  -0.4688
B 1011:  -0.5  -0.5625  -0.625  -0.6875  -0.75  -0.8125  -0.875  -0.9375  -1  -1.125  -1.25  -1.375  -1.5  -1.625  -1.75  -1.875
C 1100:  -2  -2.25  -2.5  -2.75  -3  -3.25  -3.5  -3.75  -4  -4.5  -5  -5.5  -6  -6.5  -7  -7.5
D 1101:  -8  -9  -10  -11  -12  -13  -14  -15  -16  -18  -20  -22  -24  -26  -28  -30
E 1110:  -32  -36  -40  -44  -48  -52  -56  -60  -64  -72  -80  -88  -96  -104  -112  -120
F 1111:  -128  -144  -160  -176  -192  -208  -224  -240  -256  -288  -320  -352  -384  -416  -448  -NaN

FLOAT8E4M3FNUZ

This format has a sign bit + 4 bits of exponent + 3 bits of mantissa. It is similar to the previous one, but it supports only one 0, dedicating what would otherwise mean -0 to represent NaN, hence its name contains "FN" (finite) and "UZ" (unsigned zero). It also has a larger exponent bias (8 instead of 7), so it supports a smaller (closer to zero) minimum value at the expense of a lower maximum value: the smallest positive subnormal improves from 2^-9 ≈ 0.001953 to 2^-10 ≈ 0.0009766, while the largest finite value drops from 448 to 240.

         0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
         0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101  1110  1111
0 0000:  0  0.0009766  0.001953  0.00293  0.003906  0.004883  0.005859  0.006836  0.007812  0.008789  0.009766  0.01074  0.01172  0.0127  0.01367  0.01465
1 0001:  0.01562  0.01758  0.01953  0.02148  0.02344  0.02539  0.02734  0.0293  0.03125  0.03516  0.03906  0.04297  0.04688  0.05078  0.05469  0.05859
2 0010:  0.0625  0.07031  0.07812  0.08594  0.09375  0.1016  0.1094  0.1172  0.125  0.1406  0.1562  0.1719  0.1875  0.2031  0.2188  0.2344
3 0011:  0.25  0.2812  0.3125  0.3438  0.375  0.4062  0.4375  0.4688  0.5  0.5625  0.625  0.6875  0.75  0.8125  0.875  0.9375
4 0100:  1  1.125  1.25  1.375  1.5  1.625  1.75  1.875  2  2.25  2.5  2.75  3  3.25  3.5  3.75
5 0101:  4  4.5  5  5.5  6  6.5  7  7.5  8  9  10  11  12  13  14  15
6 0110:  16  18  20  22  24  26  28  30  32  36  40  44  48  52  56  60
7 0111:  64  72  80  88  96  104  112  120  128  144  160  176  192  208  224  240
8 1000:  NaN  -0.0009766  -0.001953  -0.00293  -0.003906  -0.004883  -0.005859  -0.006836  -0.007812  -0.008789  -0.009766  -0.01074  -0.01172  -0.0127  -0.01367  -0.01465
9 1001:  -0.01562  -0.01758  -0.01953  -0.02148  -0.02344  -0.02539  -0.02734  -0.0293  -0.03125  -0.03516  -0.03906  -0.04297  -0.04688  -0.05078  -0.05469  -0.05859
A 1010:  -0.0625  -0.07031  -0.07812  -0.08594  -0.09375  -0.1016  -0.1094  -0.1172  -0.125  -0.1406  -0.1562  -0.1719  -0.1875  -0.2031  -0.2188  -0.2344
B 1011:  -0.25  -0.2812  -0.3125  -0.3438  -0.375  -0.4062  -0.4375  -0.4688  -0.5  -0.5625  -0.625  -0.6875  -0.75  -0.8125  -0.875  -0.9375
C 1100:  -1  -1.125  -1.25  -1.375  -1.5  -1.625  -1.75  -1.875  -2  -2.25  -2.5  -2.75  -3  -3.25  -3.5  -3.75
D 1101:  -4  -4.5  -5  -5.5  -6  -6.5  -7  -7.5  -8  -9  -10  -11  -12  -13  -14  -15
E 1110:  -16  -18  -20  -22  -24  -26  -28  -30  -32  -36  -40  -44  -48  -52  -56  -60
F 1111:  -64  -72  -80  -88  -96  -104  -112  -120  -128  -144  -160  -176  -192  -208  -224  -240

FLOAT8E5M2

This format has a sign bit + 5 bits of exponent + only 2 bits of mantissa. It fully complies with the IEEE floating-point standard, so it supports +0 and -0, NaNs, and infinities.
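
Because E5M2 uses the same field layout as an IEEE half-float, just with the mantissa shortened from 10 bits to 2, a FLOAT8E5M2 byte can be decoded by placing it in the top byte of an fp16 bit pattern. A small NumPy sketch of this observation:

    import numpy as np

    def e5m2_to_float(byte):
        # s EEEEE MM -> s EEEEE MM00000000, a valid fp16 bit pattern.
        bits = np.array([byte << 8], dtype=np.uint16)
        return float(bits.view(np.float16)[0])

    print(e5m2_to_float(0x7B))  # 57344.0, the maximum finite value
    print(e5m2_to_float(0x7C))  # inf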

         0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
         0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101  1110  1111
0 0000:  +0  0.0000153  0.0000305  0.0000458  0.000061  0.0000763  0.0000916  0.0001068  0.0001221  0.0001526  0.0001831  0.0002136  0.0002441  0.0003052  0.0003662  0.0004272
1 0001:  0.0004883  0.0006104  0.0007324  0.0008545  0.0009766  0.001221  0.001465  0.001709  0.001953  0.002441  0.00293  0.003418  0.003906  0.004883  0.005859  0.006836
2 0010:  0.007812  0.009766  0.01172  0.01367  0.01562  0.01953  0.02344  0.02734  0.03125  0.03906  0.04688  0.05469  0.0625  0.07812  0.09375  0.1094
3 0011:  0.125  0.1562  0.1875  0.2188  0.25  0.3125  0.375  0.4375  0.5  0.625  0.75  0.875  1  1.25  1.5  1.75
4 0100:  2  2.5  3  3.5  4  5  6  7  8  10  12  14  16  20  24  28
5 0101:  32  40  48  56  64  80  96  112  128  160  192  224  256  320  384  448
6 0110:  512  640  768  896  1024  1280  1536  1792  2048  2560  3072  3584  4096  5120  6144  7168
7 0111:  8192  10240  12288  14336  16384  20480  24576  28672  32768  40960  49152  57344  +INF  +NaN  +NaN  +NaN
8 1000:  -0  -0.0000153  -0.0000305  -0.0000458  -0.000061  -0.0000763  -0.0000916  -0.0001068  -0.0001221  -0.0001526  -0.0001831  -0.0002136  -0.0002441  -0.0003052  -0.0003662  -0.0004272
9 1001:  -0.0004883  -0.0006104  -0.0007324  -0.0008545  -0.0009766  -0.001221  -0.001465  -0.001709  -0.001953  -0.002441  -0.00293  -0.003418  -0.003906  -0.004883  -0.005859  -0.006836
A 1010:  -0.007812  -0.009766  -0.01172  -0.01367  -0.01562  -0.01953  -0.02344  -0.02734  -0.03125  -0.03906  -0.04688  -0.05469  -0.0625  -0.07812  -0.09375  -0.1094
B 1011:  -0.125  -0.1562  -0.1875  -0.2188  -0.25  -0.3125  -0.375  -0.4375  -0.5  -0.625  -0.75  -0.875  -1  -1.25  -1.5  -1.75
C 1100:  -2  -2.5  -3  -3.5  -4  -5  -6  -7  -8  -10  -12  -14  -16  -20  -24  -28
D 1101:  -32  -40  -48  -56  -64  -80  -96  -112  -128  -160  -192  -224  -256  -320  -384  -448
E 1110:  -512  -640  -768  -896  -1024  -1280  -1536  -1792  -2048  -2560  -3072  -3584  -4096  -5120  -6144  -7168
F 1111:  -8192  -10240  -12288  -14336  -16384  -20480  -24576  -28672  -32768  -40960  -49152  -57344  -INF  -NaN  -NaN  -NaN

FLOAT8E5M2FNUZ

This format has a sign bit + 5 bits of exponent + 2 bits of mantissa. Similarly to FLOAT8E4M3FNUZ, it has no infinities, has only one zero, and dedicates the bit pattern 0b10000000 to NaN.

         0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
         0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101  1110  1111
0 0000:  0  0.0000076  0.0000153  0.0000229  0.0000305  0.0000381  0.0000458  0.0000534  0.000061  0.0000763  0.0000916  0.0001068  0.0001221  0.0001526  0.0001831  0.0002136
1 0001:  0.0002441  0.0003052  0.0003662  0.0004272  0.0004883  0.0006104  0.0007324  0.0008545  0.0009766  0.001221  0.001465  0.001709  0.001953  0.002441  0.00293  0.003418
2 0010:  0.003906  0.004883  0.005859  0.006836  0.007812  0.009766  0.01172  0.01367  0.01562  0.01953  0.02344  0.02734  0.03125  0.03906  0.04688  0.05469
3 0011:  0.0625  0.07812  0.09375  0.1094  0.125  0.1562  0.1875  0.2188  0.25  0.3125  0.375  0.4375  0.5  0.625  0.75  0.875
4 0100:  1  1.25  1.5  1.75  2  2.5  3  3.5  4  5  6  7  8  10  12  14
5 0101:  16  20  24  28  32  40  48  56  64  80  96  112  128  160  192  224
6 0110:  256  320  384  448  512  640  768  896  1024  1280  1536  1792  2048  2560  3072  3584
7 0111:  4096  5120  6144  7168  8192  10240  12288  14336  16384  20480  24576  28672  32768  40960  49152  57344
8 1000:  NaN  -0.0000076  -0.0000153  -0.0000229  -0.0000305  -0.0000381  -0.0000458  -0.0000534  -0.000061  -0.0000763  -0.0000916  -0.0001068  -0.0001221  -0.0001526  -0.0001831  -0.0002136
9 1001:  -0.0002441  -0.0003052  -0.0003662  -0.0004272  -0.0004883  -0.0006104  -0.0007324  -0.0008545  -0.0009766  -0.001221  -0.001465  -0.001709  -0.001953  -0.002441  -0.00293  -0.003418
A 1010:  -0.003906  -0.004883  -0.005859  -0.006836  -0.007812  -0.009766  -0.01172  -0.01367  -0.01562  -0.01953  -0.02344  -0.02734  -0.03125  -0.03906  -0.04688  -0.05469
B 1011:  -0.0625  -0.07812  -0.09375  -0.1094  -0.125  -0.1562  -0.1875  -0.2188  -0.25  -0.3125  -0.375  -0.4375  -0.5  -0.625  -0.75  -0.875
C 1100:  -1  -1.25  -1.5  -1.75  -2  -2.5  -3  -3.5  -4  -5  -6  -7  -8  -10  -12  -14
D 1101:  -16  -20  -24  -28  -32  -40  -48  -56  -64  -80  -96  -112  -128  -160  -192  -224
E 1110:  -256  -320  -384  -448  -512  -640  -768  -896  -1024  -1280  -1536  -1792  -2048  -2560  -3072  -3584
F 1111:  -4096  -5120  -6144  -7168  -8192  -10240  -12288  -14336  -16384  -20480  -24576  -28672  -32768  -40960  -49152  -57344

Legend

+0         Saturated yellow - zero
0.01562    Yellow - small numbers, close to zero
448        Green - larger positive numbers
-448       Red - larger negative numbers
0.001953   Gray & italic - denormalized number (no implicit "1.")
12         Blue & bold - consecutive integer numbers (smaller numbers support fractions, larger ones start jumping every 2, 4, 8, ...)
NaN        Teal - not a number (NaN)
+INF       Saturated green - positive infinity
-INF       Saturated red - negative infinity

Capabilities and special values

                                    FLOAT8E4M3FN  FLOAT8E4M3FNUZ  FLOAT8E5M2  FLOAT8E5M2FNUZ
Minimum positive subnormal          0.001953      0.0009766       0.0000153   0.0000076
Minimum positive normal             0.01562       0.007812        0.000061    0.0000305
Next value after 1                  1.125         1.125           1.25        1.25
Maximum integer represented exactly 16            16              8           8
Maximum                             448           240             57344       57344
Exponent bias                       7             8               15          16
0                                   S 0000 000    0 0000 000      S 00000 00  0 00000 00
NaN                                 S 1111 111    1 0000 000      S 11111 MM  1 00000 00
INF                                 -             -               S 11111 00  -

(For FLOAT8E5M2, the NaN mantissa MM must be non-zero; MM = 00 with an all-ones exponent means INF.)
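
Assuming the decode_fp8 sketch from earlier in this article, these limits can be cross-checked programmatically:

    for name, params in FORMATS.items():
        finite = [v for b in range(256)
                  if isinstance(v := decode_fp8(b, **params), float)]
        print(name, "max:", max(finite),
              "min positive:", min(v for v in finite if v > 0))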

Summary

FP8 data types can represent only 256 possible values, so it is practical to show them all in a table. I hope these tables are useful for learning about these data types and understanding them better, for those who deal with machine learning workloads at such a low level.

If you want to see the Python script that I developed to generate these colorful tables or play around with it yourself, you can find it on my GitHub: "fp8_tables.py" on github.com/sawickiap/MISC.

Adam Sawicki
Article version 1.1, updated 2024-09-28
