The Secrets of Floating-Point Numbers


This article was originally published in Polish in issue 5/2024 (115) (November/December 2024) of the Programista magazine.

In this article, I will discuss floating-point numbers compliant with the IEEE 754 standard, which are available in most programming languages. I will describe their structure, capabilities, and limitations. I will also address the common belief that these numbers are inaccurate or nondeterministic. Furthermore, I will highlight many non-obvious pitfalls that await developers who use them. This knowledge can be useful to anyone, regardless of which language or platform you program for!

Floating-point numbers are a brilliant invention. The data types of this kind available in programming languages allow you to store positive and negative numbers, integers and fractions, very small (close to zero) and very large values — all within a relatively small number of bits. Since computers were originally used primarily by scientists for calculations, the history of different methods of encoding numbers is as old as computer science itself. The IEEE 754 standard, which underpins modern floating-point numbers, was introduced in 1985. All the data types we’ll discuss in this article are based on this standard.

As one learns programming and gains experience with a chosen programming language, understanding of floating-point numbers typically progresses through three stages: Initially, the programmer uses such numbers without giving them much thought, assuming they represent arbitrary real numbers. However, they soon run into difficulties and encounter various related errors. This leads to a moment of reflection on the limitations of such numbers. By applying some basic “rules of limited trust” to floating-point numbers, many errors can be avoided. These rules include:

  • “Floating-point numbers are inaccurate.”
  • “Floating-point numbers are nondeterministic.”
  • “Floating-point numbers should not be compared using the == operator.”

However, it’s worth digging deeper into the structure and behavior of these data types. With better understanding, they can be used more consciously, errors can be avoided or minimized, and at the same time, their full potential can be harnessed. This article aims to introduce the reader to the secrets of floating-point numbers, revealing various pitfalls and non-obvious phenomena associated with them.

Basics

The way floating-point numbers work is similar to the scientific notation taught in school physics classes. While the numbers we use in everyday life — like when counting money — can be written plainly, for example as 12,300, in physics we often deal with very large or very small numbers. For instance, the distance from Earth to the Sun is about 150 million kilometers. Instead of writing 150,000,000, it’s often more convenient to write it as 1.5 × 10^8. This form is a kind of “normalized” notation, where only the first digit appears before the decimal point, and the exponent of 10 is used to shift the decimal point to the right (increasing the numerical value) when the exponent is positive, or to the left (decreasing the value) when it is negative.

Floating-point numbers work in a similar way to this scientific notation, but with binary numbers instead of decimal ones. In computer memory, their structure is a bit more complex than that of integers. Integer types simply use successive bit combinations to represent successive values: 0, 1, 2, … In contrast, floating-point numbers divide the bit sequence into three parts. An example is shown in Figure 1. The data format used here for illustration is a 32-bit floating-point number called “single precision,” which allocates 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.

Figure 1. Decoding a floating-point number

Starting from the least significant bits, the mantissa (also called the significand, shown here in green) encodes the digits after the decimal point. The number is normalized so that a single leading 1 appears before the decimal point. Therefore, this leading 1 can be implied and omitted from memory storage, effectively gaining one extra bit of precision. (The exception is denormalized numbers, which do not have this implicit leading one. We will return to these when discussing special values.)

A number encoded this way lies in the range from 1 to 2. To increase or decrease it by shifting the decimal point, the next part of the encoding holds the exponent (shown here in red), representing a power of two by which the number is multiplied. Using 8 bits interpreted as an integer, the exponent would range from 0 to 255. However, a bias is applied — in this 32-bit format, the bias is 127. This bias is subtracted from the stored exponent value to allow encoding of both positive and negative exponents. Exponent values consisting entirely of zeros or ones are reserved for special values, so the usable exponent range is [1…254] – 127 = [-126…127].

Finally, the most significant bit is the sign bit (shown in blue), which allows for encoding negative numbers. The factor written in Figure 1 as (-1)^s is a clever way of expressing a simple rule: this bit answers the question, “Is there a minus sign?” When the bit is 0, there’s no minus — the number is positive. When the bit is set to 1, we add a minus sign and the number becomes negative.

Note that, since the sign is represented by a separate bit, floating-point numbers are symmetric with respect to zero. To negate a number, you simply flip its most significant bit. To compute the absolute value, just clear that bit. This works differently than with signed integers, which use two’s complement representation. In signed integers, the most significant bit is also set when the number is negative, but negating the value is not as straightforward.

This structure of floating-point numbers leads to another interesting property: by changing only the sign bit, we can obtain two different representations of zero, known as +0 and -0. These values do actually exist and can carry certain information — for example, whether a number that became zero after a calculation was originally positive or negative. However, since these two values compare as equal, we almost never need to worry about them in practice.
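
To make this structure more tangible, the following minimal C sketch takes a single-precision value apart into its three bit fields and reconstructs the value from them (only normalized numbers are handled; zero, denormals, infinity, and NaN are ignored):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));           // reinterpret the bits safely

    uint32_t sign     = bits >> 31;            // 1 bit
    uint32_t exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127
    uint32_t mantissa = bits & 0x7FFFFF;       // 23 bits, implicit leading 1

    // Reconstruct the value: (-1)^s * 1.mantissa * 2^(exponent - 127)
    double value = (sign ? -1.0 : 1.0) *
                   (1.0 + mantissa / 8388608.0) *   // 8388608 = 2^23
                   pow(2.0, (int)exponent - 127);
    printf("sign=%u exponent=%u mantissa=%u value=%g\n",
           (unsigned)sign, (unsigned)exponent, (unsigned)mantissa, value);
    // Prints: sign=1 exponent=129 mantissa=4718592 value=-6.25
    return 0;
}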

Hardware Support

Despite their complex internal structure, floating-point numbers wouldn't be very useful in programming if computations on them weren’t supported in hardware, allowing for fast execution. In the past, this support was provided by a separate, optional mathematical coprocessor — the Floating Point Unit (FPU) — such as the 8087 chip in early PCs. Today, however, both CPUs and GPUs in modern computers have built-in support for floating-point operations.

As for hardware-supported data types, the most common are 32-bit single-precision and 64-bit double-precision floating-point numbers. Their capabilities are shown in Table 1, and an extended version can be found online in [1]. Their availability and naming may vary across programming languages. For example, C and C++ define the types float and double. Their exact widths aren't strictly defined, but on most platforms, they correspond to the mentioned 32-bit and 64-bit formats. Some scripting languages (including JavaScript and Lua) offer only a single numeric type, implemented using the double format, which may even be used to represent integers. The naming of floating-point types in various programming languages is summarized in Table 2. Some languages also offer types with even greater precision, such as long double in C/C++ or decimal in C#, which are not shown in the table.

Table 1. Capabilities of 32-bit and 64-bit floating-point numbers

Property | Single precision | Double precision
Bits: total = sign + exponent + mantissa | 32 = 1 + 8 + 23 | 64 = 1 + 11 + 52
Exponent: bias, range | 127, -126…127 | 1023, -1022…1023
Precision: significant decimal digits | 7.22 (6…9) | 15.95 (15…17)
Smallest denormalized number | 2^-149 ≈ 1.40 × 10^-45 | 2^-1074 ≈ 4.94 × 10^-324
Smallest normalized number | 2^-126 ≈ 1.18 × 10^-38 | 2^-1022 ≈ 2.23 × 10^-308
Next value after 1 | 1 + 2^-23 ≈ 1.00000012 | 1 + 2^-52 ≈ 1.00000000000000022
Integer numbers represented exactly | 0…2^24 = 16,777,216 | 0…2^53 = 9,007,199,254,740,992
Maximum | (2 − 2^-23) × 2^127 ≈ 3.40 × 10^38 | (2 − 2^-52) × 2^1023 ≈ 1.80 × 10^308

Table 2. Floating-point types in programming languages

Language | 32-bit | 64-bit
C, C++ * | float | double
C# | float | double
Java | float | double
Delphi / Object Pascal | Single | Double
Python * | - | float
JavaScript | - | Number
Lua ** | - | number

* Typically, depending on the implementation.
** The only supported numeric type.

Are Floating-Point Numbers Inaccurate?

A common belief is that floating-point numbers are inaccurate and therefore should never be compared using the == operator. Following this rule can help avoid many bugs, but it's a simplification. For example, the expression 2.0 + 2.0 will always return 4.0 — never 3.99999. Let's take a closer look at floating-point precision to better understand the topic.

It’s obvious that floating-point numbers are only approximations of real numbers as defined in mathematics. Since they are encoded using a fixed number of bits, they inherently have finite precision. When we talk about precision, we mean the smallest differences in value that a given data type can represent and distinguish. For integers, the concept is straightforward: the precision is always 1 across the entire range, so after the value 5, the next representable value is 6, then 7, and so on.

One can also imagine a format that supports fractional values but is fixed-point — for example, using the higher 8 bits to encode the digits before the decimal point, and the lower 8 bits to encode the digits after it. Such fixed-point formats are not natively supported by typical computer processors, meaning that computations would have to be emulated in software, resulting in slower performance. As a result, they’re rarely used. However, it’s helpful to imagine such a format to see that its precision would be exactly 2^-8 = 1/256 ≈ 0.0039 — again, constant throughout its entire range.
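
As a small illustration, the hypothetical 8.8 fixed-point format described above boils down to scaling by 2^8; the sketch below is just that, not a standard library type:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double  value   = 3.1415926;
    int16_t fixed   = (int16_t)(value * 256.0 + 0.5); // encode: scale by 2^8 and round
    double  decoded = fixed / 256.0;                  // decode: divide by 2^8
    printf("stored=%d decoded=%g\n", fixed, decoded); // the precision is always 1/256
    // Prints: stored=804 decoded=3.14062
    return 0;
}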

In contrast to integers or fixed-point numbers, floating-point numbers have precision that depends on the magnitude of the value. This is symbolically illustrated in Figure 2. The next representable values, marked on the number line, are packed more densely near zero and become increasingly sparse as the absolute value grows. With each increase in the exponent, the precision is halved.

Figure 2. Floating-point numbers on the number line

So, it's more accurate to speak of precision in terms of the number of significant digits. The number of binary digits of precision available is, of course, determined by the number of bits in the mantissa. In decimal terms, this translates to roughly 6-7 significant digits for the float type and 15-16 digits for double (according to Table 1). For example, if the value of a 32-bit number is 10.5, the next representable number is about 10.500001, giving a precision of about 1/1,000,000. However, if the value is larger — say 32,000 — then the spacing between representable values grows to about 0.002. An example in C demonstrating this issue is shown in Listing 1.

Listing 1. Demonstration of finite precision

float x = 1111.111111111111f;
printf("%.16g\n", x);
// Result: 1111.111083984375

It’s worth reading the previous two paragraphs again and making sure to fully understand them, because they imply many important consequences — limitations of floating-point numbers and pitfalls waiting for unsuspecting developers. Let’s address those now.
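
One more way to observe this behavior directly is the standard C function nextafterf, which returns the next representable float in a given direction; a short sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%.9g\n", nextafterf(1.0f,     2.0f));     // ≈ 1.00000012
    printf("%.9g\n", nextafterf(10.5f,    20.0f));    // ≈ 10.500001
    printf("%.9g\n", nextafterf(32000.0f, 40000.0f)); // ≈ 32000.002
    return 0;
}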

Representing Integers

First, it’s worth asking how well floating-point numbers can represent integers. It turns out they do this quite well, but only within a limited range. If we ignore the fact that they also support fractions, and we stick solely to integer values — performing only addition, subtraction, multiplication, and rounding the result of division up or down to the nearest integer — then our results will be exact. We can rely on them, compare them with the == operator, and so on. No wonder some scripting languages use 64-bit floating-point as the only available numeric type, even for storing integers.

However, because a floating-point number allocates a fixed number of bits for the sign, exponent, and mantissa, its ability to accurately store integer values is limited compared to an integer type. For example, a signed 32-bit int can hold values up to 2,147,483,647 (over 2 billion). A 32-bit float can represent integers exactly only up to 2^24 = 16,777,216 (over 16 million). Above that, the loss of precision becomes so great that representable values begin to "skip" — first by 2, then by 4, then 8, and so on.

This behavior may be acceptable in many use cases but rules out using floating-point types wherever unit-level precision is critical, such as storing file sizes in bytes, or tracking bank account balances. However, a 64-bit double represents every integer up to 2^53 = 9,007,199,254,740,992 exactly, which is more than enough for most use cases — unless you’re dealing with files larger than 8 petabytes (PB).
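
This limit is easy to demonstrate in a few lines of C: above 2^24, a float can no longer distinguish neighboring integers, so adding 1 has no effect:

#include <stdio.h>

int main(void)
{
    float f = 16777216.0f;      // 2^24, the last point where the spacing is still 1
    printf("%.1f\n", f + 1.0f); // prints 16777216.0: the +1 is lost
    printf("%.1f\n", f + 2.0f); // prints 16777218.0: steps of 2 still work
    return 0;
}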

Finite Precision

Secondly, finite precision means that some numbers simply cannot be represented exactly. For mathematical constants like π (pi) and e, or fractions such as 1/3 = 0.333333…, we already accept that we use approximations, since these values have infinite decimal expansions. However, it turns out that some fractions with finite decimal representations cannot be accurately represented in binary form. A good example is 1/10 = 0.1. Although 0.1 has a finite decimal representation, in binary, it results in a repeating fraction — just like how 1/3 repeats in decimal. Therefore, the value 0.1 cannot be stored exactly in a floating-point number. Listing 2 presents a simple C code snippet that calculates the value 0.1 in two ways and then compares them.

Listing 2. Problem with floating-point number precision

Bad version ❌

float a = 1.0f / 10.0f;
float b = 1.0f - 0.9f;
printf("a=%g, b=%g\n", a, b);
// Result: a=0.1, b=0.1
if (a == b)
    printf("They are equal.\n");
else
    printf("They are not equal!\n"); // <--

Good version ✔

float a = 1.0f / 10.0f;
float b = 1.0f - 0.9f;
printf("a=%g, b=%g\n", a, b);
// Result: a=0.1, b=0.1
if (fabsf(b - a) < 0.000001)
    printf("They are equal.\n"); // <--
else
    printf("They are not equal!\n");

It turns out that even though both variables print as the value 0.1, this program still prints: “They are not equal!” This happens because the two values, although seemingly the same, differ slightly in the less significant bits after the decimal point. If we inspect their binary representation and print them with higher precision, we get:

  • a = 0x3DCCCCCD ≈ 0.10000000149
  • b = 0x3DCCCCD0 ≈ 0.10000002384

The conclusion from this example is that floating-point numbers truly have limited precision, and we should not rely on their accuracy down to the last bit — especially for the results of calculations. However, for constant values directly assigned to variables, we can be confident that the value stored will match the constant as expected — unless it's modified through computation.

So yes, it's true that you should not compare floating-point results using the == operator, as it only returns true if the numbers are exactly identical bit by bit. Instead, a common and safer approach is to check whether the numbers are "close enough" — by calculating the absolute difference using an absolute-value function such as fabsf, and checking whether it’s smaller than a tiny value frequently called ε (epsilon). This is demonstrated in the second version of the code in Listing 2, which correctly prints: “They are equal.”

An even more advanced method is offered in Python, using the function math.isclose, where you can specify two tolerance values: abs_tol – absolute tolerance and rel_tol – relative tolerance. The underlying logic checks whether two values a and b are close enough using this formula:

abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
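
A similar combined check can be sketched in C as well; the helper name IsClose below is made up for illustration, not a standard function:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

// Passes if a and b are within abs_tol of each other, or within rel_tol
// relative to the larger of their magnitudes.
static bool IsClose(double a, double b, double rel_tol, double abs_tol)
{
    return fabs(a - b) <= fmax(rel_tol * fmax(fabs(a), fabs(b)), abs_tol);
}

int main(void)
{
    printf("%d\n", IsClose(1.0 - 0.9, 0.1, 1e-9, 0.0)); // prints 1
    return 0;
}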

Catastrophic Cancellation

Thirdly, comparing or subtracting two floating-point numbers of large magnitude can result in a small difference between them being represented inaccurately or completely lost. This phenomenon is known as catastrophic cancellation.

As an example, consider the code in Listing 3, which measures the duration of an operation. On various platforms, we typically have access to a function that returns the number of milliseconds, nanoseconds, or some number of CPU cycles since a particular moment in time (e.g. since system startup or a fixed date). These functions often return a 64-bit integer. If we want to display the elapsed time in seconds, we also need to query the system for the frequency of this timer (i.e. how many ticks per second), and divide the tick count by that frequency. It’s important that this time flow is constant — not affected by the current CPU frequency. The resulting values in seconds are often fractional, so using floating-point numbers seems like a natural choice.

Listing 3. Code measuring the duration of an operation

Bad version ❌

LARGE_INTEGER freq, t;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t);
double begin = (double)t.QuadPart / (double)freq.QuadPart;
LongOperation();
QueryPerformanceCounter(&t);
double end = (double)t.QuadPart / (double)freq.QuadPart;
double duration = end - begin;
printf("Duration in seconds: %g\n", duration);

Good version ✔

LARGE_INTEGER freq, begin, end;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&begin);
LongOperation();
QueryPerformanceCounter(&end);
double duration = (double)(end.QuadPart - begin.QuadPart) /
    (double)freq.QuadPart;
printf("Duration in seconds: %g\n", duration);

However, there’s a hidden danger in the first version of the code. If the value returned by the timer is large, and the duration of the measured operation is small, then subtracting these two values as floating-point numbers can result in a very imprecise difference, or even cause the result to be zero every time.

The solution is to subtract the starting time from the ending time while they’re still integers, which are precise to the exact cycle. Then, you convert only the difference to a floating-point number. This difference is already small, so precision is preserved. The corrected version of the code is shown in the second version of Listing 3.
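
The underlying effect can be reproduced without any timer at all; in the following sketch, a small value added to a large float disappears completely when the large value is subtracted again:

#include <stdio.h>

int main(void)
{
    float big   = 100000000.0f;          // 1e8: neighboring floats here are 8 apart
    float small = 0.5f;
    printf("%g\n", (big + small) - big); // prints 0: the 0.5 has vanished
    return 0;
}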

Are These Numbers Non-Deterministic?

Another common belief is that floating-point numbers are non-deterministic, meaning we shouldn’t expect the same result every time. Following such a “rule of limited trust” also helps avoid many bugs; however, it's worth examining this topic in more detail. As it turns out, there’s no hidden random number generator operating behind the scenes here. In general, we can say that the same program, running on the same platform with the same input data, should produce the same result. However, there are a few non-obvious factors that can affect the outcome.

First of all, results may differ across platforms. The IEEE 754 standard defines the format of floating-point numbers and the operations on them, but it doesn’t guarantee the exact same result for every operation. For addition or multiplication of values known to be representable exactly, we can expect an accurate result. But more complex operations — especially transcendental functions like sine and cosine — may yield results that differ in the least significant digit (ULP – unit in the last place) across platforms. That means they won't be bitwise identical, and a comparison using the == operator will return false. Such differences can depend on the manufacturer of the processor or graphics chip (e.g., AMD, Intel, Nvidia) or the generation of the hardware — and yet, these results are still valid and compliant with the standard.

Secondly, various optimizations and transformations performed on our code by the compiler can result in different instructions being executed or in a different order, which can affect the result. The traditional floating-point instruction set on the x86 architecture, known as x87 (the original FPU), may store intermediate computation results in higher, 80-bit extended precision. While this might seem advantageous, in practice it can cause various issues if we rely too heavily on exact outcomes of computations.

A similar issue may occur with newer floating-point instruction sets. For example, many CPU and GPU architectures include an instruction called Fused Multiply-Add (FMA) or Fused Multiply-Accumulate (FMAC), which performs a three-argument operation like d = a * b + c in a single step. Logically, this is two operations — multiplication followed by addition — but they’re often executed as a single fused operation. This has a huge impact on performance, because matrix multiplication — which underpins both computer graphics and deep learning — is composed of exactly such multiply-add sequences.

In addition to improved performance, such an instruction often computes the result more accurately than performing the two operations separately, because it rounds the result only once. This can lead to a situation where a given mathematical expression returns one result when the compiler recognizes and optimizes it into a fused FMA instruction, and a different result when the same expression appears elsewhere in the program and is executed as a separate multiplication followed by an addition.
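
This difference can be reproduced with the standard C99 fma function; in the sketch below, the inputs are chosen so that the doubly rounded result and the fused result differ (the exact outcome also depends on whether the compiler contracts the plain expression into an FMA on its own):

#include <math.h>
#include <stdio.h>

int main(void)
{
    // Chosen so that a * b is not exactly representable in double.
    double a = 1.0 + 0x1p-27;   // 1 + 2^-27
    double b = 1.0 - 0x1p-27;   // 1 - 2^-27
    double c = -1.0;

    double separate = a * b + c;    // multiplication rounded, then addition rounded
    double fused    = fma(a, b, c); // one fused operation, rounded only once

    printf("separate = %g\n", separate); // typically 0
    printf("fused    = %g\n", fused);    // exactly -2^-54 ≈ -5.55e-17
    return 0;
}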

Many compilers offer a feature called Fast Math that influences how floating-point calculations are optimized. For example, in C++ compilers, this is enabled with /fp:fast in Visual Studio and -ffast-math in GCC and Clang. When enabled, it allows the compiler to perform additional transformations that can significantly speed up code. Although these transformations are mathematically valid, they may break compliance with the IEEE 754 standard and produce results that differ from the original, unoptimized version of the expression. The main trade-off is between speed and strict numerical accuracy.

It might seem that enabling this flag is desirable wherever numerical accuracy isn't critical and performance is the top priority — such as in graphics programming. However, even among game developers and creators of graphics or physics engines — who are typically laser-focused on maximizing performance — there’s a widespread belief that Fast Math should be avoided. While it does accelerate code to some extent, it also introduces a host of potential issues. More on this can be found in [2].

On the flip side, disabling Fast Math tells the compiler to preserve the exact order and structure of floating-point operations as written in the source code. This prevents the compiler from applying even the simplest algebraic transformations — ones that a school student might make on paper using basic identities. As a result, it's often up to the programmer to manually simplify and optimize mathematical expressions. A straightforward example is rewriting a polynomial like d = ax^2 + bx + c in an optimized form, as shown in Listing 4:

Listing 4. Optimization of a polynomial

// Slowest version
d = a*pow(x, 2.0) + b*x + c;
// Faster version
d = a*x*x + b*x + c;
// Fastest version among presented
d = ((a*x) + b)*x + c;

Finally, and perhaps most alarmingly, the floating-point unit in the processor can be configured with different operating modes that act as a global state for a given thread. These modes can also influence calculation results. For example, for the traditional FPU unit, these modes are set using the _controlfp function, while for the newer SSE vector instructions, the behavior is affected by flags in the MXCSR register, set via the _mm_setcsr function. This means that an external library we use could change these modes for our program without us realizing it.

To summarize, given the many factors that can influence calculation results, it’s safer to assume that these results are nondeterministic and not expect exact, consistent, or repeatable values every time. However, these differences do not come out of nowhere. By deeply understanding these issues and carefully preparing the code, it is possible to achieve fully deterministic floating-point calculations — even when writing a physics engine that runs on multiple platforms and in multithreaded environments, as demonstrated by the creators of the Box2D library [3].

Special Values

We mentioned earlier that an exponent consisting entirely of zeros or ones indicates values that are treated in a special way. Let's now take a closer look at them. The types of special floating-point values are shown in Table 3. These special values apply regardless of the specific floating-point type being used — that is, regardless of how many bits the exponent or mantissa has. We also won’t address the sign bit here, assuming that each of these values can exist in both positive and negative forms.

Table 3. Types of special floating-point values

Exponent | Mantissa | Meaning | Representation
000…0 | 000…0 | Zero | 0
000…0 | Other value | Denormalized number | A numerical value
111…1 | 000…0 | Infinity | inf, Infinity
111…1 | Other value | Not a Number | NaN, IND

We’ve already discussed the first two. When both the exponent and mantissa consist entirely of zeros, this encodes the value zero. There are two such values depending on the sign bit — namely, -0 and +0 — but since they compare as equal, we usually don’t need to worry about the difference.

Denormalized numbers represent a special mode in which the implicit leading 1 before the decimal point is not assumed during decoding. This allows encoding values even smaller than the smallest normalized numbers. Fortunately, this is handled automatically by the hardware, and programmers rarely need to deal with it directly.

When the exponent consists entirely of ones and the mantissa is all zeros, the value represents infinity (∞). Depending on the sign bit, this can be positive or negative infinity. When converted to a string, it typically appears as "inf" or "Infinity". This value may result from computations that exceed the range of representable numbers — such as when a value grows too large through repeated addition or multiplication, or (more commonly) when dividing by a very small number (close to zero) or by zero itself.

The final category of special values is when the exponent is all ones and the mantissa is non-zero. This encodes a Not a Number (NaN), which indicates an invalid value or a computational error. When converted to a string, it is typically displayed as "NaN" or, more rarely, as "IND" (short for indeterminate).

A distinctive property of NaN is its propagation through expressions. Almost any operation that takes NaN as input will return NaN as output. This means that once an invalid value enters the computation pipeline, subsequent results remain invalid. This situation is illustrated in Figure 3. It is the programmer’s responsibility to trace back through input data or intermediate expressions to determine where the first invalid value appeared. Common causes include negative, zero, or extremely small values where none are expected — or reading garbage data from memory or files and treating it as floating-point numbers.

Figure 3. NaN value propagating through computations

NaN values also exhibit unusual behavior during comparison. Comparing NaN to any other number always returns false — even when compared to itself! Because of this, the condition if (a == a) is the simplest way to check whether a variable holds a numeric value or a NaN. However, standard libraries in many programming languages offer dedicated functions for this purpose. For example, in C, the <math.h> header defines the functions isinf, isnan, and isfinite. The last one returns true if the given value is neither infinity nor NaN.
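
A short sketch of these checks in C:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = sqrtf(-1.0f);            // an invalid operation produces NaN
    float zero = 0.0f;
    printf("%d\n", a == a);            // 0: NaN is not equal even to itself
    printf("%d\n", isnan(a));          // 1
    printf("%d\n", isinf(1.0f / zero)); // 1: division by zero gives infinity
    printf("%d\n", isfinite(2.0f));    // 1: an ordinary, finite number
    return 0;
}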

It’s often assumed that the appearance of INF or NaN in computations signals a bug that needs to be identified and fixed. In most cases, this is indeed true. However, these special values are produced and handled in a way that is mathematically logical and consistent with established rules. Some of these rules are illustrated by the example expressions and their results shown in Listing 5.

Listing 5. The behavior of INF and NaN values

 2 / 0    ==  inf
-2 / 0    == -inf
 0 / 0    ==  NaN
log ( 0)  == -inf
log (-2)  ==  -NaN
sqrt(-2)  ==  -NaN
 2 + inf  ==  inf
 2 * inf  ==  inf
-2 * inf  == -inf
 0 * inf  ==  NaN
 2 / inf  ==  0
 0 / inf  ==  0
inf + inf ==  inf
inf - inf ==  -NaN
inf * inf ==  inf
inf / inf ==  -NaN
 2 + NaN  ==  NaN
 2 * NaN  ==  NaN
 2 / NaN  ==  NaN
 inf > 2  ==  true
-inf < 2  ==  true
(  2 == NaN) == false
(NaN == NaN) == false

For example, it’s worth noting that:

  • Division by zero returns ∞, and dividing a negative number by zero returns -∞.
  • The logarithm of zero yields -∞, but the logarithm of a negative number, as well as the square root of a negative number, results in NaN.
  • Attempting to further increase ∞ by adding or multiplying it with another number still yields ∞, but multiplying it by a negative number returns -∞.
  • In comparisons, -∞ is less than any finite number, and +∞ is greater than any finite number.

Understanding these rules allows us to intentionally make use of special values in our programs. Listing 6 shows a C++ function that finds the maximum value in a given array of float numbers. A temporary variable is used to store the largest value found so far. It is initialized to -∞ at the beginning, which guarantees that any finite number will be greater than it. This implementation has an interesting property: it guarantees a defined behavior when the input array is empty (count == 0) — it returns -∞. This behavior could be documented as part of the function’s specification.

Listing 6. Finding maximum in an array

#include <cstddef>  // size_t
#include <limits>   // std::numeric_limits

float FindMaxValue(const float* arr, size_t count)
{
    float maxVal = -std::numeric_limits<float>::infinity();
    for(size_t i = 0; i < count; ++i)
        if(arr[i] > maxVal)
            maxVal = arr[i];
    return maxVal;
}

Finally, it's worth revisiting the issue of division by zero to emphasize one important point. Thanks to the fact that dividing floating-point numbers by zero returns infinity, we have a chance to detect this error in the program and handle it in some way. It's a different story with integers — attempting to divide by zero with integer types results in a critical error that terminates the entire program, so such a division must never be allowed to occur.
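
The contrast can be sketched as follows (the floating-point part assumes IEEE 754 semantics; the commented-out integer division is undefined behavior in C and would typically terminate the program):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 1.0f, y = 0.0f;
    float q = x / y;                    // no crash: the result is +inf
    if (!isfinite(q))
        printf("Invalid result detected: q = %g\n", q);

    // int i = 1, j = 0;
    // int k = i / j;  // undefined behavior: typically terminates the program
    return 0;
}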

Smaller Formats

In the examples throughout this article, we’ve used 32-bit single-precision floating-point numbers and discussed their limitations. We also mentioned 64-bit double-precision numbers. However, for certain applications, even smaller formats can be suitable. One such example is the 16-bit “half-precision” format (also known as half-float or fp16), which allocates 5 bits for the exponent and 10 bits for the mantissa.

This type of data, of course, has very limited precision (approximately 3 significant decimal digits) and a constrained range (with a maximum value of 65,504, beyond which it overflows to infinity). Despite this, the type is used in graphics and is supported in hardware by some GPUs. If we consider that RGB color components of pixels are traditionally stored using 8 bits with values ranging from 0 to 255, the range and precision of fp16 values turn out to be more than sufficient for representing colors — even on displays with HDR support.
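
To illustrate the layout (1 sign bit, 5 exponent bits with a bias of 15, 10 mantissa bits), here is a minimal decoding sketch for normalized fp16 values; zero, denormals, infinity, and NaN are again left out:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Decode a normalized half-precision (fp16) bit pattern into a float.
static float DecodeHalf(uint16_t h)
{
    uint32_t sign     = h >> 15;          // 1 bit
    uint32_t exponent = (h >> 10) & 0x1F; // 5 bits, bias 15
    uint32_t mantissa = h & 0x3FF;        // 10 bits, implicit leading 1
    return (sign ? -1.0f : 1.0f) *
           (1.0f + mantissa / 1024.0f) *  // 1024 = 2^10
           powf(2.0f, (int)exponent - 15);
}

int main(void)
{
    printf("%g\n", DecodeHalf(0x4248)); // 3.140625, the closest fp16 value to pi
    printf("%g\n", DecodeHalf(0x7BFF)); // 65504, the largest finite fp16 value
    return 0;
}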

Using smaller data formats wherever their capabilities are sufficient can save memory and accelerate computations. For instance, a processing unit capable of executing an operation on a vector of four 32-bit floating-point numbers in a single cycle might, using the same time and transistor resources, perform the same operation on eight 16-bit numbers packed into a register of the same width. Memory savings are especially important for data transfer, since in many modern algorithms it is memory bandwidth — not raw computation speed — that becomes the limiting factor for performance.

Algorithms in artificial intelligence, particularly deep learning and the associated neural networks, have an even more regular computational structure and process even larger volumes of data than graphics applications. Researchers in the field have demonstrated in academic papers that various types of neural network models can operate effectively with lower-precision arithmetic. In fact, for some operations (especially training, which involves storing gradients), having a wider dynamic range is more important than precision. As a result, a second 16-bit format called bf16 (brain float) was introduced. It allocates 8 bits for the exponent and 7 bits for the mantissa, prioritizing range over precision.

In AI computations, values are sometimes quantized into integer types (e.g., int8, int4, or even… 1-bit numbers). However, floating-point formats as small as 8 bits have also been proposed. These include fp8 and bf8, which respectively use 4 bits for the exponent and 3 for the mantissa (E4M3), or 5 bits for the exponent and only 2 for the mantissa (E5M2). More information about these can be found in [4]. Hardware support for such types is currently not widespread, except in specialized chips designed specifically for AI workloads.

Summary

Floating-point numbers are used in programming wherever we need to represent not just integers, but also fractions, extremely small, or very large values. These data types go by different names in different programming languages, but they are typically based on the same IEEE 754 standard formats, with hardware support provided by CPUs and GPUs.

Understanding how floating-point numbers are structured and how they work — their capabilities and limitations — allows us to use them more deliberately and avoid the various pitfalls that can trap beginners or inattentive developers. In this article, we’ve explored many of these aspects: limited precision, nondeterministic results, supported special values (INF, NaN), and how these behave.

There’s still much more to explore. We haven’t touched on computational performance, for instance. Even when hardware provides instructions for computing various functions, some operations are faster than others. Basic operations like addition, subtraction, and multiplication are typically the fastest. In contrast, transcendental functions such as sine, cosine, power, square root, logarithm — and even division — may require more CPU cycles to complete. We also didn’t cover vector instruction sets available in modern processors — MMX, SSE, AVX — that allow the same operation to be applied to multiple numbers simultaneously, significantly accelerating computations.

Denormalized numbers, which allow for extremely small values, can also be slower on some platforms. Some architectures offer modes that disable denormalized number support entirely, replacing such values with zero (a behavior known as flush to zero). Similarly, many platforms support different rounding modes — toward negative infinity, toward positive infinity, toward zero, or to the nearest value with ties rounded to even. And finally, we didn’t mention the distinction between quiet NaNs and signaling NaNs.

Bibliography

Adam Sawicki
November 2024

This article has been translated from Polish with the help of the free version of ChatGPT.

