Entries for tag "math", ordered from most recent. Entry count: 62.

Warning! Some information on this page is older than 6 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

# Myths About Floating-Point Numbers

Wed

17

Mar 2021

Floating-point numbers are a great invention in computer science, but they can also be tricky and troublesome to use correctly. I’ve written about them already by publishing Floating-Point Formats Cheatsheet and presentation “Pitfalls of floating-point numbers” (“Pułapki liczb zmiennoprzecinkowych” – the slides are in Polish). Last year I was preparing for a more extensive talk about this topic, but it got cancelled, like pretty much everything in these hard times of the COVID-19 pandemic. So in this post, I would like to approach this topic from a different angle.

A programmer can use floating-point numbers on different levels of understanding. A beginner would use them, trusting they are infinitely capable and precise, which can lead to problems. An intermediate programmer knows that they have some limitations, and so by using some good practices the problems can be avoided. An advanced programmer understands what is really going on inside these numbers and can use them with a full awareness of what to expect from them. This post may help you jump from step 2 to step 3. Commonly adopted good practices are called “myths” here, but they are actually just generalizations and simplifications. They can be useful for avoiding errors, unless you understand what is true and what is false about them on a deeper level.

**1. They are not exact**

It is not true that 2.0 + 2.0 can give 3.99999. It will always be 4.0. They are exact to the extent of their limited range and precision. If you assign a floating-point number some constant value, you can safely compare it with the same value later, even using the discouraged operator `==`

, as long as it is not a result of some calculations. Imprecisions don't come out of nowhere.

Instead of using integer loop iterator and converting it to float every time:

for(size_t i = 0; i < count; ++i)

{

float f = (float)i;

// Use f

}

You can do this, which will result in a much more efficient code:

for(float f = 0.f; f < (float)count; f += 1.f)

{

// Use f

}

It is true, however, that your numbers may not look exactly as expected because:

- Some fractions cannot be represented exactly – even some simple ones like decimal 0.1, which is binary 0.0001101… This is because we humans normally use decimal system, while floating-point numbers, like other numbers inside computers, use binary system – a different base.
- There is a limited range of integer numbers that can be represented exactly. For 32-bit floats it is only 16,777,216. Above that, numbers start “jumping” every 2, then every 4, etc. So it is not a good idea to use floating-point numbers to represent file sizes if your files are bigger than 16 MB. If
`count`

in the example above was >16M, it would cause an infinite loop.

64-bit “double”, however, represents integers exactly up to 9,007,199,254,740,992, so it should be enough for most applications. No wonder that some scripting languages do just fine while supporting only “double” floating-point numbers and no integers at all.

**2. They are non-deterministic**

It is not true that cosmic radiation will flip the least significant bit at random. Random number generators are also not involved. If you call the same function with your floating-point calculations with same input, you will get the same output. It is fully deterministic, like other computing. (Note: When old FPU instructions are generated rather than new SSE, this can be really non-deterministic and even a task switch may alter your numbers. See this tweet.)

It is true, however, that you may observe different results because:

- Compiler optimizations can influence the result. If you implement two versions of your formula, similar but not exactly the same, the compiler may, for example, optimize (a * b + c) from doing MUL + ADD to FMA (fused multiply-add) instruction, which does the 3-argument operation in one step. FMA has higher precision, but can then give a different result than two separate instructions.
- You may observe different results on different platforms – e.g. AMD vs Intel CPU or AMD vs NVIDIA GPU. This is because floating-point standard (IEEE 754) defines only required precision of operations like sin, cos, etc., so the exact result may vary on the least significant bit.

I heard a story of a developer who tried to calculate hashes from the results of his floating-point calculations in a distributed system and discovered that records with what was supposed to be same data had different hashes on different machines.

I once had to investigate a user complaint about a following piece of shader code (in GLSL language). The user said that on AMD graphics cards for uv.x higher than 306 it always returns black color (zero).

vec4 fragColor = vec4(vec3(fract(sin(uv.x * 2300.0 * 12000.0))), 1.0);

I noticed that the value passed to sine function is very high. For uv.x = 306 it is 27,600,000. If we recall from math classes that sine cycles between -1 and 1 every 2*PI ≈ 6.283185 and we take into consideration that above 16,777,216 a 32-bit float cannot represent all integer numbers exactly, but start jumping every 2, then every 4 etc., we can conclude that we have not enough precision to know whether our result should be -1, 1, or anything in between. It is just undefined.

I then asked the user what is he trying to achieve with this code, as the result is totally random. He said it is indeed suppposed to be... a random number generator. The problem is that the result being always 0 is as valid as any other. The reason random numbers are generated on NVIDIA cards and not on AMD is that sine instruction on AMD GPU architectures actually has period of 1, not 2*PI. But it is still fully deterministic in regards to input value. It just returns different results between different platforms.

**3. NaN and INF are indication of an error**

It is true that if you don’t expect them, their appearance may indicate an error, either in your formulas or in input data (e.g. numbers very large, very small and close to zero, or just garbage binary data). It is also true that they can cause trouble as they propagate through calculations, e.g. every operation with NaN returns NaN.

However, it is not true that these special values are just a means of returning error or that they are not useful. They are perfectly valid special cases of the floating-point representation and have clearly defined behavior. For example, -INF is smaller and +INF is larger than any finite number. You can use this property to implement following function with a clearly documented interface:

#include <limits> // Finds and returns maximum number from given array.

// For empty array returns -INF.

float CalculateMax(const float* a, size_t count)

{

float max = -std::numeric_limits<float>::infinity();

for(size_t i = 0; i < count; ++i)

if(a[i] > max)

max = a[i];

return max;

}

**Summary**

As you can see, common beliefs about floating-point numbers - that they are not exact, non-deterministic, or that NaN and INF are an indication of an error, are some generalizations and simplifications that can help to avoid errors, but they don’t tell the full story. To really understand what's going on on a deeper level:

- Keep in mind which values in your program are just input data or constants and which are results of some calculations.
- Know the capabilities and limitations of floating point types - their maximum range, minimum possible number, precision in terms of binary or decimal places, maximum interger represented exactly etc.
- Learn about how floating point numbers are stored, bit by bit.
- Learn about special values - INF, NaN, positive and negative zero, denormals. Understand how they behave in computations.
- Take a look at assembly generated by the compiler to see how CPU or GPU really operates on your numbers.

Update 2021-06-09: This article has been published as a guest post on C++ Stories and spawned an interesting discussion on Reddit that is worth reading.

# Bezier Curve as Easing Function

Sat

24

Oct 2020

Bézier curves are named after Pierre Bézier, and primary used is geometry modeling. They are good at describing various shapes in 2D and 3D. A Bézier curve is a function x(t), y(t) - it gives points in space (x, y) for some parameter t = 0..1. But nowadays they are also used in computer graphics for animation, as easing functions. There, we need to evaluate y(x), because x is the time parameter and y is the evaluated variable.

How does the formula of a Bézier curve look like as y(x)? What constraints do the 4 control points need to meet for this function to be correct - to have only one value of y for each x, with no loops or arcs? Finally, how can this function be approximated to store it in computer memory and evaluate it efficiently in modern game engines? These sound like fundamental questions, but apparently no one researched this topic thoroughly before, so it became the subject of the Ph.D. thesis of my friend Łukasz Izdebski.

A part of his research has just been published as paper "Bézier Curve as a Generalization of the Easing Function in Computer Animation" in Advances in Computer Graphics, 37th Computer Graphics International Conference, CGI 2020, Geneva, Switzerland. We want to share an excerpt of his findings online as an article: Bezier Curve as Easing Function.

Comments | #math #rendering Share

# Improving the quality of the alpha test (cutout) materials

Fri

24

Apr 2020

*This is a guest post from my friend Łukasz Izdebski Ph.D.*

Today I want to share with you a trick which my collage from previews work mentioned to me a long time ago. It's about alpha tested (also known as cutout) materials. This technique which I want to share with you consists of two neat tricks that can improve the quality of alpha tested (cutout) materials.

Alpha test is an old technique used in computer graphics. The idea behind it is very simple. In a very basic form, a material (shader) of a rendered object can discard processed pixels based on the alpha channel of RGBA texture. When shaded pixel’s final alpha value is less than this threshold value (threshold value is constant for the instance of the material and a typical value is 50%), it is clipped (discarded) and will not land in the shaders output framebuffer. These types of materials are commonly used to render vegetation, fences, impostors/billboards, etc.

Alpha tested materials have some, I will say a little issue. It can be noticed when rendered object (with this material) is far away from the camera. Let the following video below be an example of this issue.

Comments | #rendering #math Share

# How to Correctly Interpolate Vertex Attributes on a Parallelogram Using Modern GPUs?

Thu

13

Feb 2020

*This is probably first guest post ever on my blog. It has been written by my friend Łukasz Izdebski Ph.D.*

In a nutshell, today’s graphics cards render meshes using only triangles, as shown on the picture below.

I don’t want to describe how it is done (a lot of information on this topic can be easily found on the Internet) and describe the whole graphics pipeline, but for a short recap: In the first programmable part of the rendering pipeline, vertex shader receives a single vertex (with assigned data about it that I will later refer to) and outputs one vertex which is transformed by the shader. Usually it is a transformation from 3D coordinate system to Normalized Device Coordinates (NDC) (information about NDC can be found in Coordinate Systems). After primitive clipping, perspective divide, and viewport transform, vertices are projected on the 2D screen, which is drawn to the monitor output.

But this is only half the story. I was describing what is happening with vertices, but at the beginning I mentioned triangles, with three vertices forming one triangle. After vertex processing, Primitive Assembly follows, then the next stage is Rasterization. This is very important because in this stage the generation of fragments (pixels) happens, which are lying inside the triangle, as shown on the picture below.

How are colors of the pixels inside the triangle generated? Each vertex can contain not only one set of data - 3D coordinates in the virtual world, but also additional data called attributes. Those attributes can be color of the vertex, normal vector, texture coordinate etc.

How then those rainbow colors are generated as shown on the picture above, while as I said only one color can be set on three vertices of the triangle? The answer is interpolation. As we can read in Wikipedia, interpolation in mathematics is a type of estimation, a method that can be used to generate new data points between a discrete set of known data points. In the described problem it’s about generating interpolated colors inside the rendered triangle.

The way to achieve this is by using barycentric coordinates. (Those coordinates not only can be used for interpolation but additionally define if a fragment lies inside the triangle or not. More on this topic can be read in Rasterization: a Practical Implementation). In short, barycentric coordinate system is a set of three points λ_{1}, λ_{2}, λ_{3} ≥ 0 (for convex polygons) where λ_{1} + λ_{2} + λ_{3} = 1. When a triangle is rasterized, for every fragment of the triangle proper barycentric coordinates are calculated. Then the color of the fragment C can be calculated by weighted sum of colors at each vertex C_{1}, C_{2}, C_{3}, and weights are of course barycentric coordinates: C = C_{1} * λ_{1} + C_{2} * λ_{2} + C_{3} * λ_{3}

This way not only color can be interpolated, but any vertex attribute. (This functionality can be disabled by using proper Interpolation Qualifier on an attribute in vertex shader source code, like `flat`

).

When dealing with triangle meshes, this way of attribute interpolation gives correct results, but when rendering 2D sprites or font glyphs, some disadvantages may occur under specific circumstances. When we want to render a gradient which starts in one of the corners of a sprite (see the picture below), we can see quite ugly results :(

This is happening because interpolation occurs on two triangles independent of each other. Graphics cards can work only on triangles, not quads. In this case we want the interpolation to occur on a quad not triangle, as pictured below:

How to trick the graphics card and force quadrilateral attributes interpolation? One way to render such type of gradients is by using tessellation – subdivide quad geometry. Tessellation shader is available starting from DirectX 11, OpenGL 4.0, and Vulkan 1.0. A simple example of how it will look like depending on different parameters of tessellation (more details about tessellation can be found in Tessellation Stages) can be seen in the animated picture below.

As we can see, when the quad is subdivided to more than 16 pieces, we are getting the desired visual result, but as my high school teacher used to say: “Don’t shoot a fly with a cannon”, and this solution for rendering such a simple thing by using tessellation is an overkill.

This way I developed a new technique that will be helpful to achieve this goal. First, we need to get access to the barycentric coordinates in the fragment shader. DirectX 12 HLSL gives us those coordinates when using SV_Barycentrics. In Vulkan those coordinates are available in AMD VK_AMD_shader_explicit_vertex_parameter extension and Nvidia VK_NV_fragment_shader_barycentric extension. Maybe in the near future it will be available in the core spec of Vulkan and more importantly from all hardware vendors.

If we are not fortunate to have those coordinates as built-in functions, we can generate them by adding some additional data: one new vertex attribute and one new uniform (constant) value. Here are the details of this solution. Consider a quadrilateral built from four vertices and two triangles as shown in the picture below.

Additional attribute `Barycentric`

in vertices is a 2D vector and should contain following values:

A = (1,0) B = (0,0) C = (0,1) D = (0,0)

Next step is to calculate extra constant data for the parameter to be interpolated as shown in the picture above (in this case color attribute), using the equation:

ExtraColorData = - ColorAtVertexA + ColorAtVertexB - ColorAtVertexC + ColorAtVertexD

The fragment shader that renders the interpolation we are looking for looks like this:

///////////////////////////////////////////////////////////////////// //GLSL Fragment Shader ///////////////////////////////////////////////////////////////////// #version 450 #extension GL_ARB_separate_shader_objects : enable layout(binding = 0) uniform CONSTANT_BUFFER { vec4 ExtraColorData; } cbuffer; in block { vec4 Color; vec2 Barycentric; } PSInput; layout(location = 0) out vec4 SV_TARGET; void main() { SV_TARGET = PSInput.Color + PSInput.Barycentric.x * PSInput.Barycentric.y * cbuffer.ExtraColorData; } ///////////////////////////////////////////////////////////////////// //HLSL Pixel Shader ///////////////////////////////////////////////////////////////////// cbuffer CONSTANT_BUFFER : register(b0) { float4 ExtraColorData; }; struct PSInput { float4 color : COLOR; float2 barycentric : TEXCOORD0; }; float4 PSMain(PSInput input) : SV_TARGET { return input.color + input.barycentric.x * input.barycentric.y * ExtraColorData; }

That’s all. As we can see, it should not have a big performance overhead. When barycentric coordinates will be more available, then memory overhead will also be minimal.

Probably the reader will ask the question **if this is looking good in the 3D perspective scenario, not only when a triangle is parallel to the screen?**

As shown in the picture above, hardware properly interpolates data using additional computation as describe here (Rasterization: a Practical Implementation), so this new method works with perspective as it should.

**Does this method give proper results only on squares?**

This is a good question! The solution described above works on all parallelograms. I’m now working on a solution for all convex quadrilaterals.

**What else can we use this method for?**

One more usage comes to my mind: a post-process fullscreen quad. As I mentioned earlier, graphics cards do not render quads, but triangles. To simulate proper interpolation of attributes, 3D engines render one BIG triangle which covers the whole screen. With this new approach, rendering quad built from two triangles can be available and attributes which are needed to be quadrilateral interpolated can be calculated in the way shown above.

Comments | #math #rendering Share

# Operations on power of two numbers

Sun

09

Sep 2018

Numbers that are powers of two (i.e. 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and so on...) are especially important in programming, due to the way computers work - they operate on binary representation. Sometimes there is a need to ensure that certain number is power of two. For example, it might be important for size and alignment of some memory blocks. This property simplifies operations on such quantities - they can be manipulated using bitwise operations instead of arithmetic ones.

In this post I'd like to present efficient algorithms for 3 common operations on power-of-2 numbers, in C++. I do it just to gather them in one place, because they can be easily found in many other places all around the Internet. These operations can be implemented using other algorithms as well. Most obvious implementation would involve a loop over bits, but that would give O(n) time complexity relative to the number of bits in operand type. Following algorithms use clever bit tricks to be more efficient. They have constant or logarithmic time and they don't use any flow control.

**1. Check if a number is a power of two.** Examples:

IsPow2(0) == true (!!) IsPow2(1) == true IsPow2(2) == true IsPow2(3) == false IsPow2(4) == true IsPow2(123) == false IsPow2(128) == true IsPow2(129) == false

This one I know off the top of my head. The trick here is based on an observation that a number is power of two when its binary representation has exactly one bit set, e.g. 128 = 0b10000000. If you decrement it, all less significant bits become set: 127 = 0b1111111. Bitwise AND checks if these two numbers have no bits set in common.

template <typename T> bool IsPow2(T x) { return (x & (x-1)) == 0; }

**2. Find smallest power of two greater or equal to given number.** Examples:

NextPow2(0) == 0 NextPow2(1) == 1 NextPow2(2) == 2 NextPow2(3) == 4 NextPow2(4) == 4 NextPow2(123) == 128 NextPow2(128) == 128 NextPow2(129) == 256

This one I had in my library for a long time.

uint32_t NextPow2(uint32_t v) { v--; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v++; return v; } uint64_t NextPow2(uint64_t v) { v--; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v |= v >> 32; v++; return v; }

**3. Find largest power of two less or equal to given number.** Examples:

PrevPow2(0) == 0 PrevPow2(1) == 1 PrevPow2(2) == 2 PrevPow2(3) == 2 PrevPow2(4) == 4 PrevPow2(123) == 64 PrevPow2(128) == 128 PrevPow2(129) == 128

I needed this one just recently and it took me a while to find it on Google. Finally, I found it in this post on StackOveflow.

uint32_t PrevPow2(uint32_t v) { v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v = v ^ (v >> 1); return v; } uint64_t PrevPow2(uint64_t v) { v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v |= v >> 32; v = v ^ (v >> 1); return v; }

**Update 2018-09-10:** As I've been notified on Twitter, C++20 is also getting such functions as standard header <bit>.

Comments | #math #c++ #algorithms Share

# Pitfalls of Floating-Point Numbers - Slides

Sat

24

Sep 2016

Just as I announced in my previous blog post, today I gave a lecture on a "Kariera IT" event - organized by CareerCon, dedicated to IT jobs.

Here you can find slides from my presentation, in Polish. It's called **"Pułapki liczb zmiennoprzecinkowych"** ("Pitfalls of floating-point numbers").

Here are links to the Floating-Point Formats Cheatsheet (in English) that I mentioned in my presentation:

Comments | #events #teaching #math Share

# Watch out for reduced precision normalize/length in OpenGL ES

Mon

27

Apr 2015

GLSL language for OpenGL ES introduces concept of precision. You can annotate variable declaration (both float and int/uint) with a precision qualifier: `highp`

, `mediump`

or `lowp`

, like:

mediump float a = 3.0;

You can also specify default precision qualifier by using precision statement. Language specification defines minimum required range and precision for each precision qualifier.

`highp`

basically means normal, single-precision, 32-bit float (IEEE 754), as we know it from CPU programming.`mediump`

is said to have have range of at least -2^14 ... 2^14 and relative precision 2^-10, so it can be, for example, implemented using a 16-bit, half-precision float.`lowp`

is said to have range at least -2 ... 2 and absolute precision 2^-8, so basically it can be stored as a 10-bit, fixed-point number.

GPU vendors are free to use more precise data types, or even full 32-bit float for all of them. What exact precision is used depends on specific GPU and maybe even operating system or graphics driver version. Using smaller data types can occupy less memory, calculate faster and consume less battery power. But it comes at the price of reduced precision and range of these numbers. Tom Olson wrote interesting articles about this: "Benchmarking floating-point precision in mobile GPUs": Part I, Part II, Part III.

In this post I'd like to warn you against a specific problem related to it - usage of `length()`

, `normalize()`

and `distance()`

functions. Using smaller data types not only limits precision in terms of number of significant digits, but also available range (over which the value will saturate to -INF/+INF). For mediump, this range is defined as +/-2^14, which is only 16384.

This may still look like a lot, but let's remember that calculating vector length involves intermediate value that it sum of squares of this components. This can grow very big before a square root is applied. For example, for 3D vector:

length(a) = sqrt(a.x*a.x + a.y*a.y + a.z*a.z)

If you do this operation on a mediump vector, the term `a.x*a.x + a.y*a.y + a.z*a.z`

can exceed maximum value for vector as small as (74.0, 74.0, 74.0). It can be very dangerous if you do something like this in your fragment shader:

precision mediump float;

uniform vec3 light_pos;

...

void main()

{

...

vec3 dir_to_light = normalize(pos - light_pos);

// Calculate your lighting and so on...

You might ask: Why isn't this intermediate value stored in high precision before taking its square root to avoid this overflow problem? Obviously it could be, as precision in any place of the shader is free to be higher than the minimum allowed in that place, so some GPU vendors can do it this way, but you shouldn't rely on this. GLSL specification clearly says that the shader is free to use same, reduced precision for intermediate values.

The precision used to internally evaluate an operation, and the precision qualification subsequently associated with any resulting intermediate values, must be at least as high as the highest precision qualification of the operands consumed by the operation.

Conclusion is: When you write shaders for OpenGL ES, watch out for operations that involve calculating vector length (or dot product) like `length()`

, `normalize()`

, `distance()`

, use `highp`

precision for vectors involved and remember that what works on one GPU due to using precision higher than minimum required, may not work on another GPU and it’s still an application issue.

Comments | #math #opengl Share

# Floating-Point Formats Cheatsheet

Thu

13

Jun 2013

Floating-point numbers (or floats in short) are not as simple as integer numbers. There is much to be understood when dealing with these numbers on low level - basic things like the sign + exponent + significand representation (and that exponent is biased, while significand has implicit leading 1), why you should never compare calculation results operator ==, that some fractions with finite decimal representation cannot be represented exactly in binary etc., as well as why there are two zeros -0 and +1, what are infinite, NaN (Not a Number) and denorm (denormal numbers) and how they behave. I won't describe it here. It's not an arcane knowledge - you can find many information about this on the Web, starting from Wikipedia article.

But after you understand these concepts, quantitative questions come to mind, like: how many significant decimal digits can we expect from precision of particular float representation (half, single, double)? What is the minimum non-zero value representable in that format? What range of integers can we represent exactly? What is the maximum value? And finally: if our game crashes with "Access violation, reading location 0x3f800000", what chances are that we mistaken pointer for a float number, as this is one of common values, meaning 1.0?

So to organize such knowledge, I created a "Floating-Point Formats" cheatsheet:

Copyright © 2004-2023