# First Look at New D3D12 Enhanced Barriers

Dec 2021

This will be a fairly advanced, or at least intermediate, article. It assumes you know the Direct3D 12 API. Some references to Vulkan may also appear. I am writing it because I just found out that yesterday Microsoft announced an upcoming big change in D3D12: Enhanced Barriers. It will be an addition to the API that provides a new way to do barriers. Considering my professional interests, this looks very important to me and also quite revolutionary. This article summarizes my first look and my thoughts about this new addition to the API or, speaking in terms of the modern internet, my "unboxing" or "reaction" ;)

Bill Kristiansen, the author of the article linked above, wrote that currently only the software-simulated WARP device supports the new enhanced barriers. Support in real GPU drivers will come at a later time. The new barriers can replace the old way of doing them, but both will still be available and can also be mixed in one application. This means it is not such a big revolution as to turn our DirectX development upside down - we can switch to them gradually. For now we can just prepare ourselves for the future by studying the interface (which I do in this article) and testing some code using the WARP device.

UPDATE 2021-12-10: I just learned that Microsoft actually did publish documentation of the new API: Enhanced Barriers @ DirectX-Specs, so I recommend reading it before reading this article.


# Understanding Graphs in GPUView and RGP

Nov 2021

When optimizing performance of a game or some other program, the most important thing is to get hard data first – to profile it using some tools, to see what is happening and where to focus attention. There are many profiling tools available. When talking about graphics, we realize that GPU is really a co-processor that can execute submitted work at its own pace, therefore GPU profiling tools offer a specific type of graph to visualize it. In this article, I will explain how to read this type of graph.

Let's take Radeon GPU Profiler (RGP) as an example. This program is available for free and is compatible with AMD graphics cards. It can capture data from programs that use Direct3D 12 or Vulkan. When we open a capture file and go to Overview > Frame summary tab, we can see a graph like this one:

It may look scary at first glance, but don't worry and stay with me. I will explain everything step-by-step. I don't know if there is any name for this type of graph, so let's call it a "queue graph" because it shows a queue of tasks submitted to the graphics card and executed by it.

The horizontal axis is time, flowing to the right at a constant pace. The vertical axis is the queue, with the front of the queue on the bottom and items enqueued later stacked on top.

At each point in time, the item on the bottom row is the one currently executing on the GPU. Everything above this row is waiting for its turn. It means that from the graph we can see and measure when a certain piece of work (like the D3D12 ExecuteCommandLists call in this example) was enqueued, when it started executing, and how long it took to execute. The width of the bottom block represents the amount of time that was required to execute it. Note that the work item going “down the stairs” has no meaning in itself. It just means something in front of it finished, so the queue ahead is shorter. Only when it ends up in the bottom row does it really start executing.
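The reading rule above can be sketched in a few lines of C++. This is only an illustration of the graph's logic, assuming a single queue executed strictly in order; `WorkItem` and `SimulateQueue` are hypothetical names, not part of RGP, GPUView, or D3D12.

```cpp
#include <algorithm>
#include <vector>

// One work item as shown on the graph (hypothetical structure).
struct WorkItem {
    double enqueueTime; // when the item appears on top of the queue
    double duration;    // how long the GPU needs to execute it
    double startTime;   // computed: when it reaches the bottom row
    double endTime;     // computed: when it disappears from the graph
};

// Items must be sorted by enqueueTime. The GPU executes them one at a time,
// so an item starts only when it has been enqueued AND everything in front
// of it has finished.
void SimulateQueue(std::vector<WorkItem>& items) {
    double gpuFreeAt = 0.0;
    for (WorkItem& item : items) {
        item.startTime = std::max(item.enqueueTime, gpuFreeAt);
        item.endTime = item.startTime + item.duration;
        gpuFreeAt = item.endTime;
    }
}
```

For an item enqueued at time 1 behind a 5-unit item enqueued at time 0, `startTime` comes out as 5: the item spends 4 units "going down the stairs" before it reaches the bottom row.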

Another thing to note is that some items wait in the queue but don't take any significant time to execute. These are simple and quick commands, like the green call to the Signal function marked here. When everything in front of it completes, it also completes in no time.

We can make more observations from this graph if we consider the fact that games work with frames. Each frame executes commands to draw the whole image, from clearing the background through 3D objects to UI, and finishes with a call to the Present function, marked here in brown. By looking for this type of item, we can tell where a new frame begins. For example, at point "A" the GPU is still executing commands of frame N, while all commands for the next frame N+1 are already enqueued, including its Present, and the commands for frame N+2 are stacking up at the end of the queue. Thus, we can expect the game to have 2 frames of latency in displaying the image.

The same type of graph is used by GPUView - a free tool from Microsoft that can record and display what is happening in the system on a very low level. (The linked article is very old - right now the way to install the tool is to grab Windows Assessment and Deployment Kit (Windows ADK) and a convenient UI for it is UIforETW). As you can see here, both "3D Hardware Queue" of my graphics card and software "Device Context" of a running game show packets of work submitted for rendering.

One important piece of information that we can extract from this graph is that GPU is not busy 100% of the time. GPUView actually shows the number on the right, which is 77.89% for the current view. It means the game is not GPU-bound. Reducing graphics quality settings would not increase framerate (FPS). This often happens when the game does some heavy computations on the CPU or when it reaches 60 FPS and we have V-sync enabled. Here we have the latter case, as we can see moments of vertical synchronization marked as blue lines, while rendering each frame seems to be blocked until that moment.

Note that the graph described here is not the same as flame graphs or flame charts, which show a hierarchy of nested things (for example, a call stack of function calls), not a queue.


# My Favorite Windows Apps in 2021

Oct 2021

The last time I showed the list of my favorite apps for the PC was in May 2009 - 12 years ago, so maybe it's time to post a new one :) If you know a better alternative to any of these programs, please post a comment below.


Before I start with my list, I would like to stress how much the landscape of PC applications changed throughout these years. Back then, many kinds of programs (e.g. for video editing) were very expensive and had no good and free alternative. Among simpler apps, "shareware" was still a thing, so these also required a small fee (or downloading a crack :) Today, we have many excellent programs available for free. All the programs I list below are free unless explicitly mentioned.

With free programs, we have to be careful though. Some of them are free only for non-commercial use, so they shouldn't be installed on a machine provided by your employer and used for work. Examples are HWiNFO or FastStone Image Viewer. The ones licensed under GNU GPL can be freely installed and used for any purpose. This has nothing to do with the availability of the source code (we won't download the code and compile the program ourselves anyway): this free software/open source license also guarantees the freedom to use the program any way we want. With apps coming for free under a custom license (commonly referred to as "freeware") this is not necessarily the case, so to be fully compliant you should always check the license (and/or ask your IT department) before installing anything on a company laptop.

There is also a trap awaiting those who download and install new apps: many websites take free apps and repack them into their own installers, adding some malware, adware, or other unwanted software. They are often positioned higher in Google search results than the original app developer. To make sure you download the right installer, always go to the original website and not any of these app-aggregating portals. Also, be careful which "DOWNLOAD" button you click. An extreme example of a developer's greed is FileZilla, which is free software licensed under GPL, but the original website hosts an installer that "may include bundled offers" and hides the real installer for the app alone under a "Show additional download options" link.


# Creative Use of GPU Fixed-Function Hardware

Sep 2021

I recently broke my rule of posting on my blog at least once a month as I had some other topics and problems to handle in my life, but I'm still alive, still doing graphics programming for a living, so I hope to get back to blogging now. This post is more like a question rather than an answer. It is about creative use of GPU fixed-function hardware. Warning: It may be pretty difficult for beginners, full of graphics programming terms you should already know to understand it. But first, here is some background:

I remember the times when graphics cards were only configurable, not programmable. There were no shaders, only a set of parameters that could control pre-defined operations - transform of vertices, texturing and lighting of pixels. Then, shaders appeared. They evolved by supporting more instructions to be executed and a wider variety of instructions available. At some point, even before the invention of compute shaders, the term “general-purpose computing on GPU” (GPGPU) appeared. Developers started encoding some data as RGBA colors of texture pixels and drawing full-screen quads just to launch calculation of some non-graphical tasks, implemented as pixel shaders. Soon after, compute shaders appeared, so they no longer need to pretend anything - they can now spawn a set of threads that can just read and write memory freely through Direct3D unordered access views aka Vulkan storage images and buffers.

GPUs seem to become more universal over time, with more and more workloads done as compute shaders these days. Will we end up with some generic, highly parallel compute machines with no fixed-function hardware? I don’t know. But Nanite technology from the new Unreal Engine 5 makes a step in this direction by implementing its own rasterizer for some of its triangles, in form of a compute shader. I recommend a good article about it: “A Macro View of Nanite – The Code Corsair” (it seems the link is broken already - here is a copy on Wayback Machine Internet Archive). Apparently, for tiny triangles of around single pixel size, custom rasterization is faster than what GPUs provide by default.

But in the same article we can read that Epic also does something opposite in Nanite: they use some fixed-function parts of the graphics pipeline very creatively. When applying materials in screen space, they render a full-screen pass for each material, but instead of drawing just a full-screen triangle, they draw a regular triangle grid with quads covering tiles of NxN pixels. They then perform coarse-grained culling of these tiles in a vertex shader. In order to reject a tile, they output vertex position = NaN, which makes the triangle invalid, so it does not spawn any pixels. Then, more fine-grained culling is performed using the Z-test. The per-pixel material identifier is encoded as depth in a depth buffer! This can be fast, as modern GPUs apply “HiZ” - an internal optimization to reject whole groups of pixels that fail the Z-test even before their pixel shaders are launched.

This reminded me of another creative use of the graphics pipeline I observed in one game a few years ago. That pass was calculating a luminance histogram of a scene. They also rendered a regular grid of geometry in screen space, but with “point list” topology. Each vertex was sampling and calculating the average luminance of its region. On the other end, the histogram texture of Nx1 pixels was bound as a render target. The measured luminance of a region was returned as the vertex position, while incrementing the right place in the histogram was ensured using additive blending. I suspect this is not the most optimal way of doing this, and a compute shader using atomics could probably do it faster, but it surely was very creative and it took me some time to figure out what that pass was really doing and how it was doing it.
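To make the "atomics" alternative more concrete, here is a minimal CPU-side sketch in C++ of the same binning logic. It is only an analogy under my own assumptions: the loop stands in for GPU threads, `std::atomic` stands in for shader atomic adds, luminance values are assumed finite, and the names `LuminanceToBin` and `AccumulateHistogram` are hypothetical. It is not how the game in question or any particular engine does it.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps a finite luminance value to a bin index in [0, binCount - 1],
// clamping values outside [0, maxLuminance].
std::size_t LuminanceToBin(float luminance, float maxLuminance, std::size_t binCount) {
    float t = std::min(std::max(luminance / maxLuminance, 0.0f), 1.0f);
    std::size_t bin = static_cast<std::size_t>(t * static_cast<float>(binCount));
    return std::min(bin, binCount - 1);
}

// Each region increments one bin. With atomics, many threads could execute
// this increment concurrently, which is the job that additive blending did
// in the render-target version.
void AccumulateHistogram(const std::vector<float>& regionLuminance, float maxLuminance,
                         std::vector<std::atomic<std::uint32_t>>& histogram) {
    for (float luminance : regionLuminance) {
        std::size_t bin = LuminanceToBin(luminance, maxLuminance, histogram.size());
        histogram[bin].fetch_add(1, std::memory_order_relaxed);
    }
}
```

In the point-list version, the value returned by `LuminanceToBin` corresponds to the vertex position along the Nx1 render target, and `fetch_add` corresponds to blending a value of 1 into that pixel.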

After all, GPUs have many fixed-function elements next to their shader cores. Vertex fetch, texture sampling (with mip level calculation, trilinear and anisotropic filtering), tessellation, rasterization, blending, all kinds of primitive culling and pixel testing, even vertex homogeneous divide... Although not included in the calculation of TFLOPS power, these are real transistors with compute capabilities, just very specialized. Do you know any other smart, creative uses of them?


# Tips for Using Perforce

Jun 2021

Version Control Systems are tools that every programmer should use. Among them, Git is probably the most popular one. Some companies use Perforce instead. Whether it is better or worse is hard to tell, but it has advantages that make it indispensable in some types of projects, like game development. Perforce handles large binary files very well. Even if the files take tens or hundreds of gigabytes, it still works fine. I am talking about the size of one local copy here, not the entire repository on the server.

From a user’s perspective, Perforce differs greatly from Git or SVN. Not only are commands named differently (e.g. there is “Submit” instead of “Commit”), but the whole concept of “changelists” is something that needs to be well understood to be used efficiently. While working with Perforce for many years in different companies and projects, I learned some good practices that I would like to share here. Writing them down was difficult as they seem obvious to me, but hopefully some of them are not obvious to you, so you will learn something new.

1. Paste paths to address bar

Let’s start with a simple one. Perforce window has a text box on the top that resembles address bar in web browsers. It shows the path of the currently selected file or directory in Depot or Workspace tab. It can also accept input.

When you work on some file in another tool and you want to jump quickly to it in Perforce, e.g. to check it out, just copy the full path of the file to system clipboard and paste it in this “address bar”. Selection in Workspace tab will switch to it immediately.


# Intrusive Linked List in C++

May 2021

A doubly linked list is one of the most fundamental data structures. Each element contains, besides the value we want to store in this container, also a pointer to the previous and next element. This may not be the best choice for indexing the i-th element or even traversing all elements quickly (as they are scattered in memory, performance may suffer because of poor cache utilization), but inserting and removing an element at any place in the list is quick.

Source: Doubly linked list at Wikipedia.

Inserting and removing elements is quick, but not necessarily very simple in terms of code complexity. We have to change pointers in the current element, previous and next one, as well as handle special cases when the current element is the first one (head/front) or the last one (tail/back) – a lot of special cases, which may be error-prone.

Therefore it is worth encapsulating the logic inside some generic container class. The authors of the STL did this by defining the std::list class in the <list> header. It is a template, where each item of the list contains our type T plus the additional data needed – most likely pointers to the next and previous item.

struct MyStructure {
    int MyNumber;
};
std::list<MyStructure> list;

In other words, our structure is contained inside one that is defined internally by STL. After resolving template, it may look somewhat like this:

struct STL_ListItem {
    STL_ListItem *Prev, *Next;
    MyStructure UserData;
};

What if we want to do the opposite – to contain “utility” pointers needed to implement the list inside our custom structure? Maybe we have a structure already defined and cannot change it or maybe we want each item to be a member of two different lists, e.g. sorted by different criteria, and so to contain two pairs of previous-next pointers? A definition of such structure is easy to imagine, but can we still implement some generic class of a list to hide all the complex logic of inserting and removing elements, which would work on our own structure?

struct MyStructure {
    int MyNumber = 0;
    MyStructure *Prev = nullptr, *Next = nullptr;
};

If we could do that, such data structure could be called an “intrusive linked list”, just like an “intrusive smart pointer” is a smart pointer which keeps reference counter inside the pointed object. Actually, all that our IntrusiveLinkedList class needs to work with our custom item structure, besides the type itself, is a way to access the pointer to the previous and next element. I came up with an idea to provide this access using a technique called “type traits” – a separate structure that exposes specific interface to deliver information on some other type. In our case, it is to read (for const pointer) or access by reference (for non-const pointer) the previous and next pointer.

The traits structure for MyStructure may look like this:

struct MyStructureTypeTraits {
    typedef MyStructure ItemType;
    // Read-only access, usable with a const pointer.
    static ItemType* GetPrev(const ItemType* item) { return item->Prev; }
    static ItemType* GetNext(const ItemType* item) { return item->Next; }
    // Access by reference, so the list can modify the pointers.
    static ItemType*& AccessPrev(ItemType* item) { return item->Prev; }
    static ItemType*& AccessNext(ItemType* item) { return item->Next; }
};

By having this, we can implement a class IntrusiveLinkedList<ItemTypeTraits> that will hold a pointer to the first and last item on the list and be able to insert, remove, and do other operations on the list, using a custom structure of an item, with custom pointers to previous and next item inside.

IntrusiveLinkedList<MyStructureTypeTraits> list;

list.PushBack(new MyStructure{1});
list.PushBack(new MyStructure{2});

for(MyStructure* i = list.Front(); i; i = list.GetNext(i))
    printf("%d\n", i->MyNumber); // prints 1, 2

delete list.PopBack();
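To show how the traits are consumed, here is a minimal sketch of what such a list class might look like, limited to the operations used above. It is my simplified illustration, not the code from D3D12MemAlloc.cpp: the member names `m_Front` and `m_Back` and the exact behavior of `PopBack` on an empty list are assumptions.

```cpp
struct MyStructure {
    int MyNumber = 0;
    MyStructure *Prev = nullptr, *Next = nullptr;
};

struct MyStructureTypeTraits {
    typedef MyStructure ItemType;
    static ItemType* GetPrev(const ItemType* item) { return item->Prev; }
    static ItemType* GetNext(const ItemType* item) { return item->Next; }
    static ItemType*& AccessPrev(ItemType* item) { return item->Prev; }
    static ItemType*& AccessNext(ItemType* item) { return item->Next; }
};

// Simplified sketch: the list never allocates or frees items, it only
// rewires the Prev/Next pointers that live inside the items themselves.
template <typename ItemTypeTraits>
class IntrusiveLinkedList {
public:
    typedef typename ItemTypeTraits::ItemType ItemType;

    ItemType* Front() const { return m_Front; }
    ItemType* Back() const { return m_Back; }
    static ItemType* GetNext(const ItemType* item) { return ItemTypeTraits::GetNext(item); }

    // The item must not currently belong to any list using these pointers.
    void PushBack(ItemType* item) {
        ItemTypeTraits::AccessPrev(item) = m_Back;
        ItemTypeTraits::AccessNext(item) = nullptr;
        if (m_Back)
            ItemTypeTraits::AccessNext(m_Back) = item;
        else
            m_Front = item; // the list was empty
        m_Back = item;
    }

    // Unlinks and returns the last item, or nullptr if the list is empty.
    ItemType* PopBack() {
        ItemType* item = m_Back;
        if (!item)
            return nullptr;
        m_Back = ItemTypeTraits::GetPrev(item);
        if (m_Back)
            ItemTypeTraits::AccessNext(m_Back) = nullptr;
        else
            m_Front = nullptr; // the list became empty
        ItemTypeTraits::AccessPrev(item) = nullptr;
        return item;
    }

private:
    ItemType* m_Front = nullptr;
    ItemType* m_Back = nullptr;
};
```

Because the list only touches items through the traits, an item with two pairs of pointers could take part in two lists at once by providing two traits structures, one per pair.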

I know this is nothing special, there are probably many such implementations on the Internet already, but I am happy with the result as it fulfilled my specific need elegantly.

To see the full implementation of my IntrusiveLinkedList class, go to D3D12MemAlloc.cpp file in D3D12 Memory Allocator library. One caveat is that the class doesn't allocate or free memory for the list items – this must be done by the user.


# VkExtensionsFeaturesHelp - My New Library

Apr 2021

I had this idea for quite some time and finally I've spent last weekend coding it, so here it is: 611 lines of code (and many times more of documentation), shared for free under the MIT license:

**VkExtensionsFeaturesHelp**

Vulkan Extensions & Features Help, or VkExtensionsFeaturesHelp, is a small, header-only C++ library for developers who use the Vulkan API. It helps to avoid boilerplate code while creating VkInstance and VkDevice objects by providing a convenient way to query and then enable:

  • instance layers
  • instance extensions
  • instance feature structures
  • device features
  • device extensions
  • device feature structures

The library provides a domain-specific language to describe the list of required or supported extensions, features, and layers. The language is fully defined in terms of preprocessor macros, so no custom build step is needed.

Any feedback is welcome :)


# Myths About Floating-Point Numbers

Mar 2021

Floating-point numbers are a great invention in computer science, but they can also be tricky and troublesome to use correctly. I’ve written about them already by publishing Floating-Point Formats Cheatsheet and presentation “Pitfalls of floating-point numbers” (“Pułapki liczb zmiennoprzecinkowych” – the slides are in Polish). Last year I was preparing for a more extensive talk about this topic, but it got cancelled, like pretty much everything in these hard times of the COVID-19 pandemic. So in this post, I would like to approach this topic from a different angle.

A programmer can use floating-point numbers on different levels of understanding. A beginner would use them, trusting they are infinitely capable and precise, which can lead to problems. An intermediate programmer knows that they have some limitations, and so by using some good practices the problems can be avoided. An advanced programmer understands what is really going on inside these numbers and can use them with full awareness of what to expect from them. This post may help you jump from step 2 to step 3. Commonly adopted good practices are called “myths” here, but they are actually just generalizations and simplifications. They are useful for avoiding errors until you understand, on a deeper level, what is true and what is false about them.

1. They are not exact

It is not true that 2.0 + 2.0 can give 3.99999. It will always be 4.0. They are exact to the extent of their limited range and precision. If you assign a floating-point number some constant value, you can safely compare it with the same value later, even using the discouraged operator ==, as long as it is not a result of some calculations. Imprecisions don't come out of nowhere.
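A small C++ sketch of this point, assuming IEEE 754 `float` and `double` (which is what practically all current desktop compilers use). The function names are made up for illustration.

```cpp
// Constants and exactly representable results compare equal with ==.
bool ExactValuesCompareEqual() {
    float sum = 2.0f + 2.0f; // exactly 4.0, no rounding involved
    float quarter = 0.25f;   // a power of two, stored exactly
    return sum == 4.0f && quarter == 0.25f;
}

// Here the imprecision has a source: 0.1, 0.2 and 0.3 are each rounded to
// the nearest binary fraction, and the rounded sum of the first two is not
// the same double as the rounded 0.3.
bool SumOfTenthsEqualsThreeTenths() {
    double sum = 0.1 + 0.2;
    return sum == 0.3;
}
```

The first function returns true and the second returns false, which matches the rule above: the comparison is safe for stored constants, but not for values that went through calculations with inexact inputs.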

Instead of using integer loop iterator and converting it to float every time:

for(size_t i = 0; i < count; ++i)
{
    float f = (float)i;
    // Use f
}

You can do this, which can result in more efficient code:

for(float f = 0.f; f < (float)count; f += 1.f)
{
    // Use f
}

It is true, however, that your numbers may not look exactly as expected because:

  • Some fractions cannot be represented exactly – even some simple ones like decimal 0.1, which is binary 0.000110011… This is because we humans normally use the decimal system, while floating-point numbers, like other numbers inside computers, use the binary system – a different base.
  • There is a limited range of integer numbers that can be represented exactly. For 32-bit floats it is only 16,777,216. Above that, numbers start “jumping” every 2, then every 4, etc. So it is not a good idea to use floating-point numbers to represent file sizes if your files are bigger than 16 MB. If count in the example above was >16M, it would cause an infinite loop.

64-bit “double”, however, represents integers exactly up to 9,007,199,254,740,992, so it should be enough for most applications. No wonder that some scripting languages do just fine while supporting only “double” floating-point numbers and no integers at all.
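The 16,777,216 (2^24) limit is easy to check in code. Below is a minimal sketch assuming IEEE 754 32-bit `float` with the default round-to-nearest mode; the function names are hypothetical, and `volatile` is used only to force the value to be really stored as a 32-bit float.

```cpp
#include <cstdint>

// Returns true if the integer survives a round trip through a 32-bit float.
// Intended for values of moderate magnitude (far below 2^63).
bool IsExactInFloat(std::int64_t value) {
    volatile float f = static_cast<float>(value);
    return static_cast<std::int64_t>(f) == value;
}

// At 2^24, adding 1 produces a value halfway between two representable
// floats, which rounds back down to 2^24. This is why the float loop
// counter shown earlier can get stuck forever.
bool IncrementGetsStuck() {
    volatile float f = 16777216.0f; // 2^24
    float next = f + 1.0f;
    return next == 16777216.0f;
}
```

`IsExactInFloat(16777216)` is true, `IsExactInFloat(16777217)` is false, and `IsExactInFloat(16777218)` is true again, because above 2^24 only every second integer is representable.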

2. They are non-deterministic

It is not true that cosmic radiation will flip the least significant bit at random. Random number generators are also not involved. If you call the same function containing your floating-point calculations with the same input, you will get the same output. It is fully deterministic, like other computing. (Note: When old FPU instructions are generated rather than new SSE, this can be really non-deterministic and even a task switch may alter your numbers. See this tweet.)

It is true, however, that you may observe different results because:

  • Compiler optimizations can influence the result. If you implement two versions of your formula, similar but not exactly the same, the compiler may, for example, optimize (a * b + c) from doing MUL + ADD to FMA (fused multiply-add) instruction, which does the 3-argument operation in one step. FMA has higher precision, but can then give a different result than two separate instructions.
  • You may observe different results on different platforms – e.g. AMD vs Intel CPU or AMD vs NVIDIA GPU. This is because the floating-point standard (IEEE 754) requires correctly rounded results only for basic operations like addition, multiplication, division, and square root. For functions like sin, cos, etc. the precision is left to the implementation, so the exact result may vary in the least significant bits.
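The FMA effect from the first bullet can be reproduced deliberately. This is a minimal sketch assuming IEEE 754 `float` with round-to-nearest; `std::fma` is the standard library function from `<cmath>`, while `MulThenAdd` and `FusedMulAdd` are names I made up. The `volatile` is there only to stop the compiler from fusing the two operations on its own.

```cpp
#include <cmath>

// Two separate operations: the product is rounded to float before the add.
float MulThenAdd(float a, float b, float c) {
    volatile float product = a * b; // rounded here
    return product + c;             // rounded again here
}

// One fused operation: a * b + c is computed exactly and rounded once.
float FusedMulAdd(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

For example, with a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The separate multiplication rounds away the 2^-24 term, so `MulThenAdd` returns 0, while `FusedMulAdd` keeps it and returns 2^-24. Both results are deterministic; they just differ.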

I heard a story of a developer who tried to calculate hashes from the results of his floating-point calculations in a distributed system and discovered that records with what was supposed to be same data had different hashes on different machines.

I once had to investigate a user complaint about the following piece of shader code (in GLSL). The user said that on AMD graphics cards, for uv.x higher than 306, it always returns black color (zero).

vec4 fragColor = vec4(vec3(fract(sin(uv.x * 2300.0 * 12000.0))), 1.0);

I noticed that the value passed to the sine function is very high. The constant factor alone is 2300 * 12000 = 27,600,000, so for uv.x = 306 the argument is about 8.4 billion. If we recall from math classes that sine cycles between -1 and 1 every 2*PI ≈ 6.283185, and we take into consideration that above 16,777,216 a 32-bit float cannot represent all integers exactly but starts jumping every 2, then every 4, etc., we can conclude that we don't have enough precision to know whether our result should be -1, 1, or anything in between. It is just undefined.

I then asked the user what he was trying to achieve with this code, as the result is totally random. He said it is indeed supposed to be... a random number generator. The problem is that the result being always 0 is as valid as any other. The reason random numbers are generated on NVIDIA cards and not on AMD is that the sine instruction on AMD GPU architectures actually has a period of 1, not 2*PI. But it is still fully deterministic with regard to the input value. It just returns different results on different platforms.
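How coarse a 32-bit float becomes at such magnitudes can be measured with `std::nextafter` from `<cmath>`. This is a small CPU-side sketch in C++ assuming IEEE 754 `float`; it says nothing about how a particular GPU evaluates `sin`, and `SpacingAbove` is a hypothetical helper name.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable 32-bit float above it.
float SpacingAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For the shader above, `SpacingAbove(306.0f * 2300.0f * 12000.0f)` gives 512. Neighboring representable inputs are therefore about 81 full periods of sine apart (512 / 6.283185), so the fractional position within a period carries no usable information.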

3. NaN and INF are indication of an error

It is true that if you don’t expect them, their appearance may indicate an error, either in your formulas or in input data (e.g. numbers very large, very small and close to zero, or just garbage binary data). It is also true that they can cause trouble as they propagate through calculations, e.g. every operation with NaN returns NaN.

However, it is not true that these special values are just a means of returning an error or that they are not useful. They are perfectly valid special cases of the floating-point representation and have clearly defined behavior. For example, -INF is smaller and +INF is larger than any finite number. You can use this property to implement the following function with a clearly documented interface:

#include <cstddef>
#include <limits>

// Finds and returns maximum number from given array.
// For empty array returns -INF.
float CalculateMax(const float* a, size_t count)
{
    float max = -std::numeric_limits<float>::infinity();
    for(size_t i = 0; i < count; ++i)
        if(a[i] > max)
            max = a[i];
    return max;
}
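A short sketch of why -INF is a convenient result for the empty case: it is the neutral element of "max", so partial results can be combined without special cases. The function is repeated to keep the example self-contained; `CombinedMax` and `IsNaNByComparison` are hypothetical helpers added for illustration, and the NaN check assumes the code is not compiled with options like -ffast-math.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Finds and returns maximum number from given array.
// For empty array returns -INF.
float CalculateMax(const float* a, std::size_t count) {
    float max = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < count; ++i)
        if (a[i] > max)
            max = a[i];
    return max;
}

// Maximum of two arrays, either of which may be empty. No "if empty"
// branch is needed, because -INF never wins against a real value.
float CombinedMax(const float* a, std::size_t countA,
                  const float* b, std::size_t countB) {
    return std::max(CalculateMax(a, countA), CalculateMax(b, countB));
}

// NaN is the only value that compares unequal to itself.
bool IsNaNByComparison(float x) {
    return x != x;
}
```

Note that any comparison with NaN is false, so a NaN element is never selected by `a[i] > max` and is silently skipped by this function.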


As you can see, common beliefs about floating-point numbers - that they are not exact, non-deterministic, or that NaN and INF are an indication of an error, are some generalizations and simplifications that can help to avoid errors, but they don’t tell the full story. To really understand what's going on on a deeper level:

  • Keep in mind which values in your program are just input data or constants and which are results of some calculations.
  • Know the capabilities and limitations of floating-point types - their maximum range, minimum possible number, precision in terms of binary or decimal places, maximum integer represented exactly, etc.
  • Learn about how floating point numbers are stored, bit by bit.
  • Learn about special values - INF, NaN, positive and negative zero, denormals. Understand how they behave in computations.
  • Take a look at assembly generated by the compiler to see how CPU or GPU really operates on your numbers.

Update 2021-06-09: This article has been published as a guest post on C++ Stories and spawned an interesting discussion on Reddit that is worth reading.


Copyright © 2004-2021