Blog

# Book Review: C++ Lambda Story

Wed
13
Jan 2021

C++ Lambda Story book

Courtesy its author Bartłomiej Filipek, I was given an opportunity to read a new book “C++ Lambda Story”. Here is my review.

A book with 149 pages would be too short to teach entire C++, even in its basics, but this one is about a specific topic, just one feature of the language – lambda expressions. For this, it may seem like even too many, depending on how deeply the author goes into the details. To find out, I read the book over 3 evenings.

The book starts from the very beginning – the description of the problem in C++98/03, where lambdas were not available in the language, so we had to write functors - structs or classes with overloaded operator(). Then he moves on to describing lambdas as introduced in C++11, their syntax and all the features. Every feature described is accompanied by a short and clear example. These examples also have links to the same code available in online compilers like Wandbox, Compiler Explorer, or Coliru.

In the next chapters, the author describes what has been added to lambdas in new language revisions – C++14, C++17, C++20, and how other new features introduced to the language interact with lambdas – e.g. consteval and concepts from C++20.

Not only features of lambda expressions are described but also some quirks that every programmer should know. What if a lambda outlives the scope where it was created? What may happen when it is called on multiple threads in parallel? The book answers all these questions, illustrating each with a short, yet complete example.

Sometimes the author describes tricks that may seem too sophisticated. It turns out you can make a recursive call of your lambda, despite not directly supported, by defining a helper generic lambda inside your lambda. You can also derive your class from the implicit class defined by a lambda, or many of them, to have your operator() overloaded for different parameter types.

Your tolerance to such tricks depends on whether you are a proponent of “modern C++” and using all its features, or you prefer simple code, more like “C with classes”. Nonetheless, lambda expressions by themselves are a great language feature, useful in many cases. The book mentions some of these cases, as it ends with a chapter “Top Five Advantages of C++ Lambda Expressions”.

Overall, I like the book a lot. It describes this specific topic of lambda expressions in C++ comprehensively, but still in a concise and clear way. I recommend it to every C++ programmer. Because it is not very long, you shouldn’t hesitate with reading it like it was a new project you need to find time for. You should rather treat it like an additional, valuable learning resource, as if you read several blog articles or watched some YouTube videos about a topic of your interest.

You can buy the book on Leanpub. I also recommend visiting authors blog: C++ Stories (new blog converted from bfilipek.com). See also my review of his previous book: “C++17 in Detail”. There is a discount for “C++ Lambda Story” ($5.99 instead of $8.99) here, as well as for both books ($19.99 instead of $23.99) here - valid until the end of February 2021.

Comments | #books #c++ Share

# States and Barriers of Aliasing Render Targets

Tue
22
Dec 2020

In my recent post “Initializing DX12 Textures After Allocation and Aliasing” I said that when you use Direct3D 12, you have a render-target or depth-stencil texture “B” aliasing with some other resources “A” in memory and want to use it, you need to initialize it properly every time: 1. issue a barrier of type D3D12_RESOURCE_BARRIER_TYPE_ALIASING, 2. fully initialize its content using Clear, Discard, or Copy, so that garbage data left from previous usage of the memory as overwritten.

That is actually not the full story. There is a third thing that you need to take care of: a transition barrier for this resource. This is because functions: DiscardResource, ClearRenderTargetView, ClearDepthStencilView require a texture to be in a correct state – D3D12_RESOURCE_STATE_RENDER_TARGET or D3D12_RESOURCE_STATE_DEPTH_WRITE. A question arises here: How does resource state tracking interact with memory aliasing?

If you look at error and warning messages from D3D Debug Layer or PIX, you notice that states are tracked per resource, no matter if they alias in memory. This gives us some clue of how we should handle it correctly to tame the Debug Layer and make things formally correct, but it is still not that obvious where to put the barrier. Let's assume the last usage of our texture B is D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE – the texture is used for sampling. Where should we put the barrier to transition it to D3D12_RESOURCE_STATE_RENDER_TARGET, required to do the initializing Discard or Clear after aliasing? I can see 3 solutions here:

1. None at all – skip this barrier, just do the Discard/Clear as if the texture was in the correct RT/DS state. It obviously generates Debug Layer error, but it seems to work fine. My thinking about why this may be fine is: If the Discard/Clear requires the texture to be in the correct RT/DS state and it completely overwrites its content, then it shouldn’t care what initial state the texture is in. Its data are garbage anyway. It will most probably leave the texture initialized properly as for RT/DS state, no matter what.

2. Between the aliasing barrier and Discard/Clear. This would please the Debug Layer, but I have some doubts whether it is correct or necessary. It may not be correct because if the barrier does some transformations of the data, it would transform garbage data left after aliasing, which could be an undefined behavior leading to some persisting corruptions or even to GPU crash. It may not be necessary because we are going to re-initialize the whole content of the texture in a moment anyway with Discard/Clear.

3. After finished working with the texture in the previous frame. The idea is to leave each RT/DS texture that can alias with some other resource in a state that makes it ready for the initializing Discard/Clear next time we want to grab it from the common memory. This will work and will be formally correct, but it also sounds like a well-known and sub-optimal idea of reverting textures to some “base state” after each use. Additional problem appears when your last usage of the texture occurs on the compute queue, because there you cannot transition it to RENDER_TARGET or DEPTH_WRITE state.

Conclusion: While this is a tricky question that has no one simple answer, I would recommend to use [3] if you want to be 100% formally correct or [1] if you want maximum efficiency.

Comments | #directx #rendering Share

# CasCmdLine - Few Technical Details

Sun
15
Nov 2020

As part of my job, I've written a small console program CasCmdLine with a purpose of testing AMD's FidelityFX Contrast Adaptive Sharpening (CAS) shader on an image from disk, e.g. a screenshot from your game. You can find binary and source code on github.com/GPUOpen-Effects/FidelityFX-CAS/, CasCmdLine subdirectory. See also the blog post and tutorial about it to learn about its features and the syntax of supported command line parameters.

Here I would like to point to three aspects of its implementation that allowed me to make it small and simple. They might interest you if you are a C++/Windows/graphics programmer.

1. To execute a compute shader like CAS, I needed to use a graphics API - Direct3D 11, 12, or Vulkan, as all of them are supported by the effect. I chose D3D11 as the easiest one. What’s interesting is that the API is used without creating a window or swap chain. There are no render frames, no calls to Present, no depth-stencil texture, no message loop. D3D11CreateDevice is used to initialize DirectX rather than D3D11CreateDeviceAndSwapChain. The program just initializes all necessary machinery, does its job, and exits. It is perfectly possible to write a program this way, which may be a good idea for any application that needs to do some GPU-accelerated computations rather than interactive graphics like games do. I suspect this mode of operation would work even on server systems that have no monitor attached, as long as there is a GPU and graphics driver installed. See file “CasCmdLine.cpp” to find out how this is implemented.

2. There is always a question in every graphics app about how to load shaders. Surely, compiling them from HLSL/GLSL source code is the worst option, as it requires the user to have shader compiler installed or attach the compiler to your program. It also takes more time than loading shaders precompiled to the intermediate binary format. But even in this format they need to be loaded from somewhere, whether individual files or some custom compressed archive, like games tend to do. In CasCmdLine I did it differently. I attached precompiled shaders directly to the program binary. To do that, I used command line parameter /Fh of the "fxc.exe" shader compiler, like this:

fxc.exe /T cs_5_0 /E mainCS /O3 /Fh CompiledShader.h ShaderSource.hlsl

Instead of a binary file, the compiler called with this parameter generates a text file in a format compatible with C/C++ that contains the data of the compiled shader in form of an array, like this:

#if 0
Shader metadata and assembly is put here, as commented out code...
#endif

const BYTE g_mainCS[] =
{
     68,  88,  66,  67,   8, 233, 
     11,  94, 141, 165,  83, 251, 
     50, 166, 219, 219,  84, 109, 
    128,  23,   1,   0,   0,   0, 
    (...)
};

Such file can be #include-d in a C++ code and used to create a D3D shader directly from this data. See files "Shaders/CompiledShader_*.h" to find out how they really look like.

3. The program needs to load and save image files in JPEG, PNG, and preferably other formats. Of course, these formats are very complex, support various pixel formats, involve some compression algorithms etc., so handling them manually would require an enormous amount of work. There are libraries for this, like the official libpng and libjpeg for handling PNG/JPEG formats, respectively, or a multi-format, multi-platform library DevIL.

If the developed program is intended only for Windows, it turns out that no third-party libraries are needed. Native Windows API contains a part called Windows Imaging Component (WIC) that can load and save image files in many formats, including BMP, PNG, JPEG, TIFF, GIF, ICO, WMP, DDS. It can also do some image operations, like rescaling. It is a COM API that involves interfaces like IWICImagingFactory, IWICBitmapDecoder, IWICBitmapFrameDecode, and many more. This is what I used in the program described here. I might write a tutorial about WIC someday... For now, I would just say if you figure out its API, it looks quite powerful. It might be useful for any graphics Windows app that needs to load textures. It is also what Microsoft's DirectXTex library uses under the hood.

Comments | #directx #rendering Share

# Bezier Curve as Easing Function

Sat
24
Oct 2020

Bézier curves are named after Pierre Bézier, and primary used is geometry modeling. They are good at describing various shapes in 2D and 3D. A Bézier curve is a function x(t), y(t) - it gives points in space (x, y) for some parameter t = 0..1. But nowadays they are also used in computer graphics for animation, as easing functions. There, we need to evaluate y(x), because x is the time parameter and y is the evaluated variable.

How does the formula of a Bézier curve look like as y(x)? What constraints do the 4 control points need to meet for this function to be correct - to have only one value of y for each x, with no loops or arcs? Finally, how can this function be approximated to store it in computer memory and evaluate it efficiently in modern game engines? These sound like fundamental questions, but apparently no one researched this topic thoroughly before, so it became the subject of the Ph.D. thesis of my friend Łukasz Izdebski.

A part of his research has just been published as paper "Bézier Curve as a Generalization of the Easing Function in Computer Animation" in Advances in Computer Graphics, 37th Computer Graphics International Conference, CGI 2020, Geneva, Switzerland. We want to share an excerpt of his findings online as an article: Bezier Curve as Easing Function.

Comments | #math #rendering Share

# A Better Way to Scalarize a Shader

Tue
20
Oct 2020

This will be an advanced article. It assumes you not only know how to write shaders but also how they work on a low level (like vector versus scalar registers) and how to optimize them using scalarization. It all starts from a need to index into an array of texture or buffer descriptors, where the index is dynamic – it may vary from pixel to pixel. This is useful e.g. when doing bindless-style rendering or blending various layers of textures e.g. on a terrain. To make it working properly in a HLSL shader, you need to surround the indexing operation with a pseudo-function NonUniformResourceIndex. See also my old blog post “Direct3D 12 - Watch out for non-uniform resource index!”.

Texture2D g_Textures[] : register(t1);
...
return g_Textures[NonUniformResourceIndex(textureIndex)].Load(pos);

In many cases, it is enough. The driver will do its magic to make things working properly. But if your logic dependent on textureIndex is more complex than a single Load or SampleGrad, e.g. you sample multiple textures or do some calculations (let's call it MyDynamicTextureIndexing), then it might be beneficial to scalarize the shader manually using a loop and wave functions from HLSL Shader Model 6.0.

I learned how to do scalarization from the 2-part article “Intro to GPU Scalarization” by Francesco Cifariello Ciardi and the presentation “Improved Culling for Tiled and Clustered Rendering” by Michał Drobot, linked from it. Both sources propose an implementation like the following HLSL snippet:

// WORKING, TRADITIONAL
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
while(any(currThreadMask & activeThreadsMask) != 0)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   uint2 scalarTextureIndexThreadMask = WaveActiveBallot(scalarTextureIndex == textureIndex).xy;
   activeThreadsMask &= ~scalarTextureIndexThreadMask;
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
   }
}
return color;

It involves a bit mask of active threads. From the moment I first saw this code, I started wondering: Why is it needed? A mask of threads that still want to continue spinning the loop is already maintained implicitly by the shader compiler. Couldn't we just break; from the loop when done with the textureIndex of the current thread?! So I wrote this short piece of code:

// BAD, CRASHES
float4 color = float4(0.0, 0.0, 0.0, 0.0);
while(true)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
       break;
   }
}
return color;

…and it crashed my GPU. At first I thought it may be a bug in the shader compiler, but then I recalled footnote [2] in part 2 of the scalarization tutorial, which mentions an issue with helper lanes. Let me elaborate on this. When a shader is executed in SIMT fashion, individual threads (lanes) may be active or inactive. Active lanes are these that do their job. Inactive lanes may be inactive from the very beginning because we are at the edge of a triangle so there are not enough pixels to make use of all the lanes or may be disabled temporarily because e.g. we are executing an if section that some threads didn't want to enter. But in pixel shaders there is a third kind of lanes – helper lanes. These are used instead of inactive lanes to make sure full 2x2 quads always execute the code, which is needed to calculate derivatives ddx/ddy, also done explicitly when sampling a texture to calculate the correct mip level. A helper lane executes the code (like an active lane), but doesn't export its result to the render target (like an inactive lane).

As it turns out, helper lanes also don't contribute to wave functions – they work like inactive lanes. Can you already see the problem here? In the loop shown above, it may happen than a helper lane has its textureIndex different from any active lanes within a wave. It will then never get its turn to process it in a scalar fashion, so it will fall into an infinite loop, causing GPU crash (TDR)!

Then I thought: What if I disable helper lanes just once, before the whole loop? So I came up with the following shader. It seems to work fine. I also think it is better than the first solution, as it operates on the thread bit mask only once at the beginning and so uses fewer variables to be stored in GPU registers and does fewer calculations in every loop iteration. Now I'm thinking whether there is something wrong with my idea that I can't see now? Or did I just invent a better way to scalarize shaders?

// WORKING, NEW
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
[branch]
if(any((currThreadMask & activeThreadsMask) != 0))
{
   while(true)
   {
       uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
       [branch]
       if(scalarTextureIndex == textureIndex)
       {
           color = MyDynamicTextureIndexing(textureIndex);
           break;
       }
   }
}
return color;

UPDATE 2020-10-28: There are some valuable comments under my tweet about this topic that I recommend to check out.

Comments | #optimization #directx #gpu Share

# Which Values Are Scalar in a Shader?

Wed
14
Oct 2020

GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.

Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.

But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correctly to say SIMT. For the explanation of the difference see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.

Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow to exchange data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may allow to optimize shader performance.

This another level of grouping yields a possibility for a variable to be or not to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call it scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveReadLaneFirst, WaveActiveMax, WaveActiveAllTrue return the same value for all threads, so it’s always scalar.

Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing buffer or texture array requires special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on shader compiler at Intel, helping to finish DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.

Another such case is a function WaveReadLaneAt which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered it actually works fine to use a dynamically varying value for it, like Ken Hu in his blog post “HLSL pitfalls”. Now Microsoft formally admitted that it is working and allowed LaneIndex to be any value by making this GitHub commit to their documentation.

If this is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.

SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search specifications for this topic, but it seems obvious that if a groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.

SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.

SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. Article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with a known fixed hardware, but it is not formally correct to assume this on any PC. Neither D3D nor Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a driver in a new version may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU couldn’t appear in the future that is 128-lane wide. WaveGetLaneCount function says “the result will be between 4 and 128”. Such GPU would execute entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.

Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.

Comments | #gpu #directx #vulkan #optimization Share

# System Value Semantics in Compute Shaders - Cheat Sheet

Tue
29
Sep 2020

After compute shaders appeared, programmers no longer need to pretend they do graphics and render pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such shader is a short (or not so short) program to be invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which is the current thread. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operation on some data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.

This raises a question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose and so does GLSL by defining equivalent built-in variables. For long time I couldn't remember their names, as the ones in HLSL are quite misleading. If GroupID is an ID of the entire group, and GroupThreadID is an ID of the thread within a group, GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually an index of a single thread within a group. GLSL is more consistent in this regard, clearly stating "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of their SVs with a picture on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in form of a table, a small cheat sheet:

HLSL SemanticsGLSL VariableType (Dimension)UnitReference
SV_GroupIDgl_WorkGroupIDuint3 (3D)Entire groupGlobal in dispatch
SV_GroupThreadIDgl_LocalInvocationIDuint3 (3D)Single threadLocal in group
SV_DispatchThreadIDgl_GlobalInvocationIDuint3 (3D)Single threadGlobal in dispatch
SV_GroupIndexgl_LocalInvocationIndexuint (flattened)Single threadLocal in group

Comments | #gpu #directx #opengl #vulkan Share

# AquaFish 2 - My Game From 2009

Thu
06
Aug 2020

I've made a short video showing a game I developed more than 10 years ago: AquaFish 2. It was my first commercial project, published by Play Publishing, developed using my custom engine The Final Quest.

Comments | #productions #history Share

Older entries >

Twitter

Pinboard Bookmarks

LinkedIn

Blog Tags

[Download] [Dropbox] [pub] [Mirror] [Privacy policy]
Copyright © 2004-2020