
# CasCmdLine - Few Technical Details

Sun 15 Nov 2020

As part of my job, I've written a small console program, CasCmdLine, for testing AMD's FidelityFX Contrast Adaptive Sharpening (CAS) shader on an image from disk, e.g. a screenshot from your game. You can find the binary and source code on github.com/GPUOpen-Effects/FidelityFX-CAS/, in the CasCmdLine subdirectory. See also the blog post and tutorial about it to learn about its features and the syntax of supported command line parameters.

Here I would like to point to three aspects of its implementation that allowed me to make it small and simple. They might interest you if you are a C++/Windows/graphics programmer.

1. To execute a compute shader like CAS, I needed to use a graphics API - Direct3D 11, 12, or Vulkan, as all of them are supported by the effect. I chose D3D11 as the easiest one. What’s interesting is that the API is used without creating a window or swap chain. There are no render frames, no calls to Present, no depth-stencil texture, no message loop. D3D11CreateDevice is used to initialize DirectX rather than D3D11CreateDeviceAndSwapChain. The program just initializes all necessary machinery, does its job, and exits. It is perfectly possible to write a program this way, which may be a good idea for any application that needs to do some GPU-accelerated computations rather than interactive graphics like games do. I suspect this mode of operation would work even on server systems that have no monitor attached, as long as there is a GPU and graphics driver installed. See file “CasCmdLine.cpp” to find out how this is implemented.

2. There is always a question in every graphics app about how to load shaders. Surely, compiling them from HLSL/GLSL source code is the worst option, as it requires the user to have a shader compiler installed or requires you to bundle the compiler with your program. It also takes more time than loading shaders precompiled to the intermediate binary format. But even in this format they need to be loaded from somewhere, whether from individual files or from some custom compressed archive, like games tend to do. In CasCmdLine I did it differently. I attached precompiled shaders directly to the program binary. To do that, I used the /Fh command line parameter of the "fxc.exe" shader compiler, like this:

fxc.exe /T cs_5_0 /E mainCS /O3 /Fh CompiledShader.h ShaderSource.hlsl

Instead of a binary file, the compiler called with this parameter generates a text file in a format compatible with C/C++ that contains the data of the compiled shader in the form of an array, like this:

#if 0
Shader metadata and assembly is put here, as commented out code...
#endif

const BYTE g_mainCS[] =
{
     68,  88,  66,  67,   8, 233, 
     11,  94, 141, 165,  83, 251, 
     50, 166, 219, 219,  84, 109, 
    128,  23,   1,   0,   0,   0, 
    (...)
};

Such a file can be #include-d in C++ code and used to create a D3D shader directly from this data. See files "Shaders/CompiledShader_*.h" to find out what they really look like.
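As a rough illustration of what "using the array directly" means, here is a minimal, portable C++ sketch. It reproduces only the 24 bytes shown above (a real header contains the whole blob), and LooksLikeDxbc is a hypothetical helper of mine, not something from CasCmdLine. It relies on the fact that the first four bytes, 68 88 66 67, are the ASCII characters "DXBC".

```cpp
#include <cassert>
#include <cstddef>

// On Windows this typedef comes from <windows.h>; defined here to keep the sketch portable.
using BYTE = unsigned char;

// Stand-in for the array generated by fxc.exe /Fh.
// Only the first 24 bytes shown in the post are reproduced; the real array is much longer.
const BYTE g_mainCS[] =
{
     68,  88,  66,  67,   8, 233,
     11,  94, 141, 165,  83, 251,
     50, 166, 219, 219,  84, 109,
    128,  23,   1,   0,   0,   0,
};

// Hypothetical helper: checks that the data starts with the "DXBC" container tag.
bool LooksLikeDxbc(const BYTE* data, std::size_t size)
{
    assert(data != nullptr);
    return size >= 4 &&
        data[0] == 'D' && data[1] == 'X' && data[2] == 'B' && data[3] == 'C';
}

// With a real ID3D11Device* and the complete array, the shader would be created like this:
//   device->CreateComputeShader(g_mainCS, sizeof(g_mainCS), nullptr, &computeShader);
```

Because the array has a known size at compile time, sizeof(g_mainCS) gives the bytecode length directly and no file I/O is involved.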

3. The program needs to load and save image files in JPEG, PNG, and preferably other formats. Of course, these formats are very complex, support various pixel formats, involve some compression algorithms etc., so handling them manually would require an enormous amount of work. There are libraries for this, like the official libpng and libjpeg for handling PNG/JPEG formats, respectively, or a multi-format, multi-platform library DevIL.

If the developed program is intended only for Windows, it turns out that no third-party libraries are needed. The native Windows API contains a part called Windows Imaging Component (WIC) that can load and save image files in many formats, including BMP, PNG, JPEG, TIFF, GIF, ICO, WMP, DDS. It can also do some image operations, like rescaling. It is a COM API that involves interfaces like IWICImagingFactory, IWICBitmapDecoder, IWICBitmapFrameDecode, and many more. This is what I used in the program described here. I might write a tutorial about WIC someday... For now, I would just say that once you figure out its API, it looks quite powerful. It might be useful for any Windows graphics app that needs to load textures. It is also what Microsoft's DirectXTex library uses under the hood.

#directx #rendering

# Bezier Curve as Easing Function

Sat 24 Oct 2020

Bézier curves are named after Pierre Bézier and are primarily used in geometry modeling. They are good at describing various shapes in 2D and 3D. A Bézier curve is a function x(t), y(t) - it gives points in space (x, y) for some parameter t = 0..1. But nowadays they are also used in computer graphics for animation, as easing functions. There, we need to evaluate y(x), because x is the time parameter and y is the evaluated variable.

What does the formula of a Bézier curve look like as y(x)? What constraints do the 4 control points need to meet for this function to be correct - to have only one value of y for each x, with no loops or arcs? Finally, how can this function be approximated to store it in computer memory and evaluate it efficiently in modern game engines? These sound like fundamental questions, but apparently no one researched this topic thoroughly before, so it became the subject of the Ph.D. thesis of my friend Łukasz Izdebski.

A part of his research has just been published as the paper "Bézier Curve as a Generalization of the Easing Function in Computer Animation" in Advances in Computer Graphics, 37th Computer Graphics International Conference, CGI 2020, Geneva, Switzerland. We want to share an excerpt of his findings online as an article: Bezier Curve as Easing Function.
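To make the y(x) problem concrete, here is a minimal C++ sketch of the simplest approach I can think of. It is not the method from the paper. It assumes a cubic curve with end points fixed at (0, 0) and (1, 1) and both inner control points having x in [0, 1] (the same restriction CSS puts on cubic-bezier), which is sufficient for x(t) to be monotonic, and it finds t for a given x by bisection. The names Bezier01 and BezierEase are mine.

```cpp
#include <cassert>

// One coordinate of a cubic Bezier curve whose end values are fixed at 0 and 1.
// p1 and p2 are the corresponding coordinates of the two inner control points.
double Bezier01(double p1, double p2, double t)
{
    const double u = 1.0 - t;
    return 3.0 * u * u * t * p1 + 3.0 * u * t * t * p2 + t * t * t;
}

// Hypothetical helper: evaluates the easing function y(x) for x in [0, 1].
// Assumes x1 and x2 lie in [0, 1], so that x(t) never decreases.
double BezierEase(double x1, double y1, double x2, double y2, double x)
{
    assert(x1 >= 0.0 && x1 <= 1.0 && x2 >= 0.0 && x2 <= 1.0);
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < 50; ++i) // Bisection: narrow down t such that x(t) == x.
    {
        const double mid = 0.5 * (lo + hi);
        if (Bezier01(x1, x2, mid) < x)
            lo = mid;
        else
            hi = mid;
    }
    return Bezier01(y1, y2, 0.5 * (lo + hi));
}
```

For example, BezierEase(0.42, 0.0, 0.58, 1.0, x) evaluates a symmetric ease-in-out curve. An iterative search like this on every evaluation is exactly the kind of cost that motivates looking for a better approximation.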

#math #rendering

# A Better Way to Scalarize a Shader

Tue 20 Oct 2020

This will be an advanced article. It assumes you not only know how to write shaders but also how they work on a low level (like vector versus scalar registers) and how to optimize them using scalarization. It all starts from a need to index into an array of texture or buffer descriptors, where the index is dynamic – it may vary from pixel to pixel. This is useful e.g. when doing bindless-style rendering or blending various layers of textures e.g. on a terrain. To make it work properly in an HLSL shader, you need to surround the indexing operation with a pseudo-function NonUniformResourceIndex. See also my old blog post “Direct3D 12 - Watch out for non-uniform resource index!”.

Texture2D g_Textures[] : register(t1);
...
return g_Textures[NonUniformResourceIndex(textureIndex)].Load(pos);

In many cases, it is enough. The driver will do its magic to make things work properly. But if your logic dependent on textureIndex is more complex than a single Load or SampleGrad, e.g. you sample multiple textures or do some calculations (let's call it MyDynamicTextureIndexing), then it might be beneficial to scalarize the shader manually using a loop and wave functions from HLSL Shader Model 6.0.

I learned how to do scalarization from the 2-part article “Intro to GPU Scalarization” by Francesco Cifariello Ciardi and the presentation “Improved Culling for Tiled and Clustered Rendering” by Michał Drobot, linked from it. Both sources propose an implementation like the following HLSL snippet:

// WORKING, TRADITIONAL
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
while(any((currThreadMask & activeThreadsMask) != 0))
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   uint2 scalarTextureIndexThreadMask = WaveActiveBallot(scalarTextureIndex == textureIndex).xy;
   activeThreadsMask &= ~scalarTextureIndexThreadMask;
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
   }
}
return color;

It involves a bit mask of active threads. From the moment I first saw this code, I started wondering: Why is it needed? A mask of threads that still want to continue spinning the loop is already maintained implicitly by the shader compiler. Couldn't we just break; from the loop when done with the textureIndex of the current thread?! So I wrote this short piece of code:

// BAD, CRASHES
float4 color = float4(0.0, 0.0, 0.0, 0.0);
while(true)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
       break;
   }
}
return color;

…and it crashed my GPU. At first I thought it might be a bug in the shader compiler, but then I recalled footnote [2] in part 2 of the scalarization tutorial, which mentions an issue with helper lanes. Let me elaborate on this. When a shader is executed in SIMT fashion, individual threads (lanes) may be active or inactive. Active lanes are those that do their job. Inactive lanes may be inactive from the very beginning, because we are at the edge of a triangle and there are not enough pixels to make use of all the lanes, or they may be disabled temporarily, e.g. because we are executing an if section that some threads didn't want to enter. But in pixel shaders there is a third kind of lane – helper lanes. These are used instead of inactive lanes to make sure full 2x2 quads always execute the code, which is needed to calculate the derivatives ddx/ddy. This is also done implicitly when sampling a texture, to calculate the correct mip level. A helper lane executes the code (like an active lane), but doesn't export its result to the render target (like an inactive lane).

As it turns out, helper lanes also don't contribute to wave functions – they work like inactive lanes. Can you already see the problem here? In the loop shown above, it may happen that a helper lane has its textureIndex different from that of any active lane within the wave. It will then never get its turn to process it in a scalar fashion, so it will fall into an infinite loop, causing a GPU crash (TDR)!
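The problem can be reproduced on the CPU with a toy model. The following C++ sketch is only an illustration under simplifying assumptions: Lane, LaneKind, and SimulateNaiveLoop are hypothetical names, and the model of WaveReadLaneFirst simply takes the value from the first active lane that is still in the loop, skipping helper lanes as described above. Real hardware is more complicated than that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class LaneKind { Active, Helper };

struct Lane
{
    LaneKind kind;
    std::uint32_t textureIndex;
};

// Hypothetical CPU model of the naive while(true) loop with break.
// Returns the number of iterations after which all lanes have left the loop,
// or -1 if some helper lane can never leave (the infinite loop that ends in a TDR).
int SimulateNaiveLoop(const std::vector<Lane>& wave)
{
    assert(!wave.empty());
    std::vector<bool> inLoop(wave.size(), true);
    for (int iteration = 1;; ++iteration)
    {
        // Model of WaveReadLaneFirst: helper lanes are not considered.
        bool found = false;
        std::uint32_t scalarTextureIndex = 0;
        for (std::size_t i = 0; i < wave.size() && !found; ++i)
        {
            if (inLoop[i] && wave[i].kind == LaneKind::Active)
            {
                scalarTextureIndex = wave[i].textureIndex;
                found = true;
            }
        }
        if (!found)
            return -1; // Only helper lanes are left and nothing will ever release them.

        // Lanes whose index matches do their work and break out of the loop.
        bool anyLeft = false;
        for (std::size_t i = 0; i < wave.size(); ++i)
        {
            if (inLoop[i] && wave[i].textureIndex == scalarTextureIndex)
                inLoop[i] = false;
            anyLeft = anyLeft || inLoop[i];
        }
        if (!anyLeft)
            return iteration;
    }
}
```

With only active lanes, the loop finishes after as many iterations as there are distinct texture indices. A single helper lane with an index that no active lane shares is enough to make the model report -1.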

Then I thought: What if I disable helper lanes just once, before the whole loop? So I came up with the following shader. It seems to work fine. I also think it is better than the first solution, as it operates on the thread bit mask only once at the beginning, so it needs fewer variables stored in GPU registers and does fewer calculations in every loop iteration. Now I'm wondering: Is there something wrong with my idea that I can't see? Or did I just invent a better way to scalarize shaders?

// WORKING, NEW
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
[branch]
if(any((currThreadMask & activeThreadsMask) != 0))
{
   while(true)
   {
       uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
       [branch]
       if(scalarTextureIndex == textureIndex)
       {
           color = MyDynamicTextureIndexing(textureIndex);
           break;
       }
   }
}
return color;

UPDATE 2020-10-28: There are some valuable comments under my tweet about this topic that I recommend checking out.

#optimization #directx #gpu

# Which Values Are Scalar in a Shader?

Wed 14 Oct 2020

GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have a constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.

Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.

But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correct to say SIMT. For the explanation of the difference see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.

Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow exchanging data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may allow you to optimize shader performance.

This additional level of grouping makes it possible for a variable to be or not to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call it scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveReadLaneFirst, WaveActiveMax, WaveActiveAllTrue return the same value for all threads, so their result is always scalar.

Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing a buffer or texture array requires the special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on the shader compiler at Intel, helping to finish the DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.

Another such case is the function WaveReadLaneAt, which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered it actually works fine to use a dynamically varying value for it, like Ken Hu in his blog post “HLSL pitfalls”. Now Microsoft has formally admitted that it works and allowed LaneIndex to be any value by making this GitHub commit to their documentation.

If it is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.

SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search specifications for this topic, but it seems obvious that if groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.

SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually the opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.

SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. The article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with known, fixed hardware, but it is not formally correct to assume this on any PC. Neither the D3D nor the Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a new version of a driver may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU that is 128 lanes wide won't appear in the future. The documentation of the WaveGetLaneCount function says “the result will be between 4 and 128”. Such a GPU would execute the entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.
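The wave-width part of this argument is easy to check with a bit of arithmetic. Below is a small C++ sketch, with a hypothetical function name, that assumes threads of a group are packed into waves in row-major order (which, again, no specification guarantees) and tests whether SV_GroupThreadID.y comes out the same for all threads of every wave.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: for a group of sizeX x sizeY x 1 threads, packed into waves of
// waveSize lanes in row-major order (an assumption, not guaranteed by D3D or Vulkan),
// is SV_GroupThreadID.y identical for all threads of each wave?
bool IsGroupThreadIdYScalar(std::uint32_t sizeX, std::uint32_t sizeY, std::uint32_t waveSize)
{
    assert(sizeX > 0 && sizeY > 0 && waveSize > 0);
    const std::uint32_t threadCount = sizeX * sizeY;
    for (std::uint32_t flatIndex = 0; flatIndex < threadCount; ++flatIndex)
    {
        // Flattened index of the first thread that lands in the same wave.
        const std::uint32_t firstInWave = flatIndex - flatIndex % waveSize;
        // With row-major packing, y is the flattened index divided by the row length.
        if (flatIndex / sizeX != firstInWave / sizeX)
            return false;
    }
    return true;
}
```

For [numthreads(64,2,1)] this returns true with 32 or 64 lanes per wave and false with 128 lanes. The 16x2x1 case cannot even be expressed in this model, because it breaks the row-major assumption itself.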

Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.

#gpu #directx #vulkan #optimization

# System Value Semantics in Compute Shaders - Cheat Sheet

Tue 29 Sep 2020

After compute shaders appeared, programmers no longer need to pretend they do graphics and render pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such a shader is a short (or not so short) program to be invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which is the current thread. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operation on some data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.

This raises a question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose and so does GLSL by defining equivalent built-in variables. For a long time I couldn't remember their names, as the ones in HLSL are quite misleading. If GroupID is an ID of the entire group, and GroupThreadID is an ID of the thread within a group, GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually an index of a single thread within a group. GLSL is more consistent in this regard, clearly stating "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of their SVs with a picture on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in the form of a table, a small cheat sheet:

| HLSL Semantics | GLSL Variable | Type (Dimension) | Unit | Reference |
|---|---|---|---|---|
| SV_GroupID | gl_WorkGroupID | uint3 (3D) | Entire group | Global in dispatch |
| SV_GroupThreadID | gl_LocalInvocationID | uint3 (3D) | Single thread | Local in group |
| SV_DispatchThreadID | gl_GlobalInvocationID | uint3 (3D) | Single thread | Global in dispatch |
| SV_GroupIndex | gl_LocalInvocationIndex | uint (flattened) | Single thread | Local in group |
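The relations between these values can also be written down as code. Here is a short C++ sketch that computes on the CPU what the GPU provides to the shader, following the formulas from Microsoft's documentation of SV_DispatchThreadID and SV_GroupIndex. The Uint3 struct and the function names are just my illustration.

```cpp
#include <cassert>
#include <cstdint>

struct Uint3
{
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per component.
Uint3 DispatchThreadId(Uint3 groupId, Uint3 groupThreadId, Uint3 numThreads)
{
    assert(groupThreadId.x < numThreads.x &&
           groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);
    return {
        groupId.x * numThreads.x + groupThreadId.x,
        groupId.y * numThreads.y + groupThreadId.y,
        groupId.z * numThreads.z + groupThreadId.z };
}

// SV_GroupIndex = z * sizeX * sizeY + y * sizeX + x: the flattened index of
// a single thread within its group (not an index of the group).
std::uint32_t GroupIndex(Uint3 groupThreadId, Uint3 numThreads)
{
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x +
           groupThreadId.x;
}
```

For example, with [numthreads(10,8,3)], group (2,1,0), and thread (7,5,0) within the group, DispatchThreadId gives (27,13,0) and GroupIndex gives 57.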

#gpu #directx #opengl #vulkan

# AquaFish 2 - My Game From 2009

Thu 06 Aug 2020

I've made a short video showing a game I developed more than 10 years ago: AquaFish 2. It was my first commercial project, published by Play Publishing, developed using my custom engine The Final Quest.

#productions #history

# Why Not Use Heterogeneous Multi-GPU?

Wed 22 Jul 2020

There was an interesting discussion recently on one Slack channel about using an integrated GPU (iGPU) together with a discrete GPU (dGPU). Many good points were made there, so I think it's worth writing them down. But because I probably never blogged about multi-GPU before, a few words of introduction first:

The idea to use multiple GPUs in one program is not new, but not very widespread either. In old graphics APIs like Direct3D 11 it wasn't easy to implement. Doing it right in a complex game often involved engaging driver engineers from the GPU manufacturer (like AMD, NVIDIA) or using custom vendor extensions (like AMD GPU Services - see for example Explicit Crossfire API).

The new generation of graphics APIs – Direct3D 12 and Vulkan – is lower level and gives more direct access to the hardware. This includes the possibility to implement multi-GPU support on your own. There are two modes of operation. If the GPUs are identical (e.g. two graphics cards of the same model plugged into the motherboard), you can use them as one device object. In D3D12 you then index them as Node 0, Node 1, ... and specify the NodeMask bit mask when allocating GPU memory, submitting commands and doing all sorts of GPU things. Similarly, in Vulkan you have the VK_KHR_device_group extension available that allows you to create one logical device object that will use multiple physical devices.

But this post is about heterogeneous/asymmetric multi-GPU, where there are two different GPUs installed in the system, e.g. one integrated with the CPU and one discrete. A common example is a laptop with "switchable graphics", which may have an Intel CPU with their integrated “HD” graphics plus an NVIDIA GPU. There may even be two different GPUs from the same manufacturer! My new laptop (ASUS TUF Gaming FX505DY) has AMD Radeon Vega 8 + Radeon RX 560X. Another example is a desktop PC with CPU-integrated graphics and a discrete graphics card installed. Such a combination may still be used by a single app, but to do that, you must create and use two separate Device objects. But just because you could doesn't mean you should…

First question is: Are there games that support this technique? Probably few… There is just one example I heard of: Ashes of the Singularity by Oxide Games, and it was many years ago, when DX12 was still fresh. Other than that, there are mostly tech demos, e.g. "WITCH CHAPTER 0 [cry]" by Square Enix as described on DirectX Developer Blog (also 5 years old).

iGPU typically has lower computational power than dGPU. It could accelerate some pieces of computations needed each frame. One idea is to hand over the already rendered 3D scene to the iGPU so it can finish it with screen-space postprocessing effects and present it, which sounds even better if the display is connected to iGPU. Another option is to accelerate some computations, like occlusion culling, particles, or water simulation. There are some excellent learning materials about this technique. The best one I can think of is: Multi-Adapter with Integrated and Discrete GPUs by Allen Hux (Intel), GDC 2020.

However, there are many drawbacks of this technique, which were discussed in the Slack chat I mentioned:

  • It's difficult to implement multi-GPU support in general and to synchronize things properly.
  • iGPUs have greatly varying performance, from quite fast to very slow, so implementing it to always give a performance uplift is even harder.
  • Passing data back and forth between dGPU and iGPU involves multiple copies. The cost of it may be larger than the performance benefit of computing on iGPU.
  • The iGPU shares the same power and thermal limitations, memory bandwidth, and caches as the CPU, so they may slow each other down.
  • If you offload finishing the render frame (postprocessing and Present) to the iGPU, you may improve throughput a bit, but you increase latency a lot.
  • You need to support systems without an iGPU as well, so your testing matrix expands. (An interesting idea was posted that if it's a DirectX workload, you might fall back to the software emulated WARP device – it's quite efficient and good quality in terms of correctness and compliance with GPU-accelerated DX).
  • Finishing and presenting a frame on the iGPU sounds like a good idea if the display is connected to the iGPU, but it's not so certain. Multi-GPU laptops usually have the built-in display connected to the iGPU, but the external display output (e.g. HDMI) may be connected to the iGPU or to the dGPU (especially in "gaming laptops") – you never know.
  • Conscious gamers tend to update their graphics drivers for dGPU, but the driver for iGPU is often left in an ancient version, full of bugs.

Conclusion: Supporting heterogeneous multi-GPU in a game engine sounds like an interesting technical challenge, but better think twice before doing it in production code.

BTW If you want to use just one GPU and worry about the selection of the right one, see my old post: Switchable graphics versus D3D11 adapters.

#rendering #directx #vulkan #microsoft

# How to Disable Notification Sound in Messenger for Android?

Thu 09 Jul 2020

Applications and websites fight for our attention. We want to stay connected and informed, but too many interruptions are not good for our productivity or mental health. Different applications have different settings dedicated to silencing notifications. I recently bought a new smartphone and so I needed to install and configure all the apps (which is a big task these days, same way as it always used to be with Windows PC after "format C:" and system reinstall).

Facebook Messenger for Android offers an on/off setting for all the notifications, and a choice of the sound of a notification and an incoming call. Unfortunately, it doesn't offer an option to silence the sound. You can only either choose among several different sound effects or disable all notifications of the app entirely. What if you want to keep notifications active, so they appear in the Android drawer, use vibration, get sent to a smart band, and you can still hear incoming calls ringing, but you just want to mute the sound of incoming messages?

Here is the solution I found. It turns out you can upload a custom sound file to your smartphone and use it. For that I generated a small WAV file - 0.1 seconds of total silence. 1) You can download it from here:

Silence_100ms.wav (8.65 KB)

2) Now you need to put it into a specific directory in the memory of your smartphone, called "Notifications". To do this, you need to use an app that allows you to freely manipulate files and directories, as opposed to just looking for specific content as image or music players do. If you downloaded the file directly to your smartphone, use the free Total Commander to move this file to the "Notifications" directory. If you have it on your PC, MyPhoneExplorer will be a good app to connect to your phone using a USB cable or WiFi network and transfer the file.

3) Finally, you need to select the file in Messenger. To do this, go to its settings > Notifications & Sounds > Notification Sound. The new file "Silence_100ms" should appear mixed with the list of default sound effects. After choosing it, your message notifications in Messenger will be silent.

[Image: Facebook Messenger Android Notification Sound Silence]

There is one downside of this method. While not audible, the sound is still playing on every incoming message, so if you listen to music e.g. using Spotify, the music will fade out for a second every time the sound is played.
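If you would rather generate such a file yourself than download it, here is a minimal C++ sketch of how a silent WAV could be built in memory. The parameters are my assumptions for this example (16-bit mono PCM, with sample rate and duration passed in), not a description of how the file above was made, and MakeSilentWav is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the lowest `bytes` bytes of an integer in little-endian order.
static void AppendLE(std::vector<std::uint8_t>& out, std::uint32_t value, int bytes)
{
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
}

// Appends a 4-character chunk tag like "RIFF".
static void AppendTag(std::vector<std::uint8_t>& out, const char* tag)
{
    out.insert(out.end(), tag, tag + 4);
}

// Hypothetical helper: builds a complete 16-bit mono PCM WAV file with all samples zero.
std::vector<std::uint8_t> MakeSilentWav(std::uint32_t sampleRate, std::uint32_t durationMs)
{
    assert(sampleRate > 0);
    const std::uint32_t bytesPerSample = 2;
    const std::uint32_t sampleCount = sampleRate * durationMs / 1000;
    const std::uint32_t dataSize = sampleCount * bytesPerSample;

    std::vector<std::uint8_t> wav;
    AppendTag(wav, "RIFF");
    AppendLE(wav, 36 + dataSize, 4);               // Size of everything after this field.
    AppendTag(wav, "WAVE");
    AppendTag(wav, "fmt ");
    AppendLE(wav, 16, 4);                          // Size of the format chunk.
    AppendLE(wav, 1, 2);                           // Format: PCM.
    AppendLE(wav, 1, 2);                           // Channels: mono.
    AppendLE(wav, sampleRate, 4);
    AppendLE(wav, sampleRate * bytesPerSample, 4); // Bytes per second.
    AppendLE(wav, bytesPerSample, 2);              // Bytes per sample frame.
    AppendLE(wav, 16, 2);                          // Bits per sample.
    AppendTag(wav, "data");
    AppendLE(wav, dataSize, 4);
    wav.resize(wav.size() + dataSize, 0);          // Zero samples mean silence in 16-bit PCM.
    return wav;
}
```

Calling MakeSilentWav(44100, 100) gives 0.1 seconds of silence in 8864 bytes (a 44-byte header plus 4410 samples of 2 bytes each), which can then be written to a .wav file in binary mode, e.g. using std::ofstream.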

#android #mobile

Copyright © 2004-2020