Tag: gpu

Entries for tag "gpu", ordered from most recent. Entry count: 24.


# D3d12info - Printing D3D12 GPU Information to Console

Wed 27 Jul 2022

My next little hobby project is D3d12info. It is a Windows console program that prints all the information it can get about the current GPU installed in the system, as seen through the Direct3D 12 API. It also fetches additional information through AMD GPU Services (on AMD cards), NVAPI (on NVIDIA cards), Vulkan, and WinAPI, mostly to identify the current version of the graphics driver and of Windows. I will try to keep it updated to the latest Agility SDK, so it can query support for the newest hardware features of graphics cards.

I share it under the open-source MIT license. You can see the full source code in the GitHub repository and download a compiled binary from the Releases tab.

In terms of the information extracted from DX12, the tool can be compared to the DirectX Caps Viewer that ships with the Windows SDK under the path "c:\Program Files (x86)\Windows Kits\10\bin\*\x64\dxcapsviewer.exe". However, instead of a GUI, it provides a command-line interface, which makes it similar to the "vulkaninfo" tool. Information is printed in a human-readable text format by default, but JSON format can be selected with the -j parameter, making the output suitable for automated processing. Additional command-line parameters are supported, including selecting a specific GPU when more than one is installed in the system. Launch it with the -h parameter to see the command-line syntax.

In the future, I would like to extend it with a web back-end that would gather a database of various GPUs and driver versions, like the Vulkan Hardware Database does for Vulkan, and make it browsable online. As far as I know, there is no such database for D3D12 at the moment. The best we have right now are the tables about Direct3D Feature Levels on Wikipedia. But that will require a lot of learning on my part, as I am not a good web developer, so I will think about it after my vacation :)

#productions #tools #directx #gpu

# A Metric for Memory Fragmentation

Wed 06 Apr 2022

In this article, I would like to discuss the problem of memory fragmentation and propose a formula for a metric that tells how badly the memory is fragmented.

Problem statement

The problem can be stated like this:

So it is a standard memory allocation situation. Now, I will explain what I mean by fragmentation. Fragmentation, for this article, is an unwanted situation where free memory is spread across many small regions in between allocations, as opposed to a single large one. We want to measure it and preferably avoid it because:

A solution to this problem is to perform defragmentation - an operation that moves allocations around to arrange them next to each other. This may require user involvement, as pointers to the allocations change in the process. It may also be time-consuming to calculate better places for the allocations and then copy all their data. It is thus desirable to measure fragmentation so we can decide when to perform the defragmentation operation.
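
The full entry derives the exact formula. As a sketch of the kind of metric involved (an assumed form, not necessarily the one the article settles on), with s_i denoting the sizes of the individual free regions:

\[
F = 1 - \frac{\sum_i s_i^2}{\left(\sum_i s_i\right)^2}
\]

A single free region gives F = 0, while n equally-sized free regions give F = 1 - 1/n, so a metric of this shape approaches 1 as the free space shatters into many small pieces.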

Read full entry > | #gpu #algorithms #optimization

# Vulkan Memory Allocator 3.0.0 and D3D12 Memory Allocator 2.0.0

Sat 26 Mar 2022

Yesterday we released new major versions of Vulkan Memory Allocator 3.0.0 and D3D12 Memory Allocator 2.0.0, so if you are coding with Vulkan or Direct3D 12, I recommend taking a look at these libraries. Because developing them is part of my job, I won't describe them in detail here, but will just refer you to my article published on GPUOpen.com: "Announcing Vulkan Memory Allocator 3.0.0 and Direct3D 12 Memory Allocator 2.0.0". Direct links:

Vulkan Memory Allocator

D3D12 Memory Allocator

#rendering #directx #vulkan #gpu #libraries #productions

# First Look at New D3D12 Enhanced Barriers

Thu 09 Dec 2021

This will be a pretty advanced, or at least intermediate, article. It assumes you know the Direct3D 12 API. Some references to Vulkan may also appear. I am writing it because I just found out that yesterday Microsoft announced an upcoming big change in D3D12: Enhanced Barriers. It is an addition to the API that provides a new way to do barriers. Considering my professional interests, this looks very important to me and also quite revolutionary. This article summarizes my first look at and my thoughts about this new addition to the API or, speaking in terms of the modern internet, my "unboxing" or "reaction" ;)

Bill Kristiansen, the author of the article linked above, writes that currently only the software-simulated WARP device supports the new enhanced barriers. Support in real GPU drivers will come at a later time. The new barriers can replace the old way of doing them, but both will still be available and can even be mixed in one application. This means it is not so big a revolution as to turn our DirectX development upside down - we can switch to the new barriers gradually. For now, we can prepare for the future by studying the interface (which I do in this article) and testing some code using the WARP device.

UPDATE 2021-12-10: I just learned that Microsoft has actually published documentation of the new API: Enhanced Barriers @ DirectX-Specs, so I recommend looking at it before reading this article.

Read full entry > | #directx #vulkan #gpu

# Creative Use of GPU Fixed-Function Hardware

Wed 22 Sep 2021

I recently broke my rule of posting on my blog at least once a month, as I had some other topics and problems to handle in my life, but I'm still alive, still doing graphics programming for a living, and I hope to get back to blogging now. This post is more of a question than an answer. It is about creative use of GPU fixed-function hardware. Warning: it may be pretty difficult for beginners, full of graphics programming terms you should already know in order to understand it. But first, here is some background:

I remember the times when graphics cards were only configurable, not programmable. There were no shaders, only a set of parameters that could control pre-defined operations - transformation of vertices, texturing and lighting of pixels. Then, shaders appeared. They evolved to support longer programs and a wider variety of instructions. At some point, even before the invention of compute shaders, the term "general-purpose computing on GPU" (GPGPU) appeared. Developers started encoding data as RGBA colors of texture pixels and drawing full-screen quads just to launch calculations of non-graphical tasks, implemented as pixel shaders. Soon after, compute shaders appeared, so developers no longer needed to pretend anything - they could now spawn a set of threads that read and write memory freely through Direct3D unordered access views, aka Vulkan storage images and buffers.

GPUs seem to become more universal over time, with more and more workloads done as compute shaders these days. Will we end up with generic, highly parallel compute machines with no fixed-function hardware? I don't know. But Nanite, the new technology in Unreal Engine 5, takes a step in this direction by implementing its own rasterizer for some of its triangles, in the form of a compute shader. I recommend a good article about it: "A Macro View of Nanite – The Code Corsair" (it seems the link is broken already - here is a copy on the Wayback Machine Internet Archive). Apparently, for tiny triangles of around one pixel in size, custom rasterization is faster than what GPUs provide by default.

But in the same article we can read that Epic also does the opposite in Nanite: they use some fixed-function parts of the graphics pipeline very creatively. When applying materials in screen space, they render a full-screen pass per material, but instead of drawing just a full-screen triangle, they draw a regular grid of quads covering tiles of NxN pixels. They then perform coarse-grained culling of these tiles in a vertex shader. To reject a tile, they output vertex position = NaN, which makes the triangle invalid so it spawns no pixels. Then, more fine-grained culling is performed using the Z-test: the per-pixel material identifier is encoded as depth in a depth buffer! This can be fast, as modern GPUs apply "HiZ" - an internal optimization that rejects whole groups of pixels failing the Z-test even before their pixel shaders are launched.
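
Here is a minimal HLSL sketch of the NaN trick as I imagine it could look - the tile-visibility buffer and all the names are my assumptions, not Epic's actual code:

cbuffer TileCB : register(b0)
{
    uint2 tileCount; // number of NxN-pixel tiles in X and Y
};
// Assumption: nonzero means the current material is present in the tile.
StructuredBuffer<uint> tileVisible : register(t0);

float4 TileCullVS(uint vertexId : SV_VertexID) : SV_Position
{
    // 6 vertices = 2 triangles per tile, drawn as a regular grid.
    uint tileIndex = vertexId / 6;
    uint2 tile = uint2(tileIndex % tileCount.x, tileIndex / tileCount.x);
    const uint2 corners[6] = { uint2(0,0), uint2(1,0), uint2(0,1),
                               uint2(1,0), uint2(1,1), uint2(0,1) };
    float2 uv = float2(tile + corners[vertexId % 6]) / float2(tileCount);
    float4 pos = float4(uv.x * 2 - 1, 1 - uv.y * 2, 0, 1);
    if(tileVisible[tileIndex] == 0)
    {
        // A NaN position makes the triangle invalid, so the rasterizer
        // culls it and spawns no pixels.
        float nan = asfloat(0x7FC00000u);
        pos = float4(nan, nan, nan, nan);
    }
    return pos;
}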

This reminded me of another creative use of the graphics pipeline that I observed in a game a few years ago. That pass calculated the luminance histogram of the scene. It also rendered a regular grid of geometry in screen space, but with "point list" topology. Each vertex sampled and calculated the average luminance of its region. On the other end, a histogram texture of Nx1 pixels was bound as the render target. The measured luminance of a region determined the output vertex position, while incrementing the right place in the histogram was ensured using additive blending. I suspect this is not the most optimal way of doing it - a compute shader using atomics could probably do it faster - but it surely was very creative, and it took me some time to figure out what that pass was really doing and how it was doing it.
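
A sketch of how I imagine that pass in HLSL (again a reconstruction with assumed names, not the game's actual shaders); points are rendered to an Nx1 render target with ONE/ONE additive blending, so each point increments one histogram bin:

Texture2D<float4> sceneColor : register(t0);
SamplerState linearClamp : register(s0);
cbuffer HistogramCB : register(b0)
{
    uint2 regionCount; // the screen is divided into this many regions
};

void HistogramVS(uint vertexId : SV_VertexID,
                 out float4 pos : SV_Position,
                 out float4 color : COLOR0)
{
    uint2 region = uint2(vertexId % regionCount.x, vertexId / regionCount.x);
    float2 uv = (float2(region) + 0.5) / float2(regionCount);
    // One sample per region for brevity; the real shader averaged its region.
    float3 c = sceneColor.SampleLevel(linearClamp, uv, 0).rgb;
    float luma = dot(c, float3(0.2126, 0.7152, 0.0722));
    // The measured luminance selects the X coordinate = the histogram bin to hit.
    pos = float4(saturate(luma) * 2 - 1, 0, 0, 1);
    color = float4(1, 0, 0, 1); // additive blending accumulates the count
}

float4 HistogramPS(float4 color : COLOR0) : SV_Target
{
    return color;
}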

After all, GPUs have many fixed-function elements next to their shader cores: vertex fetch, texture sampling (with mip level calculation, trilinear and anisotropic filtering), tessellation, rasterization, blending, all kinds of primitive culling and pixel testing, even the vertex homogeneous divide... Although not included in the calculation of TFLOPS power, these are real transistors with compute capabilities, just very specialized ones. Do you know any other smart, creative uses of them?

#rendering #optimization #gpu

# A Better Way to Scalarize a Shader

Tue 20 Oct 2020

This will be an advanced article. It assumes you not only know how to write shaders but also understand how they work on a low level (like vector versus scalar registers) and how to optimize them using scalarization. It all starts from the need to index into an array of texture or buffer descriptors, where the index is dynamic - it may vary from pixel to pixel. This is useful e.g. when doing bindless-style rendering or when blending various layers of textures, e.g. on a terrain. To make it work properly in an HLSL shader, you need to wrap the indexing operation in the pseudo-function NonUniformResourceIndex. See also my old blog post "Direct3D 12 - Watch out for non-uniform resource index!".

Texture2D g_Textures[] : register(t1);
...
return g_Textures[NonUniformResourceIndex(textureIndex)].Load(pos);

In many cases this is enough. The driver will do its magic to make things work properly. But if your logic dependent on textureIndex is more complex than a single Load or SampleGrad, e.g. you sample multiple textures or do some calculations (let's call it MyDynamicTextureIndexing), then it might be beneficial to scalarize the shader manually using a loop and the wave functions from HLSL Shader Model 6.0.

I learned how to do scalarization from the 2-part article “Intro to GPU Scalarization” by Francesco Cifariello Ciardi and the presentation “Improved Culling for Tiled and Clustered Rendering” by Michał Drobot, linked from it. Both sources propose an implementation like the following HLSL snippet:

// WORKING, TRADITIONAL
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
while(any(currThreadMask & activeThreadsMask) != 0)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   uint2 scalarTextureIndexThreadMask = WaveActiveBallot(scalarTextureIndex == textureIndex).xy;
   activeThreadsMask &= ~scalarTextureIndexThreadMask;
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
   }
}
return color;

It involves a bit mask of active threads. From the moment I first saw this code, I started wondering: why is it needed? A mask of the threads that still want to keep spinning the loop is already maintained implicitly by the shader compiler. Couldn't we just break; out of the loop when done with the textureIndex of the current thread?! So I wrote this short piece of code:

// BAD, CRASHES
float4 color = float4(0.0, 0.0, 0.0, 0.0);
while(true)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
       break;
   }
}
return color;

…and it crashed my GPU. At first I thought it might be a bug in the shader compiler, but then I recalled footnote [2] in part 2 of the scalarization tutorial, which mentions an issue with helper lanes. Let me elaborate on this. When a shader is executed in SIMT fashion, individual threads (lanes) may be active or inactive. Active lanes are the ones that do their job. Inactive lanes may be inactive from the very beginning, because we are at the edge of a triangle and there are not enough pixels to make use of all the lanes, or they may be disabled temporarily, because e.g. we are executing an if section that some threads didn't want to enter. But in pixel shaders there is a third kind of lane - helper lanes. These are used instead of inactive lanes to make sure full 2x2 quads always execute the code, which is needed to calculate the ddx/ddy derivatives - something also done implicitly when sampling a texture to calculate the correct mip level. A helper lane executes the code (like an active lane) but doesn't export its result to the render target (like an inactive lane).

As it turns out, helper lanes also don't contribute to wave functions - there they behave like inactive lanes. Can you already see the problem here? In the loop shown above, it may happen that a helper lane has a textureIndex different from all the active lanes within its wave. It will then never get its turn to be processed in a scalar fashion, so it will fall into an infinite loop, causing a GPU crash (TDR)!

Then I thought: what if I disable helper lanes just once, before the whole loop? So I came up with the following shader. It seems to work fine. I also think it is better than the first solution, as it operates on the thread bit mask only once at the beginning, so it needs fewer variables to be stored in GPU registers and does fewer calculations in every loop iteration. Now I'm wondering whether there is something wrong with my idea that I just can't see. Or did I just invent a better way to scalarize shaders?

// WORKING, NEW
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
[branch]
if(any((currThreadMask & activeThreadsMask) != 0))
{
   while(true)
   {
       uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
       [branch]
       if(scalarTextureIndex == textureIndex)
       {
           color = MyDynamicTextureIndexing(textureIndex);
           break;
       }
   }
}
return color;

UPDATE 2020-10-28: There are some valuable comments under my tweet about this topic that I recommend checking out.

UPDATE 2021-12-03: As my colleague pointed out today, the code I showed above as "BAD" is perfectly fine for compute shaders. Only in pixel shaders do we have problems with helper lanes. Thank you, Steffen!

#directx #optimization #gpu

# Which Values Are Scalar in a Shader?

Wed 14 Oct 2020

GPUs are highly parallel processors. Within one draw call or one compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have a constant value across all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value, and so is a variable read from a constant (uniform) buffer - let's call it cbScaleFactor - or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.

Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.

But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA cores, however we call them) execute a number of threads at once in a SIMD fashion - or, more correctly, SIMT. For an explanation of the difference, see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, or 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.

Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow exchanging data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may help optimize shader performance.

This additional level of grouping yields the possibility for a variable to be or not be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call such a value scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architectures, which is overall good for performance. The simple cases mentioned above still apply: what’s constant across the entire draw call is also scalar within a wave, while what varies from thread to thread is not scalar. Some wave functions, like WaveReadLaneFirst, WaveActiveMax, and WaveActiveAllTrue, return the same value for all threads, so their result is always scalar.
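
A quick illustration of these cases in HLSL (the names here are mine, just for the example):

cbuffer Constants : register(b0)
{
    float2 cbScaleFactor;
};

float4 PSMain(float4 svPos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float a = (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0; // constant -> scalar
    float b = uv.x * a;             // varies per pixel -> not scalar
    float c = WaveActiveMax(b);     // same for all lanes in the wave -> scalar
    float d = WaveReadLaneFirst(b); // also scalar across the wave
    return float4(a, b, c, d);
}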

Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing a buffer or texture array requires the special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on the shader compiler at Intel, helping to finish the DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.

Another such case is the function WaveReadLaneAt, which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was originally required to be scalar, but developers discovered that it actually works fine with a dynamically varying value, as Ken Hu describes in his blog post “HLSL pitfalls”. Microsoft has now formally admitted that it works and allowed LaneIndex to be any value by making this GitHub commit to their documentation.

If it is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky cases.

SV_GroupID in a compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search the specifications for this topic, but it seems obvious: since groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.

SV_InstanceID in a vertex shader – index of the instance within an instanced draw call. It looks similar, but the answer is actually the opposite. I’ve seen discussions about this many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, this gives a graphics driver the opportunity to increase utilization by packing vertices from multiple instances into one wave.

SV_GroupThreadID.xyz in a compute shader – identifier of the thread within a thread group in each dimension. The article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. This may be true for AMD architecture and other GPUs currently on the market, so relying on it may be a good optimization opportunity on consoles with known, fixed hardware, but it is not formally correct to assume it on PC. Neither the D3D nor the Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so a driver could theoretically decide in a new version to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU won’t appear in the future that is 128 lanes wide - the documentation of the WaveGetLaneCount function says “the result will be between 4 and 128”. Such a GPU would execute an entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.

Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.

#gpu #directx #vulkan #optimization

# System Value Semantics in Compute Shaders - Cheat Sheet

Tue 29 Sep 2020

After compute shaders appeared, programmers no longer need to pretend they are doing graphics and rendering pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such a shader is a short (or not so short) program invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which invocation it is. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operating on data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.

This raises the question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose, and GLSL defines equivalent built-in variables. For a long time I couldn't remember their names, as the ones in HLSL are quite misleading. If SV_GroupID is the ID of the entire group, and SV_GroupThreadID is the ID of the thread within a group, then SV_GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually the index of a single thread within a group. GLSL is more consistent in this regard, clearly distinguishing "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of these SVs, with a picture, on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in the form of a table - a small cheat sheet:

| HLSL Semantics | GLSL Variable | Type (Dimension) | Unit | Reference |
|---|---|---|---|---|
| SV_GroupID | gl_WorkGroupID | uint3 (3D) | Entire group | Global in dispatch |
| SV_GroupThreadID | gl_LocalInvocationID | uint3 (3D) | Single thread | Local in group |
| SV_DispatchThreadID | gl_GlobalInvocationID | uint3 (3D) | Single thread | Global in dispatch |
| SV_GroupIndex | gl_LocalInvocationIndex | uint (flattened) | Single thread | Local in group |
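
To make the relations between them concrete, here is a minimal HLSL example of my own, using an assumed group size of 8x4x1:

[numthreads(8, 4, 1)]
void CSMain(uint3 groupId : SV_GroupID,
            uint3 groupThreadId : SV_GroupThreadID,
            uint3 dispatchThreadId : SV_DispatchThreadID,
            uint groupIndex : SV_GroupIndex)
{
    // Global thread ID = group ID * group size + local thread ID:
    // dispatchThreadId == groupId * uint3(8, 4, 1) + groupThreadId
    //
    // Flattened local index, row-major within the group:
    // groupIndex == groupThreadId.z * (8 * 4)
    //             + groupThreadId.y * 8
    //             + groupThreadId.x
}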

#gpu #directx #opengl #vulkan

