Entries for tag "vulkan", ordered from most recent. Entry count: 36.
# Doing dynamic resolution scaling? Watch out for texture memory size!
This article is intended for graphics programmers, mostly those who use Direct3D 12 or Vulkan and implement dynamic resolution scaling. Before we go to the main topic, some introduction first…
Nowadays, more and more games offer some kind of resolution scaling. It means rendering the 3D scene in a resolution lower than the display resolution and then upscaling it using some advanced shader, often combined with temporal antialiasing and sharpening. It may be one of the solutions provided by GPU vendors (FSR from AMD, XeSS from Intel, DLSS from NVIDIA) or a custom solution (like TSR in Unreal Engine). It is an attractive option for gamers to have a good FPS increase with only minor image quality degradation. It is becoming more important as monitor resolutions increase to 4K or even more, high-end graphics cards are still expensive, and advanced rendering techniques like ray tracing encourage favoring “better pixels” over “more pixels”. See also my old article: “Scaling is everywhere, pixel-perfect is the past”.
Dynamic resolution scaling is an extension to this idea that allows rendering each frame in a different resolution, lower or higher, as a trade-off between quality and performance, to maintain desired framerate even in more complex scenes with many objects, characters, and particle effects visible on the screen. If you are interested in this technique, I strongly recommend checking a recent article from Martin Fuller from Microsoft: “Dynamic Resolution Scaling (DRS) Implementation Best Practice”, which provides many practical implementation tips.
One of the topics we need to handle when implementing dynamic resolution scaling is the creation and usage of textures that need different resolution every frame, especially render target, depth-stencil, and UAV, used temporarily between render passes. One solution could be to create these textures in the maximum resolution and use only part of them when necessary using a limited viewport. However, Martin gives multiple reasons why this option may cause some problems. A simpler and safer solution is to create a separate texture for each possible resolution, with a certain step. In modern graphics APIs (Direct3D 12 and Vulkan) they can be placed in the same memory, which we call memory aliasing.
Here comes the main question I want to answer in this article: What size of the memory heap should we use when allocating memory for these textures? Can we just take the maximum dimensions of a texture (e.g. 4K resolution: 3840 x 2160), call device->GetResourceAllocationInfo(), inspect the returned D3D12_RESOURCE_ALLOCATION_INFO::SizeInBytes and use it as D3D12_HEAP_DESC::SizeInBytes? A texture with fewer pixels should always require less memory, right?
WRONG! Direct3D 12 doesn’t define such a requirement, and graphics drivers from some GPU vendors really do return a smaller required size for a texture with larger dimensions, for some specific dimensions and pixel formats. For example, on AMD Radeon RX 7900 XTX, a render target with format
Why does this happen? It is because textures are not necessarily stored in the GPU memory in the way we imagine them: pixel-after-pixel, in row-major order. They often use some optimization techniques like pixel swizzling or compression. By “compression”, I don’t mean texture formats like BC or ASTC, which we must use explicitly. I also don’t mean compression like in the ZIP file format or the zlib/deflate algorithm, which decrease data size. Quite the opposite: this kind of compression increases texture size by adding extra metadata, which allows speeding things up by saving memory bandwidth in certain cases. This is done mostly on render target and depth-stencil textures. For more information about it, see my old article: “Texture Compression: What Can It Mean?”. I’m talking about the meaning of the word “compression” number 4 from that article – compression formats that are internal, specific to certain graphics cards, and opaque for us – programmers who just use the graphics API. The problem is that a specific compression format for a texture is selected by the driver based on various heuristics (like render target / depth-stencil / UAV / other flags, pixel format, and… dimensions). This is why a texture with larger dimensions may unexpectedly require less memory.
To research this problem in detail, I wrote a small testing program and ran tests on graphics cards from various vendors. It was a modification of my small Windows console app D3d12info that goes through the list of all DXGI_FORMAT enum values and calls CheckFeatureSupport to check which ones are supported as a render target or depth-stencil. For those that are, I called GetResourceAllocationInfo to get memory requirements for a texture with this pixel format, with increasing dimensions, where height goes from 32 to 2160 with a step of 8, and width is calculated using a formula for 16:9 aspect ratio: width = height * 16 / 9.
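The sweep over resolutions described above could be generated roughly as follows. This is only a sketch, not the actual test program: the `Resolution` struct and the `MakeTestResolutions` name are made up for illustration, and integer division is an assumption (it is consistent with sizes like 270x152 and 526x296). In a real test, each resulting resolution would be passed to GetResourceAllocationInfo.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type, not part of D3D12.
struct Resolution {
    uint32_t width;
    uint32_t height;
};

// Heights go from minHeight to maxHeight (inclusive) with the given step.
// Width follows the 16:9 formula from the text, using integer division.
std::vector<Resolution> MakeTestResolutions(uint32_t minHeight = 32,
                                            uint32_t maxHeight = 2160,
                                            uint32_t step = 8) {
    assert(step > 0 && minHeight <= maxHeight);
    std::vector<Resolution> result;
    for (uint32_t h = minHeight; h <= maxHeight; h += step) {
        result.push_back({h * 16 / 9, h});
    }
    return result;
}
```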
Here are the results. Please remember these are just 3 specific graphics cards. The results may be different on a different GPU and even with a different version of the graphics driver.
On NVIDIA GeForce RTX 3080 with driver 545.84, I found no cases where a texture with larger dimensions requires less memory, so NVIDIA (or at least this specific card) is not affected by the problem described in this article.
On AMD Radeon RX 7900 XTX with driver 23.9.3, I found the following data points where memory requirements are non-monotonic – one for each of the following formats:
DXGI_FORMAT_R16G16B16A16_FLOAT/UNORM/UINT/SNORM/SINT: 256x144 = 458,752 B, 270x152 = 393,216 B
DXGI_FORMAT_R32G32_FLOAT/UINT/SINT: 256x144 = 458,752 B, 270x152 = 393,216 B
DXGI_FORMAT_R8G8_UNORM/UINT/SNORM/SINT: 512x288 = 458,752 B, 526x296 = 393,216 B
DXGI_FORMAT_R16_FLOAT/UNORM/UINT/SNORM/SINT: 512x288 = 458,752 B, 526x296 = 393,216 B
DXGI_FORMAT_R8_UNORM/UINT/SNORM/SINT: 256x144 = 131,072 B, 270x152 = 65,536 B
DXGI_FORMAT_A8_UNORM: 256x144 = 131,072 B, 270x152 = 65,536 B
DXGI_FORMAT_B5G6R5_UNORM: 512x288 = 458,752 B, 526x296 = 393,216 B
DXGI_FORMAT_B5G5R5A1_UNORM: 512x288 = 458,752 B, 526x296 = 393,216 B
DXGI_FORMAT_B4G4R4A4_UNORM: 512x288 = 458,752 B, 526x296 = 393,216 B
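Finding such non-monotonic points in a list of measurements can be sketched as below. The `Sample` struct and `FindSizeDrops` function are hypothetical names introduced only for this illustration. The sketch assumes the samples are sorted by increasing dimensions and that the sizes were already obtained from the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one measurement.
struct Sample {
    uint32_t width;
    uint32_t height;
    uint64_t sizeInBytes;  // size reported by the driver for this resolution
};

// Returns indices i where samples[i] has larger dimensions than
// samples[i - 1] but a smaller reported size.
std::vector<size_t> FindSizeDrops(const std::vector<Sample>& samples) {
    std::vector<size_t> drops;
    for (size_t i = 1; i < samples.size(); ++i) {
        // Input must be sorted by increasing height.
        assert(samples[i].height > samples[i - 1].height);
        if (samples[i].sizeInBytes < samples[i - 1].sizeInBytes) {
            drops.push_back(i);
        }
    }
    return drops;
}
```

For the pair 256x144 = 458,752 B and 270x152 = 393,216 B listed above, the function reports the second sample as a drop.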
On Intel Arc A770, with driver 22.214.171.12487, almost every format used as a render target (but none of depth-stencil formats) has multiple steps where the size decreases, and it has them at larger dimensions than AMD. For example, the most “traditional” one –
What to do with this knowledge? The conclusion is that if we implement dynamic resolution scaling and we want to create textures with different dimensions aliasing in memory, the required size of this memory is not necessarily the size of the largest texture in terms of dimensions. To be safe, we should query for memory requirements of all texture sizes we may want to use and calculate their maximum. In practice, it should be enough to query resolutions starting from e.g. 75% of the maximum. Because tested GPUs always have only a single step down, an even more efficient, but not fully future-proof solution could be to start from the full resolution, go down until we find a different memory size (no matter if higher or lower), and take the maximum of these two.
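Both strategies can be sketched in a few lines. This is a simplified illustration under stated assumptions: `SizeQuery` is a hypothetical stand-in for a call like GetResourceAllocationInfo, `Resolution`, `HeapSizeSafe`, and `HeapSizeShortcut` are made-up names, and the resolution list is assumed to be sorted ascending with the full resolution last.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Resolution {
    uint32_t width;
    uint32_t height;
};

// Hypothetical stand-in for the driver query of the required texture size.
using SizeQuery = std::function<uint64_t(const Resolution&)>;

// Safe variant: maximum of the required sizes over all resolutions.
uint64_t HeapSizeSafe(const std::vector<Resolution>& resolutions,
                      const SizeQuery& query) {
    assert(!resolutions.empty());
    uint64_t maxSize = 0;
    for (const Resolution& r : resolutions) {
        maxSize = std::max(maxSize, query(r));
    }
    return maxSize;
}

// Shortcut variant (not future-proof): start from the full resolution,
// go down until the size changes, and take the maximum of the two sizes.
uint64_t HeapSizeShortcut(const std::vector<Resolution>& resolutions,
                          const SizeQuery& query) {
    assert(!resolutions.empty());
    const uint64_t fullSize = query(resolutions.back());
    for (size_t i = resolutions.size() - 1; i-- > 0;) {
        const uint64_t size = query(resolutions[i]);
        if (size != fullSize) {
            return std::max(fullSize, size);
        }
    }
    return fullSize;
}
```

The safe variant costs one query per resolution, which is typically done once at startup, so the shortcut is worth it only when the number of queries really matters.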
So far, I focused only on DirectX 12. Is Vulkan also affected by this problem? In the past, it could be. Vulkan has a similar concept of querying for memory requirements of a texture using the function vkGetImageMemoryRequirements. It used to have an even bigger problem. To understand it, we must recall that in D3D12, we query for memory requirements (size and alignment) given the structure D3D12_RESOURCE_DESC, which describes parameters of a texture to be created. In (the initial) Vulkan API, on the other hand, we need to first create the actual VkImage object, and then query for its memory requirements. The question is: Given two textures created with exactly the same parameters (width, height, pixel format, number of mip levels, flags, etc.), do they always return the same memory requirements?
In the past, it wasn’t required by the Vulkan specification and I saw some drivers for some GPUs that really returned different sizes for two identical textures! It could cause problems, e.g. when defragmenting video memory in Vulkan Memory Allocator library. Was it a bug, or another internal optimization done by the driver, e.g. to avoid some memory bank conflicts? I don’t know. The good news is that since then, the Vulkan specification was clarified to require that functions like vkGetImageMemoryRequirements always return the same size and alignment for images created with the same parameters, and new drivers comply with that, so the problem is gone now. Vulkan 1.3 also got a new function vkGetDeviceImageMemoryRequirements that takes VkImageCreateInfo with image creation parameters instead of an already created image object, just like D3D12 has done from the beginning.
Going back to the main question of this article: When VK_KHR_maintenance4 extension is enabled (which has been promoted to core Vulkan 1.3), the problem does not occur, as Vulkan specification says: "For a VkImage, the size memory requirement is never greater than that of another VkImage created with a greater or equal value in each of extent.width, extent.height, and extent.depth; all other creation parameters being identical.", and the same for buffers.
Big thanks to my friends: Bartek Boczula for discussions about this topic and inspiration to write this article, as well as Szymon Nowacki for testing on the Intel card! Also thanks to Constantine Shablia from Collabora for pointing me to the answer on Vulkan.
# Impressions from Vulkanised 2023 Conference
Last week I attended Vulkanised conference. It is an official conference of Vulkan API. It took place 7-9 February 2023 in Munich, Germany. It was my first time at this conference. My attendance was part of my job at AMD and I co-presented with Valve about using Radeon Developer Tools on RADV (Linux AMD driver) and Steam Deck. Here, on my blog, I would like to share my personal impressions from the event.
Overall, it was well organized. There were over 200 attendees, 3 days full of talks, most of them short (20-30 minutes, some of them even 10 minutes!), happening on just one stage (apart from a full-day Vulkan tutorial for beginners, happening on the first day in parallel with normal talks), with a lunch break and coffee breaks in between, so everyone could see everything without a need to choose from the timetable which talks to attend. It was intense. Every evening we went for some good food and beer, which I enjoy a lot every time I visit Munich/Bavaria/Germany.
In terms of people attending, a conference like this differs completely from game developer conferences that I usually attend. On one hand, everyone there was a programmer who knows and uses Vulkan, so everyone was on the same page. At gamedev conferences, there are people from different fields, as game development is multidisciplinary - graphics and music artists, designers, programmers, business people etc. On the other hand, there were not so many people from the game industry there, and those who came were mostly from the world of mobile GPUs, not PC or console. It was interesting to talk with developers from various industries, using GPUs and Vulkan for different applications, like scientific computations and visualizations or even… software for cloth design for the fashion business.
There were many interesting talks. I think the most valuable ones were about components of the Vulkan ecosystem that are useful to every developer, like Vulkan validation layers, VkConfigurator, Vulkan loader, or GFXReconstruct (which also added support for Direct3D 12 recently, by the way!). There were long and extensive talks teaching two recent big additions to the API: mesh shaders and Vulkan Video. Vulkan Video seems to be especially complicated, partially because it requires some knowledge of video encoding/decoding, which is something different from 3D rendering. I used to work for television, so it was not that obscure for me. But this new part of the API is also very low level. The decision to make encoding/decoding of every frame stateless, with all the state of the video stream managed by the user, makes the API surface very extensive.
Talk about Diligent Engine was interesting. I didn’t look at the project itself, but the presentation looked convincing that this is a good multi-platform 3D graphics library implemented on top of various graphics APIs. Another interesting project presentation was about VkFFT - a C library that calculates FFT on the GPU using one of many supported APIs (not only Vulkan) with state-of-the-art performance. It is implemented by assembling a string with the source code of a kernel optimized for a specific case.
Presentations about game optimization for mobile GPUs were very interesting to me. Optimizing games is what I do in my everyday job, although I work with “large” PC GPUs. I consider such talks with a collection of tips and recommendations exceptionally valuable. From these presentations, I could learn what things work fast on smartphone and tablet chips, which are different from PC and console chips. They said that on these platforms, energy consumption and bandwidth to and from memory are the most important. Because mobile GPUs are tile-based, a large number of vertices or a fat vertex format is very slow, which is not the case on PC. Also because of that, they recommend grouping as many passes as possible as sub-passes of a single Vulkan render pass, even to a degree that rendering of 3D objects could be grouped together with screen-space postprocessing effects. Again, it isn’t a thing that we normally do on PCs. It was also interesting to see how they measure performance. While I always disable V-sync and just measure FPS in games, they seem to give multiple columns with results, including FPS, but also GPU utilization %, which is likely used when reaching 60 FPS with V-sync always enabled.
But more than any specific presentation, it was interesting for me to hear some general ideas about Vulkan, often repeated by multiple people. There were people from Khronos and LunarG there (the company that develops Vulkan SDK), so we could hear from and ask questions of people who really make this API. There was a discussion panel with many prominent participants who shared their voice on these topics. No one said “what happens on Vulkanised stays on Vulkanised”, so here are some things I remember. Disclaimer: These are my personal, subjective impressions. I might remember something wrong. Please feel free to leave a comment with your own thoughts below this article.
Some profound things have been said about Vulkan. Someone said it’s not a graphics API, more like a Hardware Abstraction Layer (HAL) or an API for programming accelerators. They said it is a “design by compromise” rather than “design by committee”. They said we should think of Vulkan as not only the specification, but the entire ecosystem, including libraries, tools, code samples, learning materials, etc. I was pleased to hear that Vulkan Memory Allocator that I maintain was often mentioned as one of the examples. An open question is how many of these 3rd party components should be considered “canonical”. Many are already included in Vulkan SDK, but should official samples use them as well? Currently, they don’t, as they teach raw Vulkan. Someone also said that these ecosystem components should be properly funded. Another question was about the direction Vulkan should go. One person said it should probably become even more low-level, with app-space libraries on top of it more widely used.
It was surprising to see that there are solutions to run Vulkan above and below every other graphics API, which makes Vulkan a common ground across systems and APIs:
Among problems that developers have with using Vulkan and potential areas of development for the future, I noticed several common themes:
Overall, participation in Vulkanised conference was a great experience for me. I hope to come back there. But Vulkan, even with its unprecedented openness, portability, and universality, is just part of the entire world of 3D graphics programming. At a conference dedicated to Vulkan I wouldn’t say out loud that Direct3D 12 is more popular among PC game developers and it is not without a reason, or that maybe both these “explicit” APIs are at the worst possible level of abstraction - low-level enough to be difficult to learn and use and easy to create bugs in, while high-level enough to still hide hardware details crucial to squeezing maximum performance. But this is a separate topic…
When attending any event, I always pay attention to the quality of the audio-video system. On Vulkanised, it was very good. I especially liked the acoustics of the room, which clearly someone paid attention to when designing the interior. But there were some issues with presentation video that I don’t see too often. I blogged before about 3 Rules to Make You Image Looking Good on a Projector, where I mentioned potential problems with contrast, reproduction of colors or thin lines. Another time I described a possibility that edges of the screen may be cropped. But this conference had a different problem. Instead of connecting their laptops to a HDMI cable, speakers were asked to join an online meeting via Google Meet and share their screen there, with presentation on the big screen by another participant of that virtual call, streaming the content. We were in a Google office, after all :) This surely helped them record the presentations easily, but it also made any video or animation degraded to what looked like 2 FPS.
For more photos, see the official gallery 2023 Vulkanised by Khronos.
# Vulkan Memory Allocator 3.0.0 and D3D12 Memory Allocator 2.0.0
Yesterday we released new major version of Vulkan Memory Allocator 3.0.0 and D3D12 Memory Allocator 2.0.0, so if you are coding with Vulkan or Direct3D 12, I recommend to take a look at these libraries. Because coding them is part of my job, I won't describe them in detail here, but just refer to my article published on GPUOpen.com: "Announcing Vulkan Memory Allocator 3.0.0 and Direct3D 12 Memory Allocator 2.0.0". Direct links:
Vulkan Memory Allocator
D3D12 Memory Allocator
# First Look at New D3D12 Enhanced Barriers
This will be a pretty advanced or at least intermediate article. It assumes you know Direct3D 12 API. Some references to Vulkan may also appear. I am writing it because I just found out that yesterday Microsoft announced an upcoming big change in D3D12: Enhanced Barriers. It will be an addition to the API that provides a new way to do barriers. Considering my professional interests, this looks very important to me and also quite revolutionary. This article summarizes my first look and my thoughts about this new addition to the API or, speaking in terms of modern internet, my "unboxing" or "reaction" ;)
Bill Kristiansen, the author of the article linked above, wrote that currently only the software-simulated WARP device supports the new enhanced barriers. Support in real GPU drivers will come at a later time. The new barriers can replace the old way of doing them, but both will still be available and can also be mixed in one application. This means it is not a revolution big enough to turn our DirectX development upside down - we can switch to them gradually. For now we can just prepare ourselves for the future by studying the interface (which I do in this article) and testing some code using the WARP device.
UPDATE 2021-12-10: I just learned that Microsoft actually did publish documentation of the new API: Enhanced Barriers @ DirectX-Specs, so I recommend reading it before this article.
# VkExtensionsFeaturesHelp - My New Library
I had this idea for quite some time and finally I've spent last weekend coding it, so here it is: 611 lines of code (and many times more of documentation), shared for free under the MIT license:
Vulkan Extensions & Features Help, or VkExtensionsFeaturesHelp, is a small, header-only, C++ library for developers who use Vulkan API. It helps to avoid boilerplate code while creating a VkDevice object by providing a convenient way to query and then enable:
The library provides a domain-specific language to describe the list of required or supported extensions, features, and layers. The language is fully defined in terms of preprocessor macros, so no custom build step is needed.
Any feedback is welcome :)
# Vulkan Memory Types on PC and How to Use Them
Allocation of memory for buffers and textures is one of the fundamental things we do when using graphics APIs, like DirectX or Vulkan. It is of my particular interest as I develop Vulkan Memory Allocator and D3D12 Memory Allocator libraries (as part of my job – these are not personal projects). Although underlying hardware (RAM dice and GPU) stay the same, different APIs expose them differently. I’ve described these differences in detail in my article “Differences in memory management between Direct3D 12 and Vulkan”. I also gave a talk “Memory management in Vulkan and DX12” at GDC 2018 and my colleague Ste Tovey presented much more details in his talk “Memory Management in Vulkan” at Vulkanised 2018.
In this article, I would like to present common patterns seen on the list of memory types available in Vulkan on Windows PCs. First, let me recap what the API offers: Unlike in DX12, where you have just 3 default “memory heap types” (
D3D12_HEAP_TYPE_READBACK), in Vulkan there is a 2-level hierarchy: a list of “memory heaps” and “memory types” inside them that you need to query and that can look completely different on various GPUs, operating systems, and driver versions. Some constraints and guarantees apply, as described in the Vulkan specification, e.g. there is always some DEVICE_LOCAL and some HOST_VISIBLE memory type.
A memory heap, as queried from vkGetPhysicalDeviceMemoryProperties and returned in VkMemoryHeap, represents some (more or less) physical memory, e.g. video RAM on the graphics card or system RAM on the motherboard. It has some fixed size in bytes, and current available budget that can be queried using extension VK_EXT_memory_budget. A memory type, as returned in VkMemoryType, belongs to certain heap and offers a “view” to that heap with certain properties, represented by VkMemoryPropertyFlags. Most notable are:
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, which always matches flag VK_MEMORY_HEAP_DEVICE_LOCAL_BIT in the heap it belongs to, informs that the memory is local to the “device” (the GPU in Vulkan terminology). It doesn’t change what you can or cannot do with this memory type. If creating certain buffers or textures was possible only in GPU and not CPU memory, it would be expressed by appropriate bits not set in
DEVICE_LOCAL flag set in a memory type is just a hint for us that resources created in that memory will probably work faster when accessed on the GPU.
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT. Unlike the previous one, this flag changes a lot. It means that you can call
VkDeviceMemory objects allocated from this type and get a raw, CPU-side pointer to its data. In short: you can access this memory directly from the CPU, without a need to launch a Vulkan command for explicit transfer, like
VK_MEMORY_PROPERTY_HOST_CACHED_BIT, which can occur only on memory types that are also HOST_VISIBLE. This one again is just a hint for us. It changes nothing we can or cannot do with that memory. It just informs us that access to this memory will go through cache (from CPU perspective). As a result, a memory type with this flag should be fast to write, read, and access randomly via mapped pointer. What the lack of this flag means is not clearly defined. Such memory may represent system RAM or even video RAM, but a common meaning (at least on PC) is that accesses are then uncached but write-combined from CPU perspective, which means we should only write to it sequentially (best to do memcpy), never read from it or jump over random places, as it may be slow.
Theoretically, a good algorithm as recommended by the spec, to search for the first memory type meeting your requirements, should be robust enough to work on any GPU, but if you want to make sure your application works correctly and efficiently on a variety of graphics hardware available on the market today, you may need to adjust your resource management policy to a specific set of memory heaps/types found on a user’s machine. To simplify this task, below I present common patterns that can be observed on the list of Vulkan memory heaps and types on various GPUs, on Windows PCs. I also describe their meaning and consequences.
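The "first matching memory type" search mentioned above can be sketched as follows. To keep the example self-contained, it does not include the Vulkan headers: the constants are stand-ins that mirror the values of the corresponding VK_MEMORY_PROPERTY_*_BIT flags, `MemoryType` is a simplified stand-in for VkMemoryType, and `FindMemoryType` is a hypothetical helper name. The `memoryTypeBits` parameter plays the role of the mask returned in VkMemoryRequirements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of VK_MEMORY_PROPERTY_*_BIT flags.
constexpr uint32_t DEVICE_LOCAL  = 0x1;
constexpr uint32_t HOST_VISIBLE  = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED   = 0x8;

// Simplified stand-in for VkMemoryType.
struct MemoryType {
    uint32_t propertyFlags;
    uint32_t heapIndex;
};

// Returns the index of the first memory type that is allowed by
// memoryTypeBits and has all requiredFlags set, or -1 if none matches.
int FindMemoryType(const std::vector<MemoryType>& types,
                   uint32_t memoryTypeBits,
                   uint32_t requiredFlags) {
    assert(types.size() <= 32);  // the mask has one bit per memory type
    for (uint32_t i = 0; i < types.size(); ++i) {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (types[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Because the search returns the first match, the order in which the driver lists memory types matters, which is one reason why knowing the common layouts helps.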
Before I start, I must show you website vulkan.gpuinfo.org, if you don’t already know it. It is a great database of all Vulkan capabilities, features, limits, and extensions, including memory heaps/types, cataloged from all kinds of GPUs and operating systems.
1. The Intel way
Intel manufactures integrated graphics (although they also released a discrete card recently). As the GPU is integrated into the CPU, it shares the same memory. It then makes sense to expose the following memory types in Vulkan (example: Intel(R) UHD Graphics 600):
Heap 0: DEVICE_LOCAL
Size = 1,849,059,532 B
Type 0: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT
Type 1: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, HOST_CACHED
What it means: The simplest and the most intuitive set of memory types. There is just one memory that represents system RAM, or a part of it that can be used for graphics resources. All memory types are DEVICE_LOCAL, which means GPU has fast access to them. They are also all HOST_VISIBLE – accessible to the CPU. Type 0 without HOST_CACHED flag is good for writing through mapped pointer and reading by the GPU, while type 1 with HOST_CACHED flag is good for writing by the GPU commands and reading via mapped pointer.
How to use it: You can just load your resources directly from disk. There is no need to create a separate staging copy, separate GPU copy, and issue a transfer command, like we do with discrete graphics cards. With images you need to use VK_IMAGE_TILING_OPTIMAL for best performance and so you need to call vkCmdCopyBufferToImage, but at least for buffers you can just map them, fill the content via CPU pointer and then tell GPU to use that memory – an approach which can save both time and precious bytes of memory.
# Which Values Are Scalar in a Shader?
GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.
Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.
But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correct to say SIMT. For the explanation of the difference see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.
Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow exchanging data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may help optimize shader performance.
This additional level of grouping yields a possibility for a variable to be or not to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call it scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveActiveAllTrue return the same value for all threads, so it’s always scalar.
Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing buffer or texture array requires special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on shader compiler at Intel, helping to finish DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.
Another such case is a function WaveReadLaneAt which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered it actually works fine to use a dynamically varying value for it, like Ken Hu in his blog post “HLSL pitfalls”. Now Microsoft formally admitted that it is working and allowed LaneIndex to be any value by making this GitHub commit to their documentation.
If it is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.
SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search specifications for this topic, but it seems obvious that if a groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.
SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.
SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. Article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with a known fixed hardware, but it is not formally correct to assume this on any PC. Neither D3D nor Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a driver in a new version may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU couldn’t appear in the future that is 128-lane wide. WaveGetLaneCount function says “the result will be between 4 and 128”. Such a GPU would execute the entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.
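The wave size argument can be checked with a small CPU-side simulation. This is only a sketch with a made-up function name, and it deliberately assumes what the text says is not guaranteed: that threads of a group are assigned to waves linearly in row-major order (x changing fastest). Under that assumption, it tests whether every wave sees a single value of SV_GroupThreadID.y.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// ASSUMPTION: threads are assigned to waves linearly in row-major order.
// Neither D3D nor Vulkan guarantees this. Returns true if every wave
// contains threads with only one value of the y coordinate.
bool IsGroupThreadIdYUniformPerWave(uint32_t groupSizeX,
                                    uint32_t groupSizeY,
                                    uint32_t waveSize) {
    assert(groupSizeX > 0 && groupSizeY > 0 && waveSize > 0);
    const uint32_t threadCount = groupSizeX * groupSizeY;
    for (uint32_t first = 0; first < threadCount; first += waveSize) {
        // Flattened index of the last thread in this wave.
        const uint32_t last = std::min(first + waveSize, threadCount) - 1;
        // y of a thread with flattened index i is i / groupSizeX.
        if (first / groupSizeX != last / groupSizeX) {
            return false;
        }
    }
    return true;
}
```

For a 64x2 group, the check passes with 32 or 64 lanes per wave and fails with 128 lanes, which matches the reasoning above. A driver that packs waves as 16x2x1 breaks the row-major assumption itself, so this simple model cannot express that case.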
Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.
# System Value Semantics in Compute Shaders - Cheat Sheet
After compute shaders appeared, programmers no longer need to pretend they do graphics and render pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such shader is a short (or not so short) program to be invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which is the current thread. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operation on some data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.
This raises a question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose and so does GLSL by defining equivalent built-in variables. For a long time I couldn't remember their names, as the ones in HLSL are quite misleading. If GroupID is an ID of the entire group, and GroupThreadID is an ID of the thread within a group, GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually an index of a single thread within a group. GLSL is more consistent in this regard, clearly stating "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of their SVs with a picture on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in form of a table, a small cheat sheet:
|HLSL Semantics|GLSL Variable|Type (Dimension)|Unit|Reference|
|---|---|---|---|---|
|SV_GroupID|gl_WorkGroupID|uint3 (3D)|Entire group|Global in dispatch|
|SV_GroupThreadID|gl_LocalInvocationID|uint3 (3D)|Single thread|Local in group|
|SV_DispatchThreadID|gl_GlobalInvocationID|uint3 (3D)|Single thread|Global in dispatch|
|SV_GroupIndex|gl_LocalInvocationIndex|uint (flattened)|Single thread|Local in group|
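The relationships between these values can be written down as a small CPU-side sketch. The `UInt3` struct and the function names are made up for this illustration, and `numThreads` stands for the group size declared in the shader. The formulas follow the definitions in Microsoft's documentation of these semantics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL uint3.
struct UInt3 {
    uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID
UInt3 DispatchThreadID(UInt3 groupID, UInt3 groupThreadID, UInt3 numThreads) {
    assert(groupThreadID.x < numThreads.x && groupThreadID.y < numThreads.y &&
           groupThreadID.z < numThreads.z);
    return {groupID.x * numThreads.x + groupThreadID.x,
            groupID.y * numThreads.y + groupThreadID.y,
            groupID.z * numThreads.z + groupThreadID.z};
}

// SV_GroupIndex = z * X * Y + y * X + x, where X and Y are the first two
// dimensions of numthreads: a flattened index of a thread within its group.
uint32_t GroupIndex(UInt3 groupThreadID, UInt3 numThreads) {
    assert(groupThreadID.x < numThreads.x && groupThreadID.y < numThreads.y &&
           groupThreadID.z < numThreads.z);
    return groupThreadID.z * numThreads.x * numThreads.y +
           groupThreadID.y * numThreads.x + groupThreadID.x;
}
```

For example, with numthreads(10,8,3), group (2,1,0) and thread (7,5,0) within it, the dispatch thread ID is (27,13,0) and the group index is 57.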
Update 2023-08-30: There is another article about this topic that I recommend: "Dispatch IDs and you".