Entries for tag "vulkan", ordered from most recent. Entry count: 32.
# VkExtensionsFeaturesHelp - My New Library
I had this idea for quite some time and finally spent last weekend coding it, so here it is: 611 lines of code (and many times more of documentation), shared for free under the MIT license:
Vulkan Extensions & Features Help, or VkExtensionsFeaturesHelp, is a small, header-only C++ library for developers who use the Vulkan API. It helps to avoid boilerplate code while creating a VkDevice object by providing a convenient way to query and then enable the extensions, features, and layers you need.
The library provides a domain-specific language to describe the list of required or supported extensions, features, and layers. The language is fully defined in terms of preprocessor macros, so no custom build step is needed.
Any feedback is welcome :)
# Vulkan Memory Types on PC and How to Use Them
Allocation of memory for buffers and textures is one of the fundamental things we do when using graphics APIs, like DirectX or Vulkan. It is of particular interest to me as I develop the Vulkan Memory Allocator and D3D12 Memory Allocator libraries (as part of my job – these are not personal projects). Although the underlying hardware (RAM dice and GPU) stays the same, different APIs expose it differently. I’ve described these differences in detail in my article “Differences in memory management between Direct3D 12 and Vulkan”. I also gave a talk “Memory management in Vulkan and DX12” at GDC 2018 and my colleague Ste Tovey presented many more details in his talk “Memory Management in Vulkan” at Vulkanised 2018.
In this article, I would like to present common patterns seen on the list of memory types available in Vulkan on Windows PCs. First, let me recap what the API offers: Unlike in DX12, where you have just 3 default “memory heap types” (such as D3D12_HEAP_TYPE_READBACK), in Vulkan there is a 2-level hierarchy: a list of “memory heaps” and “memory types” inside them, which you need to query and which can look completely different on various GPUs, operating systems, and driver versions. Some constraints and guarantees apply, as described in the Vulkan specification, e.g. there is always some DEVICE_LOCAL and some HOST_VISIBLE memory type.
A memory heap, as queried from vkGetPhysicalDeviceMemoryProperties and returned in VkMemoryHeap, represents some (more or less) physical memory, e.g. video RAM on the graphics card or system RAM on the motherboard. It has a fixed size in bytes and a current available budget that can be queried using the extension VK_EXT_memory_budget. A memory type, as returned in VkMemoryType, belongs to a certain heap and offers a “view” to that heap with certain properties, represented by VkMemoryPropertyFlags. Most notable are:
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, which always matches the flag VK_MEMORY_HEAP_DEVICE_LOCAL_BIT in the heap it belongs to, informs that the memory is local to the “device” (the GPU in Vulkan terminology). It doesn’t change what you can or cannot do with this memory type. If creating certain buffers or textures was possible only in GPU and not CPU memory, it would be expressed by the appropriate bits not being set in the memory requirements reported for that resource. The DEVICE_LOCAL flag set in a memory type is just a hint for us that resources created in that memory will probably work faster when accessed on the GPU.
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT. Unlike the previous one, this flag changes a lot. It means that you can map VkDeviceMemory objects allocated from this type and get a raw, CPU-side pointer to their data. In short: you can access this memory directly from the CPU, without a need to launch a Vulkan command for an explicit transfer.
VK_MEMORY_PROPERTY_HOST_CACHED_BIT, which can occur only on memory types that are also HOST_VISIBLE. This one again is just a hint for us. It changes nothing about what we can or cannot do with that memory. It just informs us that access to this memory will go through the cache (from the CPU perspective). As a result, a memory type with this flag should be fast to write, read, and access randomly via a mapped pointer. What the lack of this flag means is not clearly defined. Such memory may represent system RAM or even video RAM, but a common meaning (at least on PC) is that accesses are then uncached but write-combined from the CPU perspective, which means we should only write to it sequentially (best to do memcpy) and never read from it or jump over random places, as it may be slow.
Theoretically, a good algorithm for finding the first memory type that meets your requirements, as recommended by the spec, should be robust enough to work on any GPU. However, if you want to make sure your application works correctly and efficiently on the variety of graphics hardware available on the market today, you may need to adjust your resource management policy to the specific set of memory heaps/types found on a user’s machine. To simplify this task, below I present common patterns that can be observed on the list of Vulkan memory heaps and types on various GPUs, on Windows PCs. I also describe their meaning and consequences.
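The "first matching memory type" search can be sketched roughly as below. This is only an illustration, not code from any library: MemoryType and findMemoryType are hypothetical stand-ins so that the snippet compiles without the Vulkan SDK, and memoryTypeBits is assumed to play the role of VkMemoryRequirements::memoryTypeBits (bit i set means type i is acceptable for the resource).

```cpp
#include <cstdint>

// Stand-in flag values, equal to the corresponding VkMemoryPropertyFlagBits.
constexpr std::uint32_t DEVICE_LOCAL  = 0x1;
constexpr std::uint32_t HOST_VISIBLE  = 0x2;
constexpr std::uint32_t HOST_COHERENT = 0x4;
constexpr std::uint32_t HOST_CACHED   = 0x8;

// Hypothetical, simplified counterpart of VkMemoryType.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};

// Returns the index of the first memory type that is allowed by
// memoryTypeBits and has all requiredFlags set, or -1 if there is none.
constexpr int findMemoryType(const MemoryType* types, std::uint32_t typeCount,
                             std::uint32_t memoryTypeBits,
                             std::uint32_t requiredFlags) {
    for (std::uint32_t i = 0; i < typeCount; ++i) {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (types[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags) return static_cast<int>(i);
    }
    return -1;
}
```

In a real application the array would come from vkGetPhysicalDeviceMemoryProperties. A search like this only guarantees a valid type, not the most efficient one, which is why the patterns below matter.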
Before I start, I must show you the website vulkan.gpuinfo.org, if you don’t already know it. It is a great database of all Vulkan capabilities, features, limits, and extensions, including memory heaps/types, cataloged from all kinds of GPUs and operating systems.
1. The Intel way
Intel manufactures integrated graphics (although they also released a discrete card recently). As a GPU integrated into the CPU, it shares the same memory. It then makes sense to expose the following memory types in Vulkan (example: Intel(R) UHD Graphics 600):
Heap 0: DEVICE_LOCAL
Size = 1,849,059,532 B
Type 0: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT
Type 1: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, HOST_CACHED
What it means: The simplest and the most intuitive set of memory types. There is just one memory that represents system RAM, or a part of it that can be used for graphics resources. All memory types are DEVICE_LOCAL, which means the GPU has fast access to them. They are also all HOST_VISIBLE – accessible to the CPU. Type 0 without the HOST_CACHED flag is good for writing through a mapped pointer and reading by the GPU, while type 1 with the HOST_CACHED flag is good for writing by GPU commands and reading via a mapped pointer.
How to use it: You can just load your resources directly from disk. There is no need to create a separate staging copy, separate GPU copy, and issue a transfer command, like we do with discrete graphics cards. With images you need to use VK_IMAGE_TILING_OPTIMAL for best performance and so you still need vkCmdCopyBufferToImage, but at least for buffers you can just map them, fill the content via a CPU pointer, and then tell the GPU to use that memory – an approach which can save both time and precious bytes of memory.
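One possible way to recognize this pattern at runtime and pick a type for each access direction is sketched below. The helper names allTypesSharedWithCpu and pickMappableType are made up for this example, and the plain array of flags is an assumed simplification of the VkMemoryType list.

```cpp
#include <cstdint>

// Stand-in flag values, equal to the corresponding VkMemoryPropertyFlagBits.
constexpr std::uint32_t DEVICE_LOCAL = 0x1;
constexpr std::uint32_t HOST_VISIBLE = 0x2;
constexpr std::uint32_t HOST_CACHED  = 0x8;

// True when every memory type is both DEVICE_LOCAL and HOST_VISIBLE, as in
// the Intel listing above. flags[i] holds the property flags of type i.
constexpr bool allTypesSharedWithCpu(const std::uint32_t* flags,
                                     std::uint32_t count) {
    const std::uint32_t both = DEVICE_LOCAL | HOST_VISIBLE;
    for (std::uint32_t i = 0; i < count; ++i) {
        if ((flags[i] & both) != both) return false;
    }
    return count > 0;
}

// Picks a HOST_VISIBLE type: for CPU reads it prefers one with HOST_CACHED,
// for CPU writes one without it. Falls back to the first HOST_VISIBLE type,
// and returns -1 if there is none.
constexpr int pickMappableType(const std::uint32_t* flags, std::uint32_t count,
                               bool forCpuReads) {
    int fallback = -1;
    for (std::uint32_t i = 0; i < count; ++i) {
        if ((flags[i] & HOST_VISIBLE) == 0) continue;
        const bool cached = (flags[i] & HOST_CACHED) != 0;
        if (cached == forCpuReads) return static_cast<int>(i);
        if (fallback < 0) fallback = static_cast<int>(i);
    }
    return fallback;
}
```

With the two Intel types listed above, this would choose type 0 for uploads and type 1 for readback, matching the description in the text.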
# Which Values Are Scalar in a Shader?
GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have a constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.
Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.
But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correct to say SIMT. For an explanation of the difference see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, or 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.
Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow exchanging data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may help optimize shader performance.
This additional level of grouping makes it possible for a variable to be or not to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call it scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveActiveAllTrue return the same value for all threads, so their result is always scalar.
Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing a buffer or texture array requires the special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on the shader compiler at Intel, helping to finish the DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.
Another such case is the function WaveReadLaneAt, which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered it actually works fine to use a dynamically varying value for it, as Ken Hu described in his blog post “HLSL pitfalls”. Now Microsoft has formally admitted that it works and allowed LaneIndex to be any value by making this GitHub commit to their documentation.
Since it is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.
SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search specifications for this topic, but it seems obvious that if a groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.
SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.
SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. The article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with known, fixed hardware, but it is not formally correct to assume this on any PC. Neither the D3D nor the Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a driver in a new version may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU that is 128 lanes wide couldn’t appear in the future. The documentation of the WaveGetLaneCount function says “the result will be between 4 and 128”. Such a GPU would execute the entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.
Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.
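The [numthreads(64,2,1)] case can be checked with a small CPU-side model. The sketch below is written in C++ rather than HLSL and rests on one loudly stated assumption that no specification guarantees: threads are linearized row-major and each wave takes consecutive indices. The function name isGroupThreadIdYScalar is made up for this example.

```cpp
#include <cstdint>

// Assumption for illustration only: threads are linearized row-major
// (flat = y * groupX + x) and each wave takes waveSize consecutive flat
// indices. Neither D3D nor Vulkan guarantees this mapping.
// waveSize must be greater than 0.
constexpr bool isGroupThreadIdYScalar(std::uint32_t groupX,
                                      std::uint32_t groupY,
                                      std::uint32_t waveSize) {
    const std::uint32_t threadCount = groupX * groupY;
    for (std::uint32_t first = 0; first < threadCount; first += waveSize) {
        std::uint32_t last = first + waveSize - 1;
        if (last >= threadCount) last = threadCount - 1;
        // y grows monotonically with the flat index, so it is enough to
        // compare y of the first and the last lane of the wave.
        if (first / groupX != last / groupX) return false;
    }
    return true;
}
```

Under this model a 64x2x1 group keeps y uniform for 32 and 64 lanes, but not for 128 lanes, where the whole group lands in one wave. A different thread-to-wave mapping, such as 16x2x1, is outside what this model covers and would break the property as well.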
# System Value Semantics in Compute Shaders - Cheat Sheet
After compute shaders appeared, programmers no longer need to pretend they do graphics and render pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such shader is a short (or not so short) program to be invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which is the current thread. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operation on some data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.
This raises a question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose and so does GLSL by defining equivalent built-in variables. For a long time I couldn't remember their names, as the ones in HLSL are quite misleading. If GroupID is an ID of the entire group, and GroupThreadID is an ID of the thread within a group, GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually an index of a single thread within a group. GLSL is more consistent in this regard, clearly stating "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of their SVs with a picture on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in the form of a table, a small cheat sheet:
| HLSL Semantics | GLSL Variable | Type (Dimension) | Unit | Reference |
|---|---|---|---|---|
| SV_GroupID | gl_WorkGroupID | uint3 (3D) | Entire group | Global in dispatch |
| SV_GroupThreadID | gl_LocalInvocationID | uint3 (3D) | Single thread | Local in group |
| SV_DispatchThreadID | gl_GlobalInvocationID | uint3 (3D) | Single thread | Global in dispatch |
| SV_GroupIndex | gl_LocalInvocationIndex | uint (flattened) | Single thread | Local in group |
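The relations between these values can be written down as plain arithmetic. The following is a CPU-side C++ sketch, not shader code. UInt3 and both function names are introduced here only for illustration, and numThreads stands for the group size declared with [numthreads(X, Y, Z)].

```cpp
#include <cstdint>

// Minimal stand-in for HLSL uint3.
struct UInt3 {
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID,
// computed per component.
constexpr UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId,
                                 UInt3 numThreads) {
    return UInt3{groupId.x * numThreads.x + groupThreadId.x,
                 groupId.y * numThreads.y + groupThreadId.y,
                 groupId.z * numThreads.z + groupThreadId.z};
}

// SV_GroupIndex = z * X * Y + y * X + x, where (X, Y, Z) = numthreads and
// (x, y, z) = SV_GroupThreadID.
constexpr std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numThreads) {
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x + groupThreadId.x;
}
```

For example, with a group size of 10x8x3, thread (7,5,0) in group (2,1,0) gets a dispatch thread ID of (27,13,0) and a group index of 57.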
# Why Not Use Heterogeneous Multi-GPU?
There was an interesting discussion recently on one Slack channel about using an integrated GPU (iGPU) together with a discrete GPU (dGPU). Many sound ideas were shared there, so I think it's worth writing them down. But because I probably never blogged about multi-GPU before, a few words of introduction first:
The idea to use multiple GPUs in one program is not new, but not very widespread either. In old graphics APIs like Direct3D 11 it wasn't easy to implement. Doing it right in a complex game often involved engaging driver engineers from the GPU manufacturer (like AMD, NVIDIA) or using custom vendor extensions (like AMD GPU Services - see for example Explicit Crossfire API).
The new generation of graphics APIs – Direct3D 12 and Vulkan – are lower level and give more direct access to the hardware. This includes the possibility to implement multi-GPU support on your own. There are two modes of operation. If the GPUs are identical (e.g. two graphics cards of the same model plugged into the motherboard), you can use them as one device object. In D3D12 you then index them as Node 0, Node 1, ... and specify a NodeMask bit mask when allocating GPU memory, submitting commands, and doing all sorts of GPU things. Similarly, in Vulkan you have the VK_KHR_device_group extension available that allows you to create one logical device object that will use multiple physical devices.
But this post is about heterogeneous/asymmetric multi-GPU, where there are two different GPUs installed in the system, e.g. one integrated with the CPU and one discrete. A common example is a laptop with "switchable graphics", which may have an Intel CPU with its integrated “HD” graphics plus an NVIDIA GPU. There may even be two different GPUs from the same manufacturer! My new laptop (ASUS TUF Gaming FX505DY) has AMD Radeon Vega 8 + Radeon RX 560X. Another example is a desktop PC with CPU-integrated graphics and a discrete graphics card installed. Such a combination may still be used by a single app, but to do that, you must create and use two separate Device objects. But just because you could doesn't mean you should…
First question is: Are there games that support this technique? Probably few… There is just one example I heard of: Ashes of the Singularity by Oxide Games, and it was many years ago, when DX12 was still fresh. Other than that, there are mostly tech demos, e.g. "WITCH CHAPTER 0 [cry]" by Square Enix as described on DirectX Developer Blog (also 5 years old).
iGPU typically has lower computational power than dGPU. It could accelerate some pieces of computations needed each frame. One idea is to hand over the already rendered 3D scene to the iGPU so it can finish it with screen-space postprocessing effects and present it, which sounds even better if the display is connected to iGPU. Another option is to accelerate some computations, like occlusion culling, particles, or water simulation. There are some excellent learning materials about this technique. The best one I can think of is: Multi-Adapter with Integrated and Discrete GPUs by Allen Hux (Intel), GDC 2020.
However, there are many drawbacks of this technique, which were discussed in the Slack chat I mentioned:
Conclusion: Supporting heterogeneous multi-GPU in a game engine sounds like an interesting technical challenge, but better think twice before doing it in production code.
BTW If you want to use just one GPU and worry about the selection of the right one, see my old post: Switchable graphics versus D3D11 adapters.
# Texture Compression: What Can It Mean?
"Data compression - the process of encoding information using fewer bits than the original representation." That's the definition from Wikipedia. But when we talk about textures (images that we use while rendering 3D graphics), it's not that simple. There are 4 different things we can mean by talking about texture compression, some of which you may not know. In this article, I'd like to give you some basic information about them.
1. Lossless data compression. That's the compression used to shrink binary data in size losing no single bit. We may talk about compression algorithms and libraries that implement them, like popular zlib or LZMA SDK. We may also mean file formats like ZIP or 7Z, which use these algorithms, but also define a way to pack multiple files with their whole directory structure into a single archive file.
An important thing to note here is that we can use this compression for any data. Some file types like text documents or binary executables have to be compressed in a lossless way so that no bits are lost or altered. You can also compress image files this way. Compression ratio depends on the data. The size of the compressed file will be smaller if there are many repeating patterns - the data look pretty boring, like many pixels with the same color. If the data is more varied, with each next pixel having even a slightly different value, then you may end up with a compressed file as large as the original one or even larger. For example, the following two images have size 480 x 480. When saved as an uncompressed BMP R8G8B8 file, they both take 691,322 bytes. When compressed to a ZIP file, the first one is only 15,993, while the second one is 552,782 bytes.
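The dependence of the compression ratio on the data can be shown with a toy example. The sketch below uses naive run-length encoding, which is far simpler than the algorithms used by ZIP and is swapped in here only to illustrate the effect. The function name rleEncodedSize is made up for this example.

```cpp
#include <cstddef>

// Size in bytes of a naive run-length encoding of the input: every run of
// equal bytes becomes a (count, value) pair, with count limited to 255.
constexpr std::size_t rleEncodedSize(const unsigned char* data,
                                     std::size_t size) {
    std::size_t encoded = 0;
    std::size_t i = 0;
    while (i < size) {
        std::size_t run = 1;
        while (i + run < size && data[i + run] == data[i] && run < 255) {
            ++run;
        }
        encoded += 2;  // one byte for the count, one for the value
        i += run;
    }
    return encoded;
}
```

Eight identical bytes shrink to 2 bytes, while eight different bytes grow to 16, which mirrors how "boring" data compresses well and varied data may end up larger than the original.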
We can talk about this compression in the context of textures because assets in games are often packed into archives in some custom format which protects the data from modification, speeds up loading, and may also use compression. For example, the new Call of Duty Warzone takes 162 GB of disk space after installation, but it has only 442 files because developers packed the largest data in some archives in files Data/data/data.000, 001 etc., 1 GB each.
2. Lossy compression. These are the algorithms that allow some data loss, but offer higher compression ratios than lossless ones. We use them for specific kinds of data, usually some media - images, sound, and video. For video it's virtually essential, because raw uncompressed data would take enormous space for each second of recording. Algorithms for lossy compression use the knowledge about the structure of the data to remove the information that will be unnoticeable or degrade quality to the lowest degree possible, from the perspective of human perception. We all know them - these are formats like JPEG for images and MP3 for music.
They have their pros and cons. JPEG compresses images in 8x8 blocks using the Discrete Cosine Transform (DCT). You can find an awesome, in-depth explanation of it on the page: Unraveling the JPEG. It's good for natural images, but with text and diagrams it may fail to maintain the desired quality. My first example saved as JPEG with Quality = 20% (this is very low, I usually use 90%) takes only 24,753 B, but it looks like this:
GIF is good for such synthetic images, but fails on natural images. I saved my second example as GIF with a color palette of 32 entries. The file is only 90,686 B, but it looks like this (look closer to see dithering used due to a limited number of colors):
Lossy compression is usually accompanied by lossless compression - file formats like JPEG, GIF, MP3, MP4 etc. compress the data losslessly on top of their core algorithm, so there is no point in compressing them again.
3. GPU texture compression. Here comes the interesting part. All formats described so far are designed to optimize data storage and transfer. We need to decompress all the textures packed in ZIP files or saved as JPEG before uploading them to video memory and using them for rendering. But there are other types of texture compression formats that can be used by the GPU directly. They are lossy as well, but they work in a different way - they use a fixed number of bytes per block of NxN pixels. Thanks to this, a graphics card can easily pick the right block from memory and decompress it on the fly, e.g. while sampling the texture. Some such formats are BC1..7 (which stands for Block Compression) or ASTC (used on mobile platforms). For example, BC7 uses 1 byte per pixel, or 16 bytes per 4x4 block. You can find some overview of these formats here: Understanding BCn Texture Compression Formats.
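Because every block has a fixed size, both the memory footprint of a texture and the location of any block follow from simple arithmetic. Here is a rough sketch with made-up helper names. The row-by-row block layout without padding is an assumption made for illustration, as real GPU memory layouts may differ.

```cpp
#include <cstdint>

// Bytes needed for one mip level of a texture compressed in 4x4 blocks.
// bytesPerBlock is 8 for BC1 and BC4, and 16 for BC2, BC3, BC5, BC6H and BC7.
constexpr std::uint64_t blockCompressedSize(std::uint32_t width,
                                            std::uint32_t height,
                                            std::uint32_t bytesPerBlock) {
    const std::uint64_t blocksX = (width + 3) / 4;  // partial blocks round up
    const std::uint64_t blocksY = (height + 3) / 4;
    return blocksX * blocksY * bytesPerBlock;
}

// Byte offset of the block that contains pixel (x, y), assuming blocks are
// stored row by row with no padding between rows.
constexpr std::uint64_t blockOffset(std::uint32_t x, std::uint32_t y,
                                    std::uint32_t width,
                                    std::uint32_t bytesPerBlock) {
    const std::uint64_t blocksX = (width + 3) / 4;
    return ((y / 4) * blocksX + (x / 4)) * bytesPerBlock;
}
```

For a 480 x 480 texture this gives 230,400 bytes in BC7 and 115,200 bytes in BC1, compared to 921,600 bytes for uncompressed R8G8B8A8.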
The only file format I know which supports this compression is DDS, as it can store any texture that can be loaded straight to DirectX in various pixel formats, including not only block compressed but also cube, 3D, etc. Most game developers design their own file formats for this purpose anyway, to load them straight into GPU memory with no conversion.
4. Internal GPU texture compression. Pixels of a texture may not be stored in video memory the way you think - row-major order, one pixel after the other, R8G8B8A8 or whatever format you chose. When you create a texture with VK_IMAGE_TILING_OPTIMAL (always do that, except for some very special cases), the GPU is free to use some optimized internal format. This may not be true "compression" by its definition, because it must be lossless, so the memory reserved for the texture will not be smaller. It may even be larger because of the requirement to store additional metadata. (That's why you have to take care of the extra VK_IMAGE_ASPECT_METADATA_BIT when working with sparse textures in Vulkan.) The goal of these formats is to speed up access to the texture.
Details of these formats are specific to GPU vendors and may or may not be public. Some ideas of how a GPU could optimize a texture in its memory include:
How to make best use of those internal GPU compression formats if they differ per graphics card vendor and we don't know their details? Just make sure you leave the driver as many optimization opportunities as possible by:
avoiding VK_SHARING_MODE_CONCURRENT for any textures that don't need it,
avoiding VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT for any textures that don't need it.
See also article Delta Color Compression Overview at GPUOpen.com.
Summary: As you can see, the term "texture compression" can mean different things, so when talking about anything like this, always make sure to be clear what you mean unless it's obvious from the context.
# Vulkan Memory Allocator - budget management
It also contains documentation of all new symbols and a general chapter "Staying within budget" that describes this topic. Documentation is pregenerated so it can be accessed by just downloading the repository as ZIP, unpacking, and opening file "docs\html\index.html" > chapter “Staying within budget”.
If you are interested, please take a look. Any feedback is welcomed - you can leave your comment below or send me an e-mail. Now is the best time to adjust this feature to users' needs before it gets into the official release of the library.
Long story short:
VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT, which causes the allocation to just return failure if it would go over budget.
Update 2019-12-20: This has been merged to master branch and shipped with the latest major release: Vulkan Memory Allocator 2.3.0.
# Differences in memory management between Direct3D 12 and Vulkan
Since July 2017 I have been developing Vulkan Memory Allocator (VMA) – a C++ library that helps with memory management in games and other applications using Vulkan. But because I deal with both Vulkan and DirectX 12 in my everyday work, I think it’s a good idea to compare them.
This is an article about a very specific topic. It may be useful to you if you are a programmer working with both graphics APIs – Direct3D 12 and Vulkan. These two APIs offer a similar set of features and performance. Both are new-generation, explicit, low-level interfaces to modern graphics hardware (GPUs), so we could compare them back-to-back to show similarities and differences, e.g. in naming things. For example, the ID3D12CommandQueue::ExecuteCommandLists function has a Vulkan equivalent in the form of the vkQueueSubmit function. However, this article focuses on just one aspect – memory management, which means the rules and limitations of GPU memory allocation and the creation of resources – images (textures, render targets, depth-stencil surfaces etc.) and buffers (vertex buffers, index buffers, constant/uniform buffers etc.). Chapters below describe pretty much all the aspects of memory management that differ between the two APIs.