Vulkan Memory Types on PC and How to Use Them

Sun 21 Feb 2021

Allocation of memory for buffers and textures is one of the fundamental things we do when using graphics APIs, like DirectX or Vulkan. It is of particular interest to me as I develop the Vulkan Memory Allocator and D3D12 Memory Allocator libraries (as part of my job; these are not personal projects). Although the underlying hardware (RAM dice and GPU) stays the same, different APIs expose it differently. I’ve described these differences in detail in my article “Differences in memory management between Direct3D 12 and Vulkan”. I also gave a talk “Memory management in Vulkan and DX12” at GDC 2018, and my colleague Ste Tovey presented many more details in his talk “Memory Management in Vulkan” at Vulkanised 2018.

In this article, I would like to present common patterns seen on the list of memory types available in Vulkan on Windows PCs. First, let me recap what the API offers. Unlike in DX12, where you have just 3 default “memory heap types” (D3D12_HEAP_TYPE_DEFAULT, D3D12_HEAP_TYPE_UPLOAD, D3D12_HEAP_TYPE_READBACK), in Vulkan there is a 2-level hierarchy: a list of “memory heaps” and, inside them, “memory types”. You need to query this list, and it can look completely different on various GPUs, operating systems, and driver versions. Some constraints and guarantees apply, as described in the Vulkan specification, e.g. there is always some DEVICE_LOCAL and some HOST_VISIBLE memory type.

A memory heap, as queried from vkGetPhysicalDeviceMemoryProperties and returned in VkMemoryHeap, represents some (more or less) physical memory, e.g. video RAM on the graphics card or system RAM on the motherboard. It has a fixed size in bytes and a current available budget that can be queried using the extension VK_EXT_memory_budget. A memory type, as returned in VkMemoryType, belongs to a certain heap and offers a “view” of that heap with certain properties, represented by VkMemoryPropertyFlags. The most notable are:

VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, which always matches the flag VK_MEMORY_HEAP_DEVICE_LOCAL_BIT in the heap it belongs to, informs that the memory is local to the “device” (the GPU in Vulkan terminology). It doesn’t change what you can or cannot do with this memory type. If creating certain buffers or textures was possible only in GPU and not CPU memory, it would be expressed by appropriate bits not set in VkMemoryRequirements::memoryTypeBits. The DEVICE_LOCAL flag set in a memory type is just a hint for us that resources created in that memory will probably work faster when accessed on the GPU.
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT. Unlike the previous one, this flag changes a lot. It means that you can call vkMapMemory on VkDeviceMemory objects allocated from this type and get a raw, CPU-side pointer to its data. In short: you can access this memory directly from the CPU, without the need to launch a Vulkan command for explicit transfer, like vkCmdCopyBuffer.
VK_MEMORY_PROPERTY_HOST_CACHED_BIT, which can occur only on memory types that are also HOST_VISIBLE. This one again is just a hint for us. It changes nothing about what we can or cannot do with that memory. It just informs us that access to this memory will go through the cache (from the CPU perspective). As a result, a memory type with this flag should be fast to write, read, and access randomly via a mapped pointer. What the lack of this flag means is not clearly defined. Such memory may represent system RAM or even video RAM, but a common meaning (at least on PC) is that accesses are then uncached but write-combined from the CPU perspective, which means we should only write to it sequentially (best to do memcpy) and never read from it or jump over random places, as it may be slow.

Theoretically, a good algorithm, as recommended by the spec, that searches for the first memory type meeting your requirements should be robust enough to work on any GPU. However, if you want to make sure your application works correctly and efficiently on the variety of graphics hardware available on the market today, you may need to adjust your resource management policy to the specific set of memory heaps/types found on a user’s machine. To simplify this task, below I present common patterns that can be observed on the list of Vulkan memory heaps and types on various GPUs, on Windows PCs. I also describe their meaning and consequences.
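The "first memory type meeting your requirements" search can be sketched as below. This is only an illustration under stated assumptions: MemoryType is a stand-in that mirrors the propertyFlags and heapIndex fields of VkMemoryType so the snippet compiles without the Vulkan SDK, the flag constants use the same numeric values as the corresponding VK_MEMORY_PROPERTY_* bits, and findMemoryType is a hypothetical helper name. In a real application the list would come from vkGetPhysicalDeviceMemoryProperties and memoryTypeBits from VkMemoryRequirements.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins so the sketch compiles without the Vulkan headers.
// Values match VkMemoryPropertyFlagBits.
constexpr uint32_t DEVICE_LOCAL  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t HOST_COHERENT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr uint32_t HOST_CACHED   = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

struct MemoryType {          // mirrors VkMemoryType (assumption, not the real type)
    uint32_t propertyFlags;
    uint32_t heapIndex;
};

// Returns the index of the first memory type that is allowed by
// memoryTypeBits and has all requiredFlags set, or -1 if none matches.
int32_t findMemoryType(const std::vector<MemoryType>& types,
                       uint32_t memoryTypeBits,
                       uint32_t requiredFlags)
{
    // Vulkan exposes at most 32 memory types, one bit each in memoryTypeBits.
    for (uint32_t i = 0; i < types.size() && i < 32; ++i) {
        const bool allowed  = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags = (types[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags)
            return static_cast<int32_t>(i);
    }
    return -1;
}
```

Because the search returns the first match, the order in which the driver lists memory types matters, which is one reason the patterns below are worth knowing.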

Before I start, I must point you to the website vulkan.gpuinfo.org, if you don’t already know it. It is a great database of all Vulkan capabilities, features, limits, and extensions, including memory heaps/types, cataloged from all kinds of GPUs and operating systems.

1. The Intel way

Intel manufactures integrated graphics (although they also released a discrete card recently). As the GPU is integrated into the CPU, it shares the same memory. It then makes sense to expose the following memory types in Vulkan (example: Intel(R) UHD Graphics 600):

Heap 0: DEVICE_LOCAL
  Size = 1,849,059,532 B
  Type 0: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT
  Type 1: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, HOST_CACHED

What it means: The simplest and the most intuitive set of memory types. There is just one memory that represents system RAM, or a part of it that can be used for graphics resources. All memory types are DEVICE_LOCAL, which means GPU has fast access to them. They are also all HOST_VISIBLE – accessible to the CPU. Type 0 without HOST_CACHED flag is good for writing through mapped pointer and reading by the GPU, while type 1 with HOST_CACHED flag is good for writing by the GPU commands and reading via mapped pointer.

How to use it: You can just load your resources directly from disk. There is no need to create a separate staging copy, a separate GPU copy, and issue a transfer command, like we do with discrete graphics cards. With images you need to use VK_IMAGE_TILING_OPTIMAL for best performance, and so you need to use vkCmdCopyBufferToImage, but at least for buffers you can just map them, fill the content via a CPU pointer, and then tell the GPU to use that memory, an approach which can save both time and precious bytes of memory.

2. The NVIDIA way

For discrete graphics cards, it makes perfect sense to have two memory heaps – one DEVICE_LOCAL to represent video RAM and the other one without this flag representing system RAM. This is exactly what NVIDIA cards do. For example (NVIDIA GeForce RTX 2070):

Heap 0: DEVICE_LOCAL
  Size = 8,421,113,856 B
  Type 0: DEVICE_LOCAL
  Type 1: DEVICE_LOCAL
Heap 1
  Size = 8,534,777,856 B
  Type 0
  Type 1: HOST_VISIBLE, HOST_COHERENT
  Type 2: HOST_VISIBLE, HOST_COHERENT, HOST_CACHED

What it means: Let’s disregard the high number of memory types available. NVIDIA likes to keep different types of resources (e.g. depth-stencil textures, render targets, buffers) in separate memory blocks, so it will just limit the types available for certain resources via VkMemoryRequirements::memoryTypeBits returned for a buffer or image. What is important here is that we have a disjoint set of memory types that are DEVICE_LOCAL (video RAM) and those that are HOST_VISIBLE (system RAM).

How to use it: We certainly need to create a staging copy of our resources in HOST_VISIBLE memory, at least temporarily, to load them from disk and then issue an explicit transfer using e.g. vkCmdCopyBuffer, vkCmdCopyBufferToImage to put them in another resource, created in DEVICE_LOCAL memory, that will be fast to access on the GPU.

There might be a possibility for the GPU to access resources created in non-DEVICE_LOCAL memory directly. To check if this is the case, create a buffer or texture with usage flags like VK_BUFFER_USAGE_VERTEX_BUFFER_BIT or VK_IMAGE_USAGE_SAMPLED_BIT and see if the memory types in question are among the bits returned via VkMemoryRequirements::memoryTypeBits. If it is possible, having the GPU read/write data straight from system RAM via the PCI Express bus will be slow, but it may be beneficial over having two copies of the resource and issuing a transfer command in certain cases, e.g. when each piece of data written on the CPU is read once on the GPU, or the data is so small that it will end up in GPU caches quickly (e.g. a tiny uniform buffer). You can test both approaches and measure which one works faster in your use case.
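The memoryTypeBits check described above could look roughly like this. It is a sketch, not a definitive implementation: MemoryType again stands in for VkMemoryType, the constants carry the numeric values of the matching VK_MEMORY_PROPERTY_* bits, and hostMemoryTypesUsableBy is a hypothetical helper name. The memoryTypeBits argument is assumed to come from vkGetBufferMemoryRequirements or vkGetImageMemoryRequirements for a resource created with the usage flags you care about.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins with the same values as the Vulkan flag bits.
constexpr uint32_t DEVICE_LOCAL = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT

struct MemoryType {          // mirrors VkMemoryType (assumption)
    uint32_t propertyFlags;
    uint32_t heapIndex;
};

// Returns a bit mask of memory types that are mappable, not DEVICE_LOCAL
// (i.e. presumably system RAM), and allowed for the given resource.
// A non-zero result means the GPU can use the resource straight from there.
uint32_t hostMemoryTypesUsableBy(const std::vector<MemoryType>& types,
                                 uint32_t memoryTypeBits)
{
    uint32_t result = 0;
    for (uint32_t i = 0; i < types.size() && i < 32; ++i) {
        const uint32_t flags = types[i].propertyFlags;
        const bool hostOnly = (flags & HOST_VISIBLE) != 0 && (flags & DEVICE_LOCAL) == 0;
        if (hostOnly && (memoryTypeBits & (1u << i)) != 0)
            result |= 1u << i;
    }
    return result;
}
```

Whether using such a type is actually faster than a staging copy still has to be measured, as noted above.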

3. The AMD way

There is a possibility for the CPU to address some video memory directly via a normal void* pointer. This feature, known as Base Address Register (BAR), among other names, can be exposed as a separate memory heap and type that is both DEVICE_LOCAL and HOST_VISIBLE. AMD cards have supported it for years. NVIDIA also started doing it in their recent drivers. Example: (AMD Radeon RX 5700 XT)

Heap 0: DEVICE_LOCAL
  Size = 8,304,721,920 B
  Type 0: DEVICE_LOCAL
Heap 1
  Size = 16,865,296,384 B
  Type 0: HOST_VISIBLE, HOST_COHERENT
  Type 1: HOST_VISIBLE, HOST_COHERENT, HOST_CACHED
Heap 2: DEVICE_LOCAL
  Size = 268,435,456 B
  Type 0: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT

What it means: This list of memory heaps/types looks similar to the previous one, except now we have an additional, 3rd heap that has a fixed size of only 256 MB. The memory type in this heap is DEVICE_LOCAL, but also HOST_VISIBLE at the same time. Most probably it is not a separate RAM chip, so things start getting “virtual” here (contrary to the promise of the Vulkan API, which was supposed to be low level and closely represent modern GPU hardware…) This memory is located on the graphics card, but is also accessible for mapping. We can then suppose that accessing such a pointer will require transferring data over the PCIe bus, so it will be quite slow. The lack of the HOST_CACHED flag on this memory type indicates that it is not cached from the CPU perspective, so it is better to only write to it sequentially.

How to use it: This limited amount of 256 MB special memory can be used for resources that are written from CPU and read on GPU to avoid having two copies and issuing an explicit vkCmdCopy*, just like I described in point 1. It might be a good idea to put there resources that are changing every frame, like a ring buffer with uniforms (constants). Just remember that graphics driver may also use this special memory to optimize usage of some implicit Vulkan stuff (e.g. descriptors), so don’t use full 256 MB or even better – query for current budget.
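A budget check before placing a resource in this small heap might look like the sketch below. The inputs are assumed to be the per-heap heapUsage and heapBudget values reported through VK_EXT_memory_budget; fitsInHeapBudget is a hypothetical helper, and the default 80% headroom is an arbitrary illustrative number, not a recommendation from any vendor.

```cpp
#include <cstdint>

// Decides whether an allocation of allocSize bytes should go into a heap,
// given the current usage and budget reported for that heap.
// maxFraction leaves headroom for the driver's own implicit allocations;
// the default of 0.8 is an assumption chosen only for illustration.
bool fitsInHeapBudget(uint64_t allocSize,
                      uint64_t heapUsage,
                      uint64_t heapBudget,
                      double maxFraction = 0.8)
{
    const uint64_t limit = static_cast<uint64_t>(heapBudget * maxFraction);
    if (heapUsage > limit)
        return false;                       // already over the self-imposed limit
    return allocSize <= limit - heapUsage;  // written this way to avoid overflow
}
```

When the check fails, the resource can go to ordinary HOST_VISIBLE system memory or use a staging copy instead.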

4. The SAM way

Smart Access Memory (SAM) is AMD’s marketing term for what is also commonly called Resizable BAR or ReBAR. It is a hardware feature that allows the CPU to directly access not only 256 MB, but the entire video memory. It requires support down to the lowest hardware level and it is only gaining popularity at the moment I write this article, so to experience it, you need to have a compatible motherboard and CPU (e.g. Ryzen 5000 series), update the BIOS, have a new graphics card (e.g. Radeon 6000 series), install a recent graphics driver, and finally enable the feature itself in the BIOS (look for: Advanced > PCI Subsystem Settings > Above 4G Decoding and Re-Size BAR Support). When enabled, there is no longer a separate 256 MB heap of HOST_VISIBLE video memory. Instead, the entire main DEVICE_LOCAL heap is also accessible via a HOST_VISIBLE memory type. For example (listed locally on my Radeon RX 6800 XT; note that the numbering of memory types looks different here, as I show their true indices on the global list, while vulkan.gpuinfo.org indexes them from 0 in each heap):

Heap 0
  Size = 25,454,182,400 B
  Type 1: HOST_VISIBLE, HOST_COHERENT
  Type 3: HOST_VISIBLE, HOST_COHERENT, HOST_CACHED
  Type 5: HOST_VISIBLE, HOST_COHERENT, AMD-specific flags...
  Type 7: HOST_VISIBLE, HOST_COHERENT, HOST_CACHED, AMD-specific flags...
Heap 1: DEVICE_LOCAL | MULTI_INSTANCE
  Size = 17,163,091,968 B
  Type 0: DEVICE_LOCAL
  Type 2: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT
  Type 4: DEVICE_LOCAL, AMD-specific flags...
  Type 6: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, AMD-specific flags...

What it means: Let’s disregard the AMD-specific flags for now. Here apparently heap 0 is system RAM and heap 1 is video RAM. The entire video RAM is accessible to the CPU via some memory types that are DEVICE_LOCAL and HOST_VISIBLE simultaneously. There is no separate, small 256 MB heap for that.

How to use it: When your app finds itself on a system that has SAM enabled, it can make use of more memory in the same way as I described above: write directly from the CPU and then use it on the GPU. With images, you still need to copy using vkCmdCopyBufferToImage to get OPTIMAL tiling, which is an opaque, GPU-specific pixel swizzling or other internal compression format. But for buffers at least, you can avoid having an extra copy and issuing an explicit transfer to save time and (system memory) space. What changes now is that with SAM, you can do it for over 256 MB of your data.
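One way an application might tell the small 256 MB window apart from the SAM case is to look at which heap the DEVICE_LOCAL + HOST_VISIBLE type belongs to. The sketch below uses stand-in structs mirroring VkMemoryHeap and VkMemoryType and a hypothetical classifyHostVisibleVram helper. It relies on a heuristic that is an assumption of this example: the largest DEVICE_LOCAL heap is treated as the main video memory. Note that on integrated GPUs with unified memory the same test also reports Full.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t HEAP_DEVICE_LOCAL = 0x1; // VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
constexpr uint32_t DEVICE_LOCAL      = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE      = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t HOST_COHERENT     = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr uint32_t HOST_CACHED       = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

struct MemoryHeap { uint64_t size; uint32_t flags; };                   // mirrors VkMemoryHeap
struct MemoryType { uint32_t propertyFlags; uint32_t heapIndex; };      // mirrors VkMemoryType

enum class HostVisibleVram { None, SmallWindow, Full };

// Heuristic: find the largest DEVICE_LOCAL heap, then check whether a
// DEVICE_LOCAL | HOST_VISIBLE memory type lives in a heap of that size (Full)
// or only in a smaller DEVICE_LOCAL heap (SmallWindow).
HostVisibleVram classifyHostVisibleVram(const std::vector<MemoryHeap>& heaps,
                                        const std::vector<MemoryType>& types)
{
    uint64_t largestLocal = 0;
    for (const MemoryHeap& heap : heaps)
        if ((heap.flags & HEAP_DEVICE_LOCAL) != 0)
            largestLocal = std::max(largestLocal, heap.size);

    HostVisibleVram result = HostVisibleVram::None;
    const uint32_t both = DEVICE_LOCAL | HOST_VISIBLE;
    for (const MemoryType& type : types) {
        if ((type.propertyFlags & both) != both || type.heapIndex >= heaps.size())
            continue;
        if (heaps[type.heapIndex].size == largestLocal)
            return HostVisibleVram::Full;
        result = HostVisibleVram::SmallWindow;
    }
    return result;
}
```

With the Radeon RX 5700 XT listing from point 3 this yields SmallWindow, and with the Radeon RX 6800 XT listing above it yields Full.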

A side note: If you have a texture that changes frequently, possibly writing it directly on the CPU via mapped pointer and reading on a GPU can be faster than doing vkCmdCopy*, even if it means the image has to use VK_IMAGE_TILING_LINEAR. This is what DXVK (a Direct3D implementation over Vulkan) is doing for textures created with D3D11_USAGE_DYNAMIC flag – or at least it did when I checked it some time ago, if I remember correctly. As always, it is best to implement multiple approaches and measure which one works faster.

Please note there might also be a memory type that is DEVICE_LOCAL but not HOST_VISIBLE. Whether it works faster than the HOST_VISIBLE one and so it makes any sense to use it, or it is left just for backward compatibility, is not clear to me. When not sure, it is better to select a memory type with fewer additional flags beyond the required/desired ones, or a memory type that is just higher on the list.

5. The APU way

APU, the AMD integrated graphics, shares the same memory with the CPU, just like Intel integrated graphics, but the set of exposed Vulkan memory heaps and types is the complete opposite. While Intel shows the simplest and the most natural collection, the way the AMD driver for integrated graphics does this is the most “virtualized” and troublesome. Just have a look (example: AMD Radeon(TM) Vega 8):

Heap 0
  Size = 3,855,089,664 B
  Type 0: HOST_VISIBLE, HOST_COHERENT
  Type 1: HOST_VISIBLE, HOST_COHERENT, HOST_CACHED
Heap 1: DEVICE_LOCAL
  Size = 268,435,456 B
  Type 0: DEVICE_LOCAL
  Type 1: DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT

What it means: Despite using the same physical system RAM, Vulkan memory is divided into 2 heaps and multiple types. Some of them are not DEVICE_LOCAL, some not HOST_VISIBLE. What is worse is that the DEVICE_LOCAL heap doesn’t span the entire RAM. Instead, it is only 256 MB.

How to use it: You will get into trouble on such platforms if your application tries to fit all resources needed for rendering in DEVICE_LOCAL memory, e.g. by creating critical resources like render-target and depth-stencil textures first and then streaming other resources until the heap size or budget is reached. Here, 256 MB will probably not be enough to fit even the most important, basic resources, not to mention meshes and textures needed to render a pretty scene. To support this GPU, you need to fall back to non-DEVICE_LOCAL memory types with your resources and assume they don’t work much slower than DEVICE_LOCAL. To detect that, you can possibly call vkGetPhysicalDeviceProperties and check whether VkPhysicalDeviceProperties::deviceType is VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU rather than VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU.
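The fallback to non-DEVICE_LOCAL memory can be sketched as a two-pass selection: prefer a DEVICE_LOCAL type whose heap still has room, otherwise accept any allowed type. Everything here is illustrative: MemoryType stands in for VkMemoryType, heapFree is a hypothetical per-heap "bytes still available" array that an application would maintain itself (for example from heap size or budget minus its own usage), and chooseTypeForGpuResource is an invented helper name.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t DEVICE_LOCAL  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t HOST_COHERENT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr uint32_t HOST_CACHED   = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

struct MemoryType { uint32_t propertyFlags; uint32_t heapIndex; }; // mirrors VkMemoryType

// Pass 0 considers only DEVICE_LOCAL types, pass 1 considers every allowed
// type. A type is accepted when its heap still has at least `size` bytes free.
// Returns the chosen memory type index, or -1 if nothing fits.
int32_t chooseTypeForGpuResource(const std::vector<MemoryType>& types,
                                 const std::vector<uint64_t>& heapFree,
                                 uint32_t memoryTypeBits,
                                 uint64_t size)
{
    for (int pass = 0; pass < 2; ++pass) {
        for (uint32_t i = 0; i < types.size() && i < 32; ++i) {
            if ((memoryTypeBits & (1u << i)) == 0)
                continue;
            if (pass == 0 && (types[i].propertyFlags & DEVICE_LOCAL) == 0)
                continue;
            const uint32_t heap = types[i].heapIndex;
            if (heap < heapFree.size() && heapFree[heap] >= size)
                return static_cast<int32_t>(i);
        }
    }
    return -1;
}
```

On the Vega 8 layout above, a 64 MB resource would land in the 256 MB DEVICE_LOCAL heap, while a 512 MB one would fall through to the first allowed type in the larger, non-DEVICE_LOCAL heap.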

Are DEVICE_LOCAL memory types faster to be used by the GPU than the ones without this flag? We cannot be sure, but we can assume so, as this is the meaning of this flag, after all. Even if they all refer to the same physical memory, they may e.g. use larger page sizes, different caching policies, or other optimizations.

The 5 patterns shown above may help you think about memory management in Vulkan, but always remember it is only a simplification. Reality is more complex than that and also changes over time, with new versions of graphics drivers. For example:

Intel has now started showing an additional memory type that is DEVICE_LOCAL but not HOST_VISIBLE, like the AMD APU - see the new entry for the same Intel(R) UHD Graphics 600 as mentioned above.
NVIDIA has now started showing an additional 256 MB heap that is both DEVICE_LOCAL and HOST_VISIBLE, like on AMD discrete cards - see the new entry for the same NVIDIA GeForce RTX 2070 as mentioned above.

Some additional notes:

The VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag occurs on memory types that are also HOST_VISIBLE and means that writes/reads to this memory on the CPU are made coherent automatically. Without this flag, you need to call vkFlushMappedMemoryRanges after writing and vkInvalidateMappedMemoryRanges before reading the memory via a CPU pointer, before/after you use it on the GPU, to make sure caches are flushed/invalidated. Note that mapping/unmapping memory doesn’t play a role here and is not even necessary: you can leave your memory persistently mapped while it is used on the GPU, as long as you ensure proper synchronization, e.g. using VkFence. The reason I didn’t talk about this flag is that all HOST_VISIBLE memory types on all GPUs I’ve ever seen on Windows PC also have the HOST_COHERENT flag set. This may change in the future, so a fully robust application should watch out for memory types without it and flush/invalidate accordingly, but for now, non-HOST_COHERENT memory types are a thing on mobile GPUs. The same goes for VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, a flag that can currently be found only on mobile chips and not on the PC.
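For memory types without HOST_COHERENT, the range passed to vkFlushMappedMemoryRanges or vkInvalidateMappedMemoryRanges has to respect VkPhysicalDeviceLimits::nonCoherentAtomSize: the offset must be a multiple of it, and the size must be a multiple of it or reach the end of the allocation. A minimal sketch of that rounding follows; MappedRange and the two helper functions are hypothetical names, and the values would be copied into a real VkMappedMemoryRange.

```cpp
#include <cstdint>

constexpr uint32_t HOST_VISIBLE  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t HOST_COHERENT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

struct MappedRange { uint64_t offset; uint64_t size; }; // subset of VkMappedMemoryRange

// Explicit flush/invalidate is only needed for mappable, non-coherent memory.
bool needsExplicitFlush(uint32_t propertyFlags)
{
    return (propertyFlags & HOST_VISIBLE) != 0 && (propertyFlags & HOST_COHERENT) == 0;
}

// Expands [offset, offset + size) so that the start is rounded down and the
// end is rounded up to atomSize, clamped to the end of the allocation.
// atomSize is assumed to be non-zero (nonCoherentAtomSize is at least 1).
MappedRange alignFlushRange(uint64_t offset, uint64_t size,
                            uint64_t atomSize, uint64_t allocationSize)
{
    const uint64_t begin = offset / atomSize * atomSize;
    uint64_t end = (offset + size + atomSize - 1) / atomSize * atomSize;
    if (end > allocationSize)
        end = allocationSize;
    return MappedRange{begin, end - begin};
}
```

The expanded range may cover neighboring data in the same allocation, so sub-allocated resources that are written from different threads should not share an atom.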

You may have noticed that NVIDIA offers some memory types that are not DEVICE_LOCAL but also not HOST_VISIBLE. We can assume these are in system RAM. Does it make any sense to have system memory not accessible to the CPU? What can we use it for? While it is not clear that it has any benefits over memory types from the same heap that have the HOST_VISIBLE flag, it may be used to keep a staging copy of resources transferred from video memory to system memory as part of a custom paging/residency mechanism, to be copied back to GPU memory when needed.

Finally, the mysterious “AMD-specific flags” I mentioned above are the additional flags VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD and VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD, added by the custom vendor extension VK_AMD_device_coherent_memory, that started appearing in recent drivers. My current understanding is that these memory types, while working slower than normal, offer a more reliable mechanism for accessing memory from the GPU, useful especially to write “breadcrumb markers” for debugging GPU crashes. Even if you don’t care about this extension, watch out for these extra memory types. They are there on the list, even if you don’t enable this extension. There is no way not to report them, as memory heaps and types are queried from VkPhysicalDevice, before VkDevice is created with certain extensions enabled or not enabled. The driver could then exclude these memory types from VkMemoryRequirements::memoryTypeBits for all buffers and textures if the extension was not enabled, but unfortunately it doesn’t. So if your allocation from a “normal” memory type fails, e.g. because of exceeding the heap size, and your algorithm then tries to use the next eligible type, you may end up trying to use memory with these custom AMD flags, which will generate a validation layer error, as you shouldn’t even try to use these types without enabling the appropriate extension. So better make your code aware of these additional memory property flags.
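One way to make code aware of these flags is to clear the affected types from memoryTypeBits before running any memory type search, whenever the extension is not enabled. The sketch below assumes a stand-in MemoryType struct mirroring VkMemoryType and a hypothetical excludeAmdDeviceCoherentTypes helper; the two constants carry the numeric values of the corresponding Vulkan flag bits.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t DEVICE_COHERENT_AMD = 0x40; // VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD
constexpr uint32_t DEVICE_UNCACHED_AMD = 0x80; // VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD

struct MemoryType { uint32_t propertyFlags; uint32_t heapIndex; }; // mirrors VkMemoryType

// Returns memoryTypeBits with every type carrying one of the AMD-specific
// flags removed, so that a later search or fallback never selects them.
uint32_t excludeAmdDeviceCoherentTypes(const std::vector<MemoryType>& types,
                                       uint32_t memoryTypeBits)
{
    const uint32_t amdFlags = DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD;
    for (uint32_t i = 0; i < types.size() && i < 32; ++i) {
        if ((types[i].propertyFlags & amdFlags) != 0)
            memoryTypeBits &= ~(1u << i);
    }
    return memoryTypeBits;
}
```

Applied to the Radeon RX 6800 XT listing from point 4, this would leave only types 0 to 3 eligible.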

Update 2022-02-26: I've also written an article that is kind of equivalent of this one but for Direct3D 12 - see Untangling Direct3D 12 Memory Heap Types and Pools.
