# Debugging Vulkan driver crash - equivalent of NVIDIA Aftermath

Wed, 28 Mar 2018

New-generation, explicit graphics APIs (Vulkan and DirectX 12) are more efficient and involve less CPU overhead. Part of the reason is that they don't check for most errors. In old APIs (Direct3D 9, OpenGL) every function call was validated internally and returned a success or failure code, while a driver crash indicated a bug in the driver code. New APIs, on the other hand, rely on the developer doing the right thing. Of course some functions still return an error code (especially ones that allocate memory or create some resource), but those that record commands into a command buffer just return void. If you do something illegal, you can expect undefined behavior. You can use Validation Layers / Debug Layer to do some checks, but otherwise everything may work fine on some GPUs, you may get an incorrect result, or you may experience a driver crash or timeout (handled on Windows by "TDR", Timeout Detection and Recovery). The good thing is that (unlike in old Windows XP) a crash inside the graphics driver doesn't cause a "blue screen of death" or a machine restart. The system just restarts the graphics hardware and driver, while your program receives the VK_ERROR_DEVICE_LOST code from a function like vkQueueSubmit. Unfortunately, you then don't know which specific draw call or other command caused the crash.

NVIDIA proposed a solution for that: the NVIDIA Aftermath library. It lets you (among other things) record commands that write custom "marker" data to a buffer that survives a driver crash, so you can later read it and see which command was successfully executed last. Unfortunately, this library works only with NVIDIA graphics cards and only in D3D11 and D3D12.

I was looking for a similar solution for Vulkan. When I saw that Vulkan can "import" external memory, I thought that maybe I could use the function vkCmdFillBuffer to write an immediate value to such a buffer and implement the same logic this way. I then started experimenting with extensions: VK_KHR_get_physical_device_properties_2, VK_KHR_external_memory_capabilities, VK_KHR_external_memory, VK_KHR_external_memory_win32, VK_KHR_dedicated_allocation. I was basically trying to allocate a piece of system memory and import it into Vulkan so I could write to it as a Vulkan buffer. I tried many things: CreateFileMapping + MapViewOfFile, HeapCreate + HeapAlloc and other ways, with various flags, but nothing worked for me. I also couldn't find any description or sample code showing how these extensions could be used on Windows to import system memory as a Vulkan buffer.

Everything changed when I learned that creating normal device memory and a buffer inside Vulkan is enough! It survives a driver crash, so its content can be read later via a mapped pointer. No extensions required. I don't think this is guaranteed by the specification, but it seems to work on both AMD and NVIDIA cards. So my current solution for writing markers that survive a driver crash in Vulkan is:

  1. Call vkAllocateMemory to allocate VkDeviceMemory from a memory type that has the HOST_VISIBLE + HOST_COHERENT flags. (This is system RAM. The spec guarantees that you can always find such a type.)
  2. Map the memory using vkMapMemory to get a raw CPU pointer to its data.
  3. Call vkCreateBuffer to create a VkBuffer with VK_BUFFER_USAGE_TRANSFER_DST_BIT and bind it to that memory using vkBindBufferMemory.
  4. While recording commands to a VkCommandBuffer, use vkCmdFillBuffer to write immediate data with your custom "markers" to the buffer.
  5. If everything goes right, don't forget to call vkDestroyBuffer and vkFreeMemory during shutdown.
  6. If you experience a driver crash (receive VK_ERROR_DEVICE_LOST), read the data under the pointer to see which marker values were successfully written last and deduce which of your commands might have caused the crash.
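Step 6 is purely host-side work, so it can be sketched without any Vulkan headers. The snippet below is a minimal illustration, not the code of any particular library: `MarkerLog`, `firstSuspect` and `readMarker` are hypothetical names, and it assumes a single 32-bit marker slot at the start of the buffer, zeroed before submission, that each vkCmdFillBuffer overwrites with an increasing value recorded right after the command it labels.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side record of what each marker value means.
// Value 0 is reserved for "nothing written yet".
class MarkerLog {
public:
    // Registers a label and returns the value to pass as the `data`
    // argument of vkCmdFillBuffer right after recording that command.
    uint32_t add(std::string label) {
        labels_.push_back(std::move(label));
        return static_cast<uint32_t>(labels_.size());
    }

    // Interprets the value read back after VK_ERROR_DEVICE_LOST.
    // Returns the label of the first command whose marker was not seen,
    // or an empty string if every marker was written (or the value is
    // out of range).
    std::string firstSuspect(uint32_t lastWritten) const {
        if (lastWritten >= labels_.size()) return {};
        return labels_[lastWritten];
    }

private:
    std::vector<std::string> labels_;
};

// Reads the 32-bit marker through the pointer obtained from vkMapMemory.
// volatile stops the compiler from reusing a value cached before the GPU
// wrote to the memory.
inline uint32_t readMarker(const void* mapped) {
    return *static_cast<const volatile uint32_t*>(mapped);
}
```

After a submit or wait call returns VK_ERROR_DEVICE_LOST, calling `log.firstSuspect(readMarker(ptr))` with the mapped pointer gives the first command not confirmed as finished. Treat the result as a hint rather than proof: without extra barriers, the GPU may overlap the fill with neighboring commands.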

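There is also a new extension available in the latest AMD drivers: VK_AMD_buffer_marker. It adds just one function: vkCmdWriteBufferMarkerAMD. It works similarly to the aforementioned vkCmdFillBuffer, but it adds two good things that let you write your markers with much better granularity:

  1. It takes a pipelineStage parameter, so you can choose the pipeline stage whose completion triggers the marker write (for example top of pipe or bottom of pipe).
  2. It can be called inside a render pass, while vkCmdFillBuffer, as a transfer command, can only be recorded outside of one.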

I created a simple library that implements all this logic behind an easy interface, which I called "Vulkan AfterCrash". All you need to use it is a single file: VulkanAfterCrash.h.

Update 4 April 2018: In the GDC 2018 talk "Aftermath: Advances in GPU Crash Debugging (Presented by NVIDIA)", Alex Dunn announced that a Vulkan extension from NVIDIA, called VK_NV_device_diagnostic_checkpoints, will also be available, but as far as I can see it's not publicly accessible yet.

Update 1 August 2018: Documentation for the VK_NV_device_diagnostic_checkpoints extension has been published in Vulkan specification version 1.1.82.

Update 12 September 2018: I've created a similar, portable library for Direct3D 12; see the blog post "Debugging D3D12 driver crash".
