Entries for tag "rendering", ordered from most recent. Entry count: 183.
# Improving the quality of the alpha test (cutout) materials
Fri 24 Apr 2020
This is a guest post from my friend Łukasz Izdebski Ph.D.
Today I want to share a trick that a colleague from my previous work mentioned to me a long time ago. It's about alpha tested (also known as cutout) materials. The technique consists of two neat tricks that can improve the quality of alpha tested (cutout) materials.
Alpha testing is an old technique used in computer graphics. The idea behind it is very simple. In its most basic form, a material (shader) of a rendered object can discard processed pixels based on the alpha channel of an RGBA texture. When the shaded pixel's final alpha value is less than a threshold value (the threshold is constant for an instance of the material, with a typical value of 50%), the pixel is clipped (discarded) and will not land in the shader's output framebuffer. These types of materials are commonly used to render vegetation, fences, impostors/billboards, etc.
Alpha tested materials have a certain small issue. It can be noticed when an object rendered with such a material is far away from the camera. Let the video below serve as an example of this issue.
Comments | #rendering #math Share
# Secrets of Direct3D 12: Resource Alignment
Sun 19 Apr 2020
In the new graphics APIs - Direct3D 12 and Vulkan - creation of resources (textures and buffers) is a multi-step process. You need to allocate some memory and place your resource in it. In D3D12 there is a convenient function `ID3D12Device::CreateCommittedResource` that lets you do it in one go, allocating the resource with its own, implicit memory heap, but it's recommended to allocate bigger memory blocks and place multiple resources in them using `ID3D12Device::CreatePlacedResource`.

When placing a resource in memory, you need to know and respect its required size and alignment. Size is simply the number of bytes the resource needs. Alignment is a power-of-two number which the offset of the beginning of the resource must be a multiple of (`offset % alignment == 0`). I'm thinking about writing a separate article for beginners explaining the concept of memory alignment, but that's a separate topic...
Back to graphics: in Vulkan you first need to create your resource (e.g. `vkCreateBuffer`) and then pass it to a function (e.g. `vkGetBufferMemoryRequirements`) that returns the required size and alignment of this resource (`VkMemoryRequirements::size`, `alignment`). In DirectX 12 it looks similar at first glance, or even simpler, as it's enough to have a `D3D12_RESOURCE_DESC` structure describing the resource you will create, call `ID3D12Device::GetResourceAllocationInfo`, and get `D3D12_RESOURCE_ALLOCATION_INFO` - a similar structure with `SizeInBytes` and `Alignment`. I've described it briefly in my article Differences in memory management between Direct3D 12 and Vulkan.
But if you dig deeper, there is more to it. While using the mentioned function is enough to make your program work correctly, some additional knowledge may let you save memory, so read on if you want to make your GPU memory allocator better. The first interesting fact is that alignments in D3D12, unlike in Vulkan, are really fixed constants, independent of the particular GPU or graphics driver the user may have installed:

- `D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT`
- `D3D12_DEFAULT_MSAA_RESOURCE_PLACEMENT_ALIGNMENT`

So, we have these constants, and we also have a function to query for the actual alignment. To make things even more complicated, the `D3D12_RESOURCE_DESC` structure contains an `Alignment` member, so you have one alignment on the input and another one on the output! Fortunately, the `GetResourceAllocationInfo` function allows setting `D3D12_RESOURCE_DESC::Alignment` to 0, which causes the default alignment for the resource to be returned.
Now, let me introduce the concept of "small textures". It turns out that some textures can be aligned to 4 KB and some MSAA textures can be aligned to 64 KB. This is called "small" alignment (as opposed to "default" alignment) and there are also constants for it:

- `D3D12_SMALL_RESOURCE_PLACEMENT_ALIGNMENT`
- `D3D12_SMALL_MSAA_RESOURCE_PLACEMENT_ALIGNMENT`

| | Default | Small |
|---|---|---|
| Buffer | 64 KB | |
| Texture | 64 KB | 4 KB |
| MSAA texture | 4 MB | 64 KB |
Using this smaller alignment allows you to save some GPU memory that would otherwise go unused as padding between resources. Unfortunately, it's unavailable for buffers and available only for small textures, with a very convoluted definition of "small". The rules are hidden in the description of the Alignment member of the D3D12_RESOURCE_DESC structure:
- The texture must use the `UNKNOWN` layout.
- The texture must not have the `RENDER_TARGET` or `DEPTH_STENCIL` flag.

Could `GetResourceAllocationInfo` calculate all this automatically and just return the optimal alignment for a resource, like the Vulkan function does? Possibly, but this is not what happens. You have to ask for it explicitly. When you pass `D3D12_RESOURCE_DESC::Alignment` = 0 on the input, you always get the default (larger) alignment on the output. Only when you set `D3D12_RESOURCE_DESC::Alignment` to the small alignment value does this function return the same value, confirming that the small alignment has been "granted".
There are two ways to use it in practice. The first is to calculate a texture's eligibility for small alignment on your own and pass it to the function only when you know the texture fulfills the conditions. The second is to always try the small alignment first. When it's not granted, `GetResourceAllocationInfo` returns values other than expected (in my test it's `Alignment` = 64 KB and `SizeInBytes` = 0xFFFFFFFFFFFFFFFF). Then you should call it again with the default alignment. That's the method Microsoft shows in their "Small Resources Sample". It looks good, but a problem with it is that calling this function with an alignment that is not accepted generates D3D12 Debug Layer error #721 CREATERESOURCE_INVALIDALIGNMENT. Or at least it used to, because on one of my machines the error no longer occurs. Maybe Microsoft fixed it in some recent update of Windows or Visual Studio / Windows SDK?
Here comes the last quirk of this whole D3D12 resource alignment topic: alignment applies to the offset used in `CreatePlacedResource`, which we understand as relative to the beginning of an `ID3D12Heap`, but the heap itself has an alignment too! The `D3D12_HEAP_DESC` structure has an `Alignment` member. There is no equivalent of this in Vulkan. The documentation of the `D3D12_HEAP_DESC` structure says it must be 64 KB or 4 MB. Whenever you predict you might create MSAA textures in a heap, you must choose 4 MB. Otherwise, you can use 64 KB.
Thank you, Microsoft, for making this so complicated! ;) This article wouldn't be complete without a mention of an open source library: D3D12 Memory Allocator (and the similar Vulkan Memory Allocator), which automatically handles all this complexity. It also implements both ways of using small alignment, selectable using a preprocessor macro.
Comments | #directx #rendering #microsoft Share
# Links to GDC 2020 Talks and More
Sat 28 Mar 2020
March is an important time of year for game developers, as that's when the Game Developers Conference (GDC) takes place - the most important conference of the industry. This year's edition has been cancelled because of the coronavirus pandemic, just like all other events, or rather postponed to a later date. But many companies prepared their talks anyway. Surely, they had to submit their talks a long time ago, plus all the preparation, internal technical and legal reviews... The time spent on this shouldn't be wasted. That's why many of them shared their talks online as videos and/or slides. Below, I try to gather links to these materials with a full list of titles, with a special focus on programming talks.
GDC
They organized multi-day "Virtual Talks" event presented on Twitch, with replays now available to watch and slides accessible on their website.
GDC Videos @ Twitch
GDC 2020 Virtual Talks (agenda)
GDC Vault - GDC 2020 (slides)
Monday, March 16
The 'Kine' Postmortem
Storytelling with Verbs: Integrating Gameplay and Narrative
Intrinsically Motivated Teams: The Manager's Toolbox
From 'Assassin's Creed' to 'The Dark Eye': The Importance of Themes
Representing LGBT+ Characters in Games: Two Case Studies
The Sound of Anthem
Is Your Game Cross-Platform Ready?
Forgiveness Mechanics: Reading Minds for Responsive Gameplay
Experimental AI Lightning Talk: Hyper Realistic Artificial Voices for Games
Tuesday, March 17
What to Write So People Buy: Selling Your Game Without Feeling Sleazy
Failure Workshop: FutureGrind: How To Make A 6-Month Game In Only 4.5 Years
Stress-Free Game Development: Powering Up Your Studio With DevOps
Baked in Accessibility: How Features Were Approached in 'Borderlands 3'
Matchmaking for Engagement: Lessons from 'Halo 5'
Forget CPI: Dynamic Mobile Marketing
Integrating Sound Healing Methodologies Into Your Workflow
From 0-1000: A Test Driven Approach to Tools Development
Overcoming Creative Block on 'Super Crush KO'
When Film, Games, and Theatre Collide
Wednesday, March 18
Bringing Replays to 'World of Tanks: Mercenaries'
Developing and Running Neural Audio in Constrained Environments
Mental Health State of the Industry: Past, Present & Future
Empathizing with Steam: How People Shop for Your Game
Scaling to 10 Concurrent Users: Online Infrastructure as an Indie
Crafting A Tiny Open World: 'A Short Hike' Postmortem
Indie Soapbox: UI design is fun!
Don't Ship a Product, Ship Value: Start Your Minimum Viable Product With a Solution
Day of the Devs: GDC Edition Direct
Independent Games Festival & Game Developers Choice Awards
Thursday, March 19
Machine Learning for Optimal Matchmaking
Skill Progression, Visual Attention, and Efficiently Getting Good at Esports
Making Your Game Influencer Ready: A Marketing Wishlist for Developers
How to Run Your Own Career Fair on a Tiny Budget
Making a Healthy Social Impact in Commercial Games
'Forza' Monthly: Live Streaming a Franchise
Aesthetic Driven Development: Choosing Your Art Before Making a Game
Reading the Rules of 'Baba Is You'
Friday, March 20
Beyond Games as a Service with Live Ops
Kill the Hero, Save the (Narrative) World
'Void Bastards' Art Style Origin Story
Writing Tools Faster: Design Decisions to Accelerate Tool Development
Face-to-Parameter Translation via Neural Network Renderer
The Forest Paths Method for Accessible Narrative Design
'Gears 5' Real-Time Character Dynamics
Stop & Think: Teaching Players About Media Manipulation in 'Headliner'
Microsoft
They organized "DirectX Developer Day" where they announced DirectX 12 Ultimate - a fancy name for the updated Direct3D 12 feature level 12_2 with major new features including DXR (Ray Tracing), Variable Rate Shading, and Mesh Shaders.
DirectX Developer Blog
Microsoft DirectX 12 and Graphics Education @ YouTube
DirectX Developer Day 2020 #DXDevDay @ Mixer (talks as one long stream)
DXR 1.1 Inline Raytracing
Advanced Mesh Shaders
Reinventing the Geometry Pipeline: Mesh Shaders in DirectX 12
DirectX 12 Sampler Feedback
PIX on Windows
HLSL Compiler
NVIDIA
That's actually GPU Technology Conference (GTC) - a separate event. Their biggest announcement this month was probably DLSS 2.0.
RTX-Accelerated Hair Brought to Life with NVIDIA Iray
Material Interoperability Using MaterialX, Standard Surface, and MDL
The Future of GPU Raytracing
Visuals as a Service (VaaS): How Amazon and Others Create and Use Photoreal On-Demand Product Visuals with RTX Real-Time Raytracing and the Cloud
Next-Gen Rendering Technology at Pixar
New Features in OptiX 7
Production-Quality, Final-Frame Rendering on the GPU
Latest Advancements for Production Rendering with V-Ray GPU and Real-Time Raytracing with Project Lavina
Accelerated Light-Transport Simulation using Neural Networks
Bringing the Arnold Renderer to the GPU
Supercharging Adobe Dimension with RTX-Enabled GPU Raytracing
Sharing Physically Based Materials Between Renderers with MDL
Real-Time Ray-Traced Ambient Occlusion of Complex Scenes using Spatial Hashing
I also found some other videos on Google:
DLSS - Image Reconstruction for Real-time Rendering with Deep Learning
NVIDIA Vulkan Features Update – including Vulkan 1.2 and Ray Tracing
3D Deep Learning in Function Space
Unleash Computer Vision at the Edge with Jetson Nano and Always AI
Optimized Image Classification on the Cheap
Cisco and Patriot One Technologies Bring Machine Learning Projects from Imagination to Realization (Presented by Cisco)
AI @ The Network Edge
Animation, Segmentation, and Statistical Modeling of Biological Cells Using Microscopy Imaging and GPU Compute
Improving CNN Performance with Spatial Context
Weakly Supervised Training to Achieve 99% Accuracy for Retail Asset Protection
Combating Problems Like Asteroid Detection, Climate Change, Security, and Disaster Recovery with GPU-Accelerated AI
Condensa: A Programming System for DNN Model Compression
AI/ML with vGPU on Openstack or RHV Using Kubernetes
CTR Inference Optimization on GPU
NVIDIA Tools to Train, Build, and Deploy Intelligent Vision Applications at the Edge
Leveraging NVIDIA’s Technology for the Ultimate Industrial Autonomous Transport Robot
How to Create Generalizable AI?
Isaac Sim 2020 Deep Dive
But somehow I can't find a full list of these talks with links anywhere on their website. More talks are accessible after free registration on the event website.
Intel
GDC 2020. A Repository for all Intel Technical Content prepared for GDC
Intel Software @ YouTube
Multi-Adapter with Integrated and Discrete GPUs
Optimizing World of Tanks*: from Laptops to High-End PCs
Intel® oneAPI Rendering Toolkit and its Application to Games
Intel® ISPC in Unreal Engine 4: A Peek Behind the Curtain
Variable Rate Shading with Depth of Field
For the Alliance! World of Warcraft and Intel discuss an Optimized Azeroth
Intel® Open Image Denoise in Blender - GDC 2020
Variable Rate Shading Tier 1 with Microsoft DirectX* 12 from Theory to Practice
Does Your Game's Performance Spark Joy? Profiling with Intel® Graphics Performance Analyzers
Boost CPU performance with Intel® VTune Profiler
DeepMotion | Optimize CPU Performance with Intel VTune Profiler
Google for Games Developer Summit 2020 @ YouTube (a collection of playlists)
Mobile Track
Google for Games Developer Summit Keynote
What's new in Android game development tools
What's new in Android graphics optimization tools
Android memory tools and best practices
Deliver higher quality games on more devices
Google Play Asset Delivery for games: Product deep dive and case studies
Protect your game's integrity on Google Play
Accelerate your business growth with leading ad strategies
Firebase games SDK news
Cloud Firestore for Game Developers
Clouds Track
Google for Games Developer Summit Keynote
Scaling globally with Game Servers and Agones (Google Games Dev Summit)
How to make multiplayer matchmaking easier and scalable with Open Match (Google Games Dev Summit)
Unity Game Simulation: Find the perfect balance with Unity and GCP (Google Games Dev Summit)
How Dragon Quest Walk handled millions of players using Cloud Spanner (Google Games Dev Summit)
Building gaming analytics online services with Google Cloud and Improbable (Google Games Dev Summit)
Stadia Track
Google for Games Developer Summit Keynote
Bringing Destiny to Stadia: A postmortem (Google Games Dev Summit)
Stadia Games & Entertainment presents: Creating for content creators (Google Games Dev Summit)
Empowering game developers with Stadia R&D (Google Games Dev Summit)
Stadia Games & Entertainment presents: Keys to a great game pitch (Google Games Dev Summit)
Supercharging discoverability with Stadia (Google Games Dev Summit)
Ubisoft
Ubisoft’s GDC 2020 Talks Online Now
Online Game Technology Summit: Start-And-Discard: A Unified Workflow for Development and Live
Finding Space for Sound: Environmental Acoustics
Game Server Performance
NPC Voice Design
Machine Learning Summit: Ragdoll Motion Matching
Machine Learning, Physics Simulation, Kolmogorov Complexity, and Squishy Bunnies
Khronos: I can't find any information about individual talks from them. There is only a note about GDC 2020 Live Streams pointing to GDC Twitch channel.
AMD: No information.
Sony: No information.
Consoles: Last but not least, March 2020 was also the time when the details of the upcoming new generation of consoles were announced - Xbox Series X and PlayStation 5. You can easily find information about them by searching the Internet, so I won't recommend any links.
If you know about any more GDC 2020 or other important talks related to programming that have been released recently, please contact me or leave a comment below and I will add them!
Maybe there is a positive side to this pandemic? For GDC to take place, developers had to pay a $1000+ entrance fee for the event. They had to book a flight to California and a hotel in San Francisco, which was prohibitively expensive for many. They had to apply for an ESTA or a visa to the US, which not everyone could get. And the talks eventually landed behind a paywall, earning the organizers even more money. Now we can educate ourselves for free from the safety and convenience of our offices and homes.
Comments | #gdc #intel #nvidia #google #events #directx #rendering #microsoft Share
# Initializing DX12 Textures After Allocation and Aliasing
Thu 19 Mar 2020
If you are a graphics programmer using Direct3D 12, you may wonder what the initial content of a newly allocated buffer or texture is. Microsoft admitted it is not clearly defined, but in practice such new memory is filled with zeros (unless you use the new flag `D3D12_HEAP_FLAG_CREATE_NOT_ZEROED`). See the article "Coming to DirectX 12: More control over memory allocation". This behavior has its pros and cons. Clearing all new memory makes sense, as the operating system surely doesn't want to disclose to us the data left by some other process, possibly containing passwords or other sensitive information. However, writing to a long memory region takes a lot of time. Maybe that's one reason GPU memory allocation is so slow. I've seen large allocations take even hundreds of milliseconds.
There are situations when the memory of your new buffer or texture is not zeroed, but may contain some random data. The first case is when you create a resource using the CreatePlacedResource function, inside a memory block that you might have used before for some other, already released resources. That's also what the D3D12 Memory Allocator library does by default.
It is important to know that in this case you must initialize the resource in a specific way! The rules are described on the page "ID3D12Device::CreatePlacedResource method" and say: if your resource is a texture that has either the `RENDER_TARGET` or `DEPTH_STENCIL` flag, you must initialize it after allocation and before any other usage using one of these methods:

- a clear operation (`ClearRenderTargetView` or `ClearDepthStencilView`),
- a discard operation (`DiscardResource`),
- a copy operation to an entire subresource (`CopyResource`, `CopyBufferRegion`, or `CopyTextureRegion`).
Please note that rendering to the texture as a Render Target or writing to it as an Unordered Access View is not on the list! It means that, for example, if you implement a postprocessing effect, allocate an intermediate 1920x1080 texture, and want to overwrite all its pixels by rendering a fullscreen quad or triangle (better to use one triangle - see the article "GCN Execution Patterns in Full Screen Passes"), then initializing the texture before your draw call seems redundant, but you still need to do it.
What happens if you don't? Why are we asked to perform this initialization? Wouldn't we just see random colorful pixels if we used an uninitialized texture, which may or may not be a problem depending on our use case? Not really... As I explained in my previous post "Texture Compression: What Can It Mean?", a texture may be stored in video memory in some vendor-specific, compressed format. If the metadata of such compression is uninitialized, the consequences can be more severe than observing random colors. It's actually undefined behavior. On one GPU everything may work fine, while on another you may see graphical corruptions that even rendering to the texture as a Render Target cannot fix (or maybe a total GPU crash?). I've experienced this problem myself recently.

Thinking in terms of internal GPU texture compression also helps to explain why this initialization is required only for render-target and depth-stencil textures: GPUs use more aggressive compression techniques for those. Having the requirements for initialization defined like that implies that you can leave buffers and other textures uninitialized and just experience random data in their content, without the danger of anything worse happening.
I feel that a side note on the `ID3D12GraphicsCommandList::DiscardResource` function is needed, because many of you probably don't know it. Contrary to its name, this function doesn't release a resource or its memory. Its meaning is closer to the mapping flag `D3D11_MAP_WRITE_DISCARD` from the old D3D11. It informs the driver that the current content of the resource might be garbage; we know about it and we don't care - we don't need it, we are not going to read it, we are just going to fill the entire resource with new content. Sometimes, calling this function may let the driver reach better performance. For example, it may skip downloading previous data from VRAM to the graphics chip. This is especially important and beneficial on tile-based, mobile GPUs. In some other cases, like the initialization of a newly allocated texture described here, it is required. Inside it, the driver might, for example, clear the metadata of its internal compression format. It is correct to call `DiscardResource` and then render to your new texture as a Render Target. It could even be faster than doing `ClearRenderTargetView` instead of `DiscardResource`. By the way, if you happen to use Vulkan and have still read this far, you might find it useful to know that the Vulkan equivalent of `DiscardResource` is an image memory barrier with `oldLayout = VK_IMAGE_LAYOUT_UNDEFINED`.
There is a second case when a resource may contain some random data. It happens when you use memory aliasing. This technique allows you to save GPU memory by creating multiple resources in the same or overlapping regions of an `ID3D12Heap`. It was not possible in the old APIs (Direct3D 11, OpenGL), where each resource got its own implicit memory allocation. In Direct3D 12 you can use `CreatePlacedResource` to put your resource in a specific heap, at a specific offset. It's not allowed to use aliasing resources at the same time. Sometimes you need some intermediate buffers or render targets only for a specific, short time during each frame. You can then reuse their memory for different resources needed in a later part of the frame. That's the key idea of aliasing.
To do it correctly, you must do two things. First, between the usages you must issue a barrier of the special type `D3D12_RESOURCE_BARRIER_TYPE_ALIASING`. Second, the resource to be used next (also called the "ResourceAfter", as opposed to the "ResourceBefore") needs to be initialized. The idea is like what I described before. You can find the rules of this initialization on the page "Memory Aliasing and Data Inheritance". This time, however, we are told to initialize every texture that has the `RENDER_TARGET` or `DEPTH_STENCIL` flag with 1. a clear or 2. a copy operation to an entire subresource. `DiscardResource` is not allowed. Whether it's an omission or intentional, we have to stick to these rules, even if we feel such clears are redundant and will slow down our rendering. Otherwise we may experience hard-to-find bugs on some GPUs.
Update 2020-07-14: An engineer from Microsoft told me that the lack of `DiscardResource` among the valid methods of initializing a texture after aliasing is probably a docs oversight, and it is correct to initialize it this way, so the last picture should actually have Discard as well, just like the first one.
Update 2020-12-22: Aliasing barrier and a Clear, Discard, or Copy is not all you need to do to properly initialize a texture after aliasing. You also need to take care of its state by issuing some transition barrier. To read more, see my new post: “States and Barriers of Aliasing Render Targets”.
Comments | #directx #rendering Share
# Texture Compression: What Can It Mean?
Sun 15 Mar 2020
"Data compression - the process of encoding information using fewer bits than the original representation." That's the definition from Wikipedia. But when we talk about textures (images that we use while rendering 3D graphics), it's not that simple. There are 4 different things we can mean when talking about texture compression, some of which you may not know. In this article, I'd like to give you some basic information about them.

1. Lossless data compression. That's the compression used to shrink binary data in size without losing a single bit. We may talk about compression algorithms and the libraries that implement them, like the popular zlib or LZMA SDK. We may also mean file formats like ZIP or 7Z, which use these algorithms, but also define a way to pack multiple files with their whole directory structure into a single archive file.

An important thing to note here is that we can use this compression for any data. Some file types, like text documents or binary executables, have to be compressed in a lossless way so that no bits are lost or altered. You can also compress image files this way. The compression ratio depends on the data. The compressed file will be smaller if there are many repeating patterns - when the data looks pretty boring, like many pixels of the same color. If the data is more varied - each next pixel has an even slightly different value - then you may end up with a compressed file as large as the original one, or even larger. For example, the following two images have size 480 x 480. When saved as an uncompressed BMP R8G8B8 file, they both take 691,322 bytes. When compressed to a ZIP file, the first one is only 15,993 bytes, while the second one is 552,782 bytes.
We can talk about this compression in the context of textures because assets in games are often packed into archives in some custom format which protects the data from modification, speeds up loading, and may also use compression. For example, the new Call of Duty Warzone takes 162 GB of disk space after installation, but it has only 442 files because developers packed the largest data in some archives in files Data/data/data.000, 001 etc., 1 GB each.
2. Lossy compression. These are the algorithms that allow some data loss, but offer higher compression ratios than lossless ones. We use them for specific kinds of data, usually some media - images, sound, and video. For video it's virtually essential, because raw uncompressed data would take enormous space for each second of recording. Algorithms for lossy compression use the knowledge about the structure of the data to remove the information that will be unnoticeable or degrade quality to the lowest degree possible, from the perspective of human perception. We all know them - these are formats like JPEG for images and MP3 for music.
They have their pros and cons. JPEG compresses images in 8x8 blocks using the Discrete Cosine Transform (DCT). You can find an awesome, in-depth explanation of it on the page: Unraveling the JPEG. It's good for natural images, but with text and diagrams it may fail to maintain the desired quality. My first example saved as JPEG with Quality = 20% (this is very low, I usually use 90%) takes only 24,753 B, but it looks like this:
GIF is good for such synthetic images, but fails on natural images. I saved my second example as GIF with a color palette of 32 entries. The file is only 90,686 B, but it looks like this (look closer to see dithering used due to a limited number of colors):
Lossy compression is usually accompanied by lossless compression - file formats like JPEG, GIF, MP3, MP4 etc. compress the data losslessly on top of its core algorithm, so there is no point in compressing them again.
3. GPU texture compression. Here comes the interesting part. All the formats described so far are designed to optimize data storage and transfer. We need to decompress all the textures packed in ZIP files or saved as JPEG before uploading them to video memory and using them for rendering. But there are other types of texture compression formats that can be used by the GPU directly. They are lossy as well, but they work in a different way - they use a fixed number of bytes per block of NxN pixels. Thanks to this, a graphics card can easily pick the right block from memory and uncompress it on the fly, e.g. while sampling the texture. Some of these formats are BC1..7 (which stands for Block Compression) or ASTC (used on mobile platforms). For example, BC7 uses 1 byte per pixel, or 16 bytes per 4x4 block. You can find an overview of these formats here: Understanding BCn Texture Compression Formats.
The only file format I know of which supports this compression is DDS, as it allows storing any texture that can be loaded straight into DirectX in various pixel formats, including not only block compressed but also cube, 3D, etc. Most game developers design their own file formats for this purpose anyway, to load them straight into GPU memory with no conversion.
4. Internal GPU texture compression. The pixels of a texture may not be stored in video memory the way you think - row-major order, one pixel after the other, in R8G8B8A8 or whatever format you chose. When you create a texture with `D3D12_TEXTURE_LAYOUT_UNKNOWN` / `VK_IMAGE_TILING_OPTIMAL` (always do that, except for some very special cases), the GPU is free to use some optimized internal format. This may not be true "compression" by its definition, because it must be lossless, so the memory reserved for the texture will not be smaller. It may even be larger because of the requirement to store additional metadata. (That's why you have to take care of the extra `VK_IMAGE_ASPECT_METADATA_BIT` when working with sparse textures in Vulkan.) The goal of these formats is to speed up access to the texture.
Details of these formats are specific to GPU vendors and may or may not be public. Some ideas of how a GPU could optimize a texture in its memory include: swizzling pixels in a Morton-like order to improve cache locality, delta color compression (storing per-block deltas instead of raw values), and keeping "fast clear" metadata so that clearing a texture doesn't require writing all of its pixels.
How to make the best use of those internal GPU compression formats if they differ per graphics card vendor and we don't know their details? Just make sure you leave the driver as many optimization opportunities as possible by:

- using `D3D12_TEXTURE_LAYOUT_UNKNOWN` / `VK_IMAGE_TILING_OPTIMAL`,
- not using the flags `D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET`, `D3D12_RESOURCE_FLAG_ALLOW_DEPTH_STENCIL`, `D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS`, `D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS` / `VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT`, `VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT`, `VK_IMAGE_USAGE_STORAGE_BIT`, `VK_SHARING_MODE_CONCURRENT` for any textures that don't need them,
- not using `DXGI_FORMAT_*_TYPELESS` / `VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT` for any textures that don't need them,
- avoiding `D3D12_RESOURCE_STATE_COMMON` / `VK_IMAGE_LAYOUT_GENERAL` where a more specific state/layout suffices.

See also the article Delta Color Compression Overview at GPUOpen.com.
Summary: As you can see, the term "texture compression" can mean different things, so when talking about anything like this, always make sure it's clear what you mean, unless it's obvious from the context.
Comments | #rendering #vulkan #directx Share
# Secrets of Direct3D 12: Copies to the Same Buffer
Wed 04 Mar 2020
Modern graphics APIs (D3D12, Vulkan) are complicated. They are designed to squeeze maximum performance out of graphics cards. GPUs are so fast at rendering not because they work at high clock frequencies (actually they don't - a frequency of 1.5 GHz is high for a GPU, as opposed to many GHz on a CPU), but because they execute their workloads in a highly parallel and pipelined way. In other words: many tasks may be executed at the same time. To make this work correctly, we must manually synchronize them using barriers. At least sometimes...
Let's consider a few scenarios. Scenario 1: A draw call rendering to a texture as a Render Target View (RTV), followed by a draw call sampling from this texture as a Shader Resource View (SRV). We know we must put a D3D12_RESOURCE_BARRIER_TYPE_TRANSITION
barrier in between them to transition the texture from D3D12_RESOURCE_STATE_RENDER_TARGET
to D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE
.
Scenario 2: Two subsequent compute shader dispatches, executed in one command list, access the same texture as an Unordered Access View (UAV). The texture stays in D3D12_RESOURCE_STATE_UNORDERED_ACCESS
, but still if the second dispatch needs to wait for the first one to finish, we must issue a barrier of special type D3D12_RESOURCE_BARRIER_TYPE_UAV
. That's what this type of barrier was created for.
Scenario 3: Two subsequent draw calls rendering to the same texture as a Render Target View (RTV). The texture stays in the same state D3D12_RESOURCE_STATE_RENDER_TARGET
. We don't need to put a barrier between them. The draw calls are free to overlap in time, but the GPU has its own ways to guarantee that multiple writes to the same pixel always happen in the order of the draw calls, and even more - in the order of the primitives as given in the index and vertex buffers!
Now to scenario 4, the most interesting one: Two subsequent copies to the same resource. Let's say we work with buffers here, just for simplicity, but I suspect textures work the same way. What if the copies affect the same or overlapping regions of the destination buffer? Do they always execute in order, or can they overlap in time? Do we need to synchronize them to get a proper result? What if some copies are fast, made from another buffer in GPU memory (D3D12_HEAP_TYPE_DEFAULT
) and some are slow, accessing system memory (D3D12_HEAP_TYPE_UPLOAD
) through PCI-Express bus? What if the card uses a compute shader to perform the copy? Isn't this the same as scenario 2?
That's a puzzle my colleague asked about recently. I didn't know the immediate answer, so I wrote a simple program to test this case. I prepared two buffers: gpuBuffer
placed in DEFAULT heap and cpuBuffer
placed in UPLOAD heap, 120 MB each, both filled with some distinct data and both transitioned to D3D12_RESOURCE_STATE_COPY_SOURCE
. I then created another buffer destBuffer
to be the destination of my copies. During the test I executed a few CopyBufferRegion
calls, from one source buffer or the other, with a small or large number of bytes. I then read back destBuffer and checked whether the result was valid.
g_CommandList->CopyBufferRegion(destBuffer, 5 * (10 * 1024 * 1024),
gpuBuffer, 5 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
g_CommandList->CopyBufferRegion(destBuffer, 3 * (10 * 1024 * 1024),
cpuBuffer, 3 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
g_CommandList->CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
gpuBuffer, 102714720, 4);
g_CommandList->CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
cpuBuffer, 102714720, 4);
It turned out the result was always valid! I checked it on both an AMD card (Radeon RX 5700 XT) and an NVIDIA card (GeForce GTX 1070). The driver serializes such copies, making sure they execute in order, and the destination data is as expected even when the memory regions written by the copy operations overlap.
I also made a capture using Radeon GPU Profiler (RGP) and looked at the graph. The copies are executed as a compute shader, large ones are split into multiple events, but after each copy there is an implicit barrier inserted by the driver, described as:
CmdBarrierBlitSync()
The AMD driver issued a barrier in between back-to-back blit operations to the same destination resource.
I think it explains everything. If the driver had to insert such a barrier, we can suspect it is required. I just can't find anything in the Direct3D documentation that would explicitly specify this behavior. If you find it, please let me know - e-mail me or leave a comment under this post.
Maybe we could insert a barrier manually in between these copies, just to make sure? Nope, there is no way to do it. I tried two different ways:
1. A UAV barrier like this:
D3D12_RESOURCE_BARRIER uavBarrier = {};
uavBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uavBarrier.UAV.pResource = destBuffer;
g_CommandList->ResourceBarrier(1, &uavBarrier);
It triggers a D3D Debug Layer error that complains about the buffer not having UAV among its flags:
D3D12 ERROR: ID3D12GraphicsCommandList::ResourceBarrier: Missing resource bind flags. [ RESOURCE_MANIPULATION ERROR #523: RESOURCE_BARRIER_MISSING_BIND_FLAGS]
2. A transition barrier from COPY_DEST to COPY_DEST:
D3D12_RESOURCE_BARRIER transitionBarrier = {};
transitionBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
transitionBarrier.Transition.pResource = destBuffer;
transitionBarrier.Transition.StateBefore = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.StateAfter = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
g_CommandList->ResourceBarrier(1, &transitionBarrier);
Bad luck again. This time the Debug Layer complains about the "before" and "after" states having to be different.
D3D12 ERROR: ID3D12CommandList::ResourceBarrier: Before and after states must be different. [ RESOURCE_MANIPULATION ERROR #525: RESOURCE_BARRIER_MATCHING_STATES]
Bonus scenario 5: ClearRenderTargetView
, followed by a draw call that renders to the same texture as a Render Target View. The texture needs to be in D3D12_RESOURCE_STATE_RENDER_TARGET
for both operations. We don't put a barrier in between them and don't even have a way to do it, just like in scenario 4. So Clear operations must also guarantee the order of their execution, although I can't find anything about it in the DX12 documentation either.
What a mess! It seems that Direct3D 12 sometimes requires explicit barriers between our commands, automatically synchronizes some others, and doesn't describe it all clearly in the documentation. The only general rule I can think of is that it cannot track resources bound through descriptors (like SRV, UAV), but does track those bound in a more direct way (as render target, depth-stencil, clear target, copy destination) and synchronizes them automatically. I hope this post helped to clarify some situations that may happen in your rendering code.
Comments | #directx #rendering Share
# How to Correctly Interpolate Vertex Attributes on a Parallelogram Using Modern GPUs?
Thu
13
Feb 2020
This is probably the first guest post ever on my blog. It was written by my friend Łukasz Izdebski, Ph.D.
In a nutshell, today's graphics cards render meshes using only triangles, as shown in the picture below.
I don't want to describe how it is done (a lot of information on this topic can easily be found on the Internet) or cover the whole graphics pipeline, but for a short recap: In the first programmable stage of the rendering pipeline, the vertex shader receives a single vertex (with its assigned data, which I will refer to later) and outputs one transformed vertex. Usually it is a transformation from a 3D coordinate system to Normalized Device Coordinates (NDC) (information about NDC can be found in Coordinate Systems). After primitive clipping, perspective division, and the viewport transform, vertices are projected onto the 2D screen, which is drawn to the monitor output.
But this is only half the story. I was describing what happens to vertices, but at the beginning I mentioned triangles, with three vertices forming one triangle. After vertex processing comes Primitive Assembly, and the next stage is Rasterization. This stage is very important because it generates the fragments (pixels) lying inside the triangle, as shown in the picture below.
How are the colors of the pixels inside the triangle generated? Each vertex can contain not only one piece of data - its 3D coordinates in the virtual world - but also additional data called attributes. Those attributes can be the color of the vertex, a normal vector, texture coordinates, etc.
How, then, are those rainbow colors in the picture above generated, when, as I said, only one color can be set at each of the three vertices of the triangle? The answer is interpolation. As we can read on Wikipedia, interpolation in mathematics is a type of estimation, a method of generating new data points between a discrete set of known data points. In the described problem it's about generating interpolated colors inside the rendered triangle.
The way to achieve this is by using barycentric coordinates. (These coordinates can be used not only for interpolation but also to determine whether a fragment lies inside the triangle. More on this topic can be read in Rasterization: a Practical Implementation.) In short, barycentric coordinates are a triple of numbers λ1, λ2, λ3 ≥ 0 (for points inside the triangle) where λ1 + λ2 + λ3 = 1. When a triangle is rasterized, the proper barycentric coordinates are calculated for every fragment. Then the color C of the fragment can be calculated as a weighted sum of the colors C1, C2, C3 at the vertices, where the weights are of course the barycentric coordinates: C = C1 * λ1 + C2 * λ2 + C3 * λ3
In this way not only color can be interpolated, but any vertex attribute. (This functionality can be disabled by using a proper Interpolation Qualifier on an attribute in the vertex shader source code, like flat
).
When dealing with triangle meshes, this way of interpolating attributes gives correct results, but when rendering 2D sprites or font glyphs, some artifacts may occur under specific circumstances. When we want to render a gradient which starts in one of the corners of a sprite (see the picture below), we can see quite ugly results :(
This happens because interpolation occurs on two triangles independently of each other. Graphics cards can work only on triangles, not quads. In this case we want the interpolation to occur over the quad, not each triangle, as pictured below:
How do we trick the graphics card into quadrilateral attribute interpolation? One way to render this type of gradient is tessellation - subdividing the quad geometry. Tessellation shaders are available starting from DirectX 11, OpenGL 4.0, and Vulkan 1.0. A simple example of how it looks depending on different tessellation parameters (more details about tessellation can be found in Tessellation Stages) can be seen in the animated picture below.
As we can see, when the quad is subdivided into more than 16 pieces, we get the desired visual result, but as my high school teacher used to say: "Don't shoot a fly with a cannon" - using tessellation to render such a simple thing is overkill.
That is why I developed a new technique to achieve this goal. First, we need access to the barycentric coordinates in the fragment shader. DirectX 12 HLSL exposes them via SV_Barycentrics. In Vulkan they are available through the AMD VK_AMD_shader_explicit_vertex_parameter extension and the Nvidia VK_NV_fragment_shader_barycentric extension. Maybe in the near future they will be available in the core Vulkan specification and, more importantly, from all hardware vendors.
If we are not fortunate enough to have these coordinates as built-ins, we can generate them by adding some extra data: one new vertex attribute and one new uniform (constant) value. Here are the details of this solution. Consider a quadrilateral built from four vertices and two triangles, as shown in the picture below.
Additional attribute Barycentric
in the vertices is a 2D vector and should contain the following values:
A = (1,0) B = (0,0) C = (0,1) D = (0,0)
The next step is to calculate extra constant data for the parameter to be interpolated as shown in the picture above (in this case the color attribute), using the equation:
ExtraColorData = - ColorAtVertexA + ColorAtVertexB - ColorAtVertexC + ColorAtVertexD
The fragment shader that renders the interpolation we are looking for looks like this:
/////////////////////////////////////////////////////////////////////
// GLSL Fragment Shader
/////////////////////////////////////////////////////////////////////
#version 450
#extension GL_ARB_separate_shader_objects : enable

layout(binding = 0) uniform CONSTANT_BUFFER
{
    vec4 ExtraColorData;
} cbuffer;

in block
{
    vec4 Color;
    vec2 Barycentric;
} PSInput;

layout(location = 0) out vec4 SV_TARGET;

void main()
{
    SV_TARGET = PSInput.Color + PSInput.Barycentric.x * PSInput.Barycentric.y * cbuffer.ExtraColorData;
}

/////////////////////////////////////////////////////////////////////
// HLSL Pixel Shader
/////////////////////////////////////////////////////////////////////
cbuffer CONSTANT_BUFFER : register(b0)
{
    float4 ExtraColorData;
};

struct PSInput
{
    float4 color : COLOR;
    float2 barycentric : TEXCOORD0;
};

float4 PSMain(PSInput input) : SV_TARGET
{
    return input.color + input.barycentric.x * input.barycentric.y * ExtraColorData;
}
That's all. As we can see, it should not have a big performance overhead. When barycentric coordinates become more widely available as built-ins, the memory overhead will also be minimal.
The reader will probably ask whether this looks correct in a 3D perspective scenario, not only when the triangle is parallel to the screen.
As shown in the picture above, the hardware properly interpolates the data using additional computation, as described here (Rasterization: a Practical Implementation), so this new method works with perspective as it should.
Does this method give proper results only on squares?
This is a good question! The solution described above works on all parallelograms. I’m now working on a solution for all convex quadrilaterals.
What else can we use this method for?
One more usage comes to my mind: a post-process fullscreen quad. As I mentioned earlier, graphics cards do not render quads, but triangles. To get proper interpolation of attributes, 3D engines render one BIG triangle which covers the whole screen. With this new approach, rendering a quad built from two triangles becomes viable, and attributes that need quadrilateral interpolation can be calculated in the way shown above.
Comments | #math #rendering Share
# How Do Graphics Cards Execute Vector Instructions?
Sun
19
Jan 2020
Intel announced that together with their new graphics architecture they will provide a new API called oneAPI, which will allow programming GPUs, CPUs, and even FPGAs in a unified way, and will support SIMD as well as SIMT mode. If you are not sure what that means but want to be prepared for it, read this article. Here I try to explain concepts like SIMD, SIMT, AoS, SoA, and vector instruction execution on CPUs and GPUs. I think it may interest you as a programmer even if you don't write shaders or GPU computations. Also, don't worry if you don't know any assembly language - the examples below are simple and should be understandable anyway. Below I will show three examples:
1. CPU, scalar
Let's say we write a program that operates on a numerical value. The value comes from somewhere, and before we pass it on for further processing, we want to execute the following logic: if it's negative (less than zero), increase it by 1. In C++ it may look like this:
float number = ...;
bool needsIncrease = number < 0.0f;
if(needsIncrease)
number += 1.0f;
If you compile this code in Visual Studio 2019 for the 64-bit x86 architecture, you may get the following assembly (comments after the semicolons added by me):
00007FF6474C1086 movss xmm1,dword ptr [number] ; xmm1 = number
00007FF6474C108C xorps xmm0,xmm0 ; xmm0 = 0
00007FF6474C108F comiss xmm0,xmm1 ; compare xmm0 with xmm1, set flags
00007FF6474C1092 jbe main+32h (07FF6474C10A2h) ; jump to 07FF6474C10A2 depending on flags
00007FF6474C1094 addss xmm1,dword ptr [__real@3f800000 (07FF6474C2244h)] ; xmm1 += 1
00007FF6474C109C movss dword ptr [number],xmm1 ; number = xmm1
00007FF6474C10A2 ...
There is nothing special here, just normal CPU code. Each instruction operates on a single value.
2. CPU, vector
Some time ago vector instructions were introduced to CPUs. They allow operating on many values at a time, not just a single one. For example, the CPU vector extension called Streaming SIMD Extensions (SSE) is accessible in Visual C++ using data types like __m128
(which can store 128-bit value representing e.g. 4x 32-bit floating-point numbers) and intrinsic functions like _mm_add_ps
(which can add two such variables per-component, outputting a new vector of 4 floats as a result). We call this approach Single Instruction Multiple Data (SIMD), because one instruction operates not on a single numerical value, but on a whole vector of such values in parallel.
Let's say we want to implement the following logic: given a vector (x, y, z, w) of 4x 32-bit floating-point numbers, if its first component (x) is less than zero, increase the whole vector per-component by (1, 2, 3, 4). In Visual C++ we can implement it like this:
const float constant[] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 number = ...;
float x; _mm_store_ss(&x, number);
bool needsIncrease = x < 0.0f;
if(needsIncrease)
number = _mm_add_ps(number, _mm_loadu_ps(constant));
Which gives the following assembly:
00007FF7318C10CA comiss xmm0,xmm1 ; compare xmm0 with xmm1, set flags
00007FF7318C10CD jbe main+69h (07FF7318C10D9h) ; jump to 07FF7318C10D9 depending on flags
00007FF7318C10CF movaps xmm5,xmmword ptr [__xmm@(...) (07FF7318C2250h)] ; xmm5 = (1, 2, 3, 4)
00007FF7318C10D6 addps xmm5,xmm1 ; xmm5 = xmm5 + xmm1
00007FF7318C10D9 movaps xmm0,xmm5 ; xmm0 = xmm5
This time xmm
registers are used to store not just single numbers, but vectors of 4 floats. A single instruction - addps
(as opposed to addss
used in the previous example) adds 4 numbers from xmm1
to 4 numbers in xmm5
.
It may seem obvious, but it's important for future considerations to note that the condition here and the boolean variable driving it (needsIncrease
) is not a vector, but a single value, calculated based on the first component of vector number
. Such a single value in the SIMD world is also called a "scalar". Based on it, the condition is true or false and the branch is taken or not, so either the whole vector is increased by (1, 2, 3, 4), or nothing happens. This is how CPUs work: we execute just one program, with one thread, which has one instruction pointer that executes its instructions sequentially.
3. GPU
Now let's move on from the CPU world to the world of the graphics processor (GPU). GPUs are programmed in different languages. One of them is GLSL, used in the OpenGL and Vulkan graphics APIs. In this language there is also a data type that holds 4x 32-bit floating-point numbers, called vec4
. You can add a vector to a vector per-component using just the '+' operator.
The same logic as in section 2, implemented in GLSL, looks like this:
vec4 number = ...;
bool needsIncrease = number.x < 0.0;
if(needsIncrease)
number += vec4(1.0, 2.0, 3.0, 4.0);
When you compile a shader with such code for an AMD GPU, you may see the following GPU assembly. (For offline shader compilation I used Radeon GPU Analyzer (RGA) - a free tool from AMD.)
v_add_f32 v5, 1.0, v2 ; v5 = v2 + 1
v_add_f32 v1, 2.0, v3 ; v1 = v3 + 2
v_cmp_gt_f32 vcc, 0, v2 ; compare v2 with 0, set flags
v_cndmask_b32 v2, v2, v5, vcc ; override v2 with v5 depending on flags
v_add_f32 v5, lit(0x40400000), v4 ; v5 = v4 + 3
v_cndmask_b32 v1, v3, v1, vcc ; override v1 with v3 depending on flags
v_add_f32 v3, 4.0, v0 ; v3 = v0 + 4
v_cndmask_b32 v4, v4, v5, vcc ; override v4 with v5 depending on flags
v_cndmask_b32 v3, v0, v3, vcc ; override v3 with v0 depending on flags
You can see something interesting here: although the high-level shader language is vector-based, the actual GPU assembly operates on individual vector components (x, y, z, w) using separate instructions, and stores their values in separate registers (v2, v3, v4, v0). Does that mean GPUs don't support vector instructions?!
Actually, they do, but differently. The first GPUs from decades ago (right after they became programmable with shaders) really operated on those vectors the way we see them. Nowadays, what we treat as vector components (x, y, z, w) or color components (R, G, B, A) in the shaders we write becomes separate scalar values. But GPU instructions are still vector instructions, as denoted by their "v_" prefix. The SIMD in GPUs is used to process not a single vertex or pixel, but many of them (e.g. 64) at once. It means that a single register like v2
stores 64x 32-bit numbers and a single instruction like v_add_f32
adds per-component 64 of such numbers - just Xs or Ys or Zs or Ws, one for each pixel calculated in a separate SIMD lane.
Some people call this Structure of Arrays (SoA), as opposed to Array of Structures (AoS). The term comes from how the data structure, as stored in memory, could be defined. If we were to define such a data structure in C, the way we see it when programming in GLSL is an array of structures:
struct {
float x, y, z, w;
} number[64];
While the way the GPU actually operates is kind of a transpose of this - a structure of arrays:
struct {
float x[64], y[64], z[64], w[64];
} number;
It comes with an interesting implication if you consider the condition we evaluate before the addition. Please note that we write our shader as if we calculated just a single vertex or pixel, without even having to know that 64 of them will execute together in a vector manner. It means we have 64 Xs, Ys, Zs, and Ws. The X component of each pixel may or may not be less than 0, meaning that for each SIMD lane the condition may be fulfilled or not. So the boolean variable needsIncrease
inside the GPU is not a scalar, but also a vector, having 64 individual boolean values - one for each pixel! Each pixel may want to enter the if
clause or skip it. That's what we call Single Instruction Multiple Threads (SIMT), and that's how real modern GPUs operate. How is it implemented if some threads want to do if
and others want to do else
? That's a different story...