Tag: gpu

Entries for tag "gpu", ordered from most recent. Entry count: 21.

# First Look at New D3D12 Enhanced Barriers

Dec 2021

This will be a fairly advanced, or at least intermediate, article. It assumes you know the Direct3D 12 API. Some references to Vulkan may also appear. I am writing it because I just found out that yesterday Microsoft announced an upcoming big change in D3D12: Enhanced Barriers. It will be an addition to the API that provides a new way to do barriers. Considering my professional interests, this looks very important to me and also quite revolutionary. This article summarizes my first look at this new addition to the API and my thoughts about it or, speaking in terms of the modern internet, my "unboxing" or "reaction" ;)

Bill Kristiansen, the author of the article linked above, wrote that currently only the software-simulated WARP device supports the new enhanced barriers. Support in real GPU drivers will come at a later time. The new barriers can replace the old way of doing them, but both will still be available and can also be mixed in one application. This means it is not so big a revolution as to turn our DirectX development upside down - we can switch to them gradually. For now we can just prepare ourselves for the future by studying the interface (which I do in this article) and testing some code using the WARP device.

UPDATE 2021-12-10: I just learned that Microsoft actually did publish documentation of the new API: Enhanced Barriers @ DirectX-Specs, so I recommend reading it before reading this article.

#directx #vulkan #gpu

# Creative Use of GPU Fixed-Function Hardware

Sep 2021

I recently broke my rule of posting on my blog at least once a month, as I had some other topics and problems to handle in my life, but I'm still alive, still doing graphics programming for a living, so I hope to get back to blogging now. This post is more of a question than an answer. It is about creative use of GPU fixed-function hardware. Warning: It may be pretty difficult for beginners, as it is full of graphics programming terms you should already know to understand it. But first, here is some background:

I remember the times when graphics cards were only configurable, not programmable. There were no shaders, only a set of parameters that could control pre-defined operations - transform of vertices, texturing and lighting of pixels. Then, shaders appeared. They evolved by supporting more instructions to be executed and a wider variety of instructions available. At some point, even before the invention of compute shaders, the term “general-purpose computing on GPU” (GPGPU) appeared. Developers started encoding some data as RGBA colors of texture pixels and drawing full-screen quads just to launch calculation of some non-graphical tasks, implemented as pixel shaders. Soon after, compute shaders appeared, so developers no longer need to pretend anything - they can now spawn a set of threads that can just read and write memory freely through Direct3D unordered access views, aka Vulkan storage images and buffers.

GPUs seem to become more universal over time, with more and more workloads done as compute shaders these days. Will we end up with some generic, highly parallel compute machines with no fixed-function hardware? I don’t know. But the Nanite technology from the new Unreal Engine 5 makes a step in this direction by implementing its own rasterizer for some of its triangles, in the form of a compute shader. I recommend a good article about it: “A Macro View of Nanite – The Code Corsair” (it seems the link is broken already - here is a copy on Wayback Machine Internet Archive). Apparently, for tiny triangles of around a single pixel in size, custom rasterization is faster than what GPUs provide by default.

But in the same article we can read that Epic also does something opposite in Nanite: they use some fixed-function parts of the graphics pipeline very creatively. When applying materials in screen space, they render a full-screen pass for each material, but instead of drawing just a full-screen triangle, they draw a regular triangle grid with quads covering tiles of NxN pixels. They then perform a coarse-grained culling of these tiles in a vertex shader. In order to reject a tile, they output vertex position = NaN, which makes the triangle invalid, so it doesn't spawn any pixels. Then, a more fine-grained culling is performed using the Z-test. The per-pixel material identifier is encoded as depth in a depth buffer! This can be fast, as modern GPUs apply “HiZ” - an internal optimization that rejects whole groups of pixels that fail the Z-test even before their pixel shaders are launched.

This reminded me of another creative use of the graphics pipeline I observed in one game a few years ago. That pass was calculating a luminance histogram of a scene. They also rendered a regular grid of geometry in screen space, but with “point list” topology. Each vertex was sampling and calculating the average luminance of its region. On the other end, the histogram texture of Nx1 pixels was bound as a render target. The measured luminance of a region was returned as the vertex position, while incrementing the specific place in the histogram was ensured using additive blending. I suspect this is not the most optimal way of doing this, as a compute shader using atomics could probably do it faster, but it surely was very creative and it took me some time to figure out what that pass is really doing and how it is doing it.
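
As a rough CPU-side model of the histogram pass described above (a sketch, not the game's actual shader code), the C++ below treats each grid vertex as one region with an already averaged luminance, maps that luminance to one bin of an N-bin histogram the way the vertex position selected a pixel of the Nx1 render target, and uses `+=` to stand in for additive blending. The function name, the [0, 1] luminance range, and the linear bin mapping are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: one entry of regionLuminance per grid "vertex",
// holding the average luminance of its screen region, assumed to be in [0, 1].
std::vector<float> buildLuminanceHistogram(const std::vector<float>& regionLuminance,
                                           std::size_t binCount)
{
    // The Nx1 render target, cleared to zero.
    std::vector<float> histogram(binCount, 0.0f);
    if (binCount == 0)
        return histogram;

    for (float luminance : regionLuminance) {
        // "Vertex shader": turn the luminance into a position on the Nx1 target.
        const float clamped = std::min(std::max(luminance, 0.0f), 1.0f);
        const std::size_t scaled =
            static_cast<std::size_t>(clamped * static_cast<float>(binCount));
        const std::size_t bin = std::min(scaled, binCount - 1);

        // "Additive blending": every point adds 1 to the pixel it lands on.
        histogram[bin] += 1.0f;
    }
    return histogram;
}
```

On a real GPU the accumulation is performed by the blending hardware, which is exactly the fixed-function part being exploited; the loop here only mirrors the data flow.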

After all, GPUs have many fixed-function elements next to their shader cores. Vertex fetch, texture sampling (with mip level calculation, trilinear and anisotropic filtering), tessellation, rasterization, blending, all kinds of primitive culling and pixel testing, even vertex homogeneous divide... Although not included in the calculation of TFLOPS power, these are real transistors with compute capabilities, just very specialized. Do you know any other smart, creative uses of them?

#rendering #optimization #gpu

# A Better Way to Scalarize a Shader

Oct 2020

This will be an advanced article. It assumes you not only know how to write shaders but also how they work on a low level (like vector versus scalar registers) and how to optimize them using scalarization. It all starts from a need to index into an array of texture or buffer descriptors, where the index is dynamic – it may vary from pixel to pixel. This is useful e.g. when doing bindless-style rendering or blending various layers of textures, e.g. on a terrain. To make it work properly in an HLSL shader, you need to surround the indexing operation with a pseudo-function NonUniformResourceIndex. See also my old blog post “Direct3D 12 - Watch out for non-uniform resource index!”.

Texture2D g_Textures[] : register(t1);
return g_Textures[NonUniformResourceIndex(textureIndex)].Load(pos);

In many cases, it is enough. The driver will do its magic to make things work properly. But if your logic dependent on textureIndex is more complex than a single Load or SampleGrad, e.g. you sample multiple textures or do some calculations (let's call it MyDynamicTextureIndexing), then it might be beneficial to scalarize the shader manually using a loop and wave functions from HLSL Shader Model 6.0.

I learned how to do scalarization from the 2-part article “Intro to GPU Scalarization” by Francesco Cifariello Ciardi and the presentation “Improved Culling for Tiled and Clustered Rendering” by Michał Drobot, linked from it. Both sources propose an implementation like the following HLSL snippet:

float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
// Bit mask with only the current lane's bit set (64 lanes split into two uints).
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
// Loop as long as the current lane is still waiting for its turn.
while(any((currThreadMask & activeThreadsMask) != 0)) {
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   uint2 scalarTextureIndexThreadMask = WaveActiveBallot(scalarTextureIndex == textureIndex).xy;
   activeThreadsMask &= ~scalarTextureIndexThreadMask;
   if(scalarTextureIndex == textureIndex) {
       color = MyDynamicTextureIndexing(textureIndex);
   }
}
return color;

It involves a bit mask of active threads. From the moment I first saw this code, I started wondering: Why is it needed? A mask of threads that still want to continue spinning the loop is already maintained implicitly by the shader compiler. Couldn't we just break; from the loop when done with the textureIndex of the current thread?! So I wrote this short piece of code:

// WARNING: THIS CODE IS BAD!
float4 color = float4(0.0, 0.0, 0.0, 0.0);
while(true) {
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   if(scalarTextureIndex == textureIndex) {
       color = MyDynamicTextureIndexing(textureIndex);
       break;
   }
}
return color;

…and it crashed my GPU. At first I thought it might be a bug in the shader compiler, but then I recalled footnote [2] in part 2 of the scalarization tutorial, which mentions an issue with helper lanes. Let me elaborate on this. When a shader is executed in SIMT fashion, individual threads (lanes) may be active or inactive. Active lanes are those that do their job. Inactive lanes may be inactive from the very beginning, because we are at the edge of a triangle and there are not enough pixels to make use of all the lanes, or they may be disabled temporarily, e.g. because we are executing an if section that some threads didn't want to enter. But in pixel shaders there is a third kind of lanes – helper lanes. These are used instead of inactive lanes to make sure full 2x2 quads always execute the code, which is needed to calculate derivatives ddx/ddy, also done implicitly when sampling a texture to calculate the correct mip level. A helper lane executes the code (like an active lane), but doesn't export its result to the render target (like an inactive lane).

As it turns out, helper lanes also don't contribute to wave functions – they work like inactive lanes. Can you already see the problem here? In the loop shown above, it may happen that a helper lane has its textureIndex different from that of any active lane within the wave. It will then never get its turn to process it in a scalar fashion, so it will fall into an infinite loop, causing a GPU crash (TDR)!

Then I thought: What if I disable helper lanes just once, before the whole loop? So I came up with the following shader. It seems to work fine. I also think it is better than the first solution, as it operates on the thread bit mask only once at the beginning, so it needs fewer variables stored in GPU registers and does fewer calculations in every loop iteration. Now I wonder: is there something wrong with my idea that I can't see? Or did I just invent a better way to scalarize shaders?

float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
// Helper lanes are not reported by the ballot, so they skip the loop entirely.
if(any((currThreadMask & activeThreadsMask) != 0)) {
   while(true) {
       uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
       if(scalarTextureIndex == textureIndex) {
           color = MyDynamicTextureIndexing(textureIndex);
           break;
       }
   }
}
return color;
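
To make the helper-lane problem easier to see, here is a small CPU-side model of the "loop and break" approach, written in C++. It is only a sketch under stated assumptions: the `Lane` struct, the `simulateScalarizationLoop` function, and the iteration limit are all hypothetical, and the model assumes that a wave function sees no value at all once only helper lanes remain in the loop (on real hardware the result in that situation is not something to rely on).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of one lane of a pixel-shader wave (not a real GPU API).
struct Lane {
    std::uint32_t textureIndex;
    bool isHelper; // executes the code, but is invisible to wave functions
};

// Simulates the "while(true) ... break" scalarization loop.
// Returns the number of iterations after which every lane has left the loop,
// or -1 if some lane is still spinning after maxIterations
// (on a real GPU that would be a hang ending in a TDR).
int simulateScalarizationLoop(const std::vector<Lane>& lanes,
                              bool skipHelperLanes,
                              int maxIterations = 1000)
{
    // Lanes that are still inside the loop. With skipHelperLanes, helper lanes
    // never enter it, like the ballot check done once before the loop.
    std::vector<bool> looping(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i)
        looping[i] = !(skipHelperLanes && lanes[i].isHelper);

    for (int iteration = 0; iteration < maxIterations; ++iteration) {
        // Model of WaveReadLaneFirst: value of the first looping non-helper lane.
        bool anyLooping = false;
        bool hasScalarIndex = false;
        std::uint32_t scalarIndex = 0;
        for (std::size_t i = 0; i < lanes.size(); ++i) {
            if (!looping[i])
                continue;
            anyLooping = true;
            if (!lanes[i].isHelper && !hasScalarIndex) {
                scalarIndex = lanes[i].textureIndex;
                hasScalarIndex = true;
            }
        }
        if (!anyLooping)
            return iteration;

        // Lanes whose index matches do their work and break out of the loop.
        // Assumption: if only helper lanes remain, nothing ever matches.
        for (std::size_t i = 0; i < lanes.size(); ++i)
            if (looping[i] && hasScalarIndex && lanes[i].textureIndex == scalarIndex)
                looping[i] = false;
    }
    return -1;
}
```

For a wave with active lanes using indices 3, 3, 5 and one helper lane using index 7, the model never terminates when helper lanes take part in the loop, and finishes after two iterations when they are excluded up front.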

UPDATE 2020-10-28: There are some valuable comments under my tweet about this topic that I recommend checking out.

UPDATE 2021-12-03: As my colleague pointed out today, the code I showed above as "BAD" is perfectly fine for compute shaders. Only in pixel shaders do we have problems with helper lanes. Thank you Steffen!

#directx #optimization #gpu

# Which Values Are Scalar in a Shader?

Oct 2020

GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.

Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.

But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correct to say SIMT. For an explanation of the difference, see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, or 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.

Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow exchanging data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may let you optimize shader performance.

This additional level of grouping makes it possible for a variable to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call such a variable scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveReadLaneFirst, WaveActiveMax, WaveActiveAllTrue return the same value for all threads, so their result is always scalar.

Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing an array of buffers or textures requires the special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on the shader compiler at Intel, helping to finish the DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.

Another such case is the function WaveReadLaneAt, which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered that it actually works fine to use a dynamically varying value for it, as Ken Hu described in his blog post “HLSL pitfalls”. Now Microsoft has formally admitted that it works and allowed LaneIndex to be any value by making this GitHub commit to their documentation.

If it is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.

SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search the specifications for this topic, but it seems obvious that if groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.

SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.

SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. The article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with known, fixed hardware, but it is not formally correct to assume this on any PC. Neither the D3D nor the Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a new version of a driver may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU that is 128 lanes wide couldn’t appear in the future. The documentation of the WaveGetLaneCount function says “the result will be between 4 and 128”. Such a GPU would execute the entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.
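
The wave-width part of this argument can be checked with a few lines of C++. This is only a sketch: the function name is made up, and it deliberately assumes row-major packing of threads into waves, which is exactly the thing the specifications do not promise. Under that assumption it tells whether every wave of a thread group would see a single value of SV_GroupThreadID.y.

```cpp
#include <algorithm>
#include <cstdint>

// Returns true if SV_GroupThreadID.y would have one value per wave, ASSUMING
// threads of a sizeX x sizeY x 1 group are packed into waves in row-major
// order (an assumption of this sketch, not a D3D or Vulkan guarantee).
bool isGroupThreadIdYScalar(std::uint32_t sizeX, std::uint32_t sizeY,
                            std::uint32_t waveSize)
{
    if (sizeX == 0 || sizeY == 0 || waveSize == 0)
        return false;

    const std::uint32_t threadCount = sizeX * sizeY;
    for (std::uint32_t first = 0; first < threadCount; first += waveSize) {
        // Flattened index of the last thread that lands in this wave.
        const std::uint32_t last = std::min(first + waveSize, threadCount) - 1;
        // With row-major packing, y of a flattened index is index / sizeX.
        if (first / sizeX != last / sizeX)
            return false;
    }
    return true;
}
```

For [numthreads(64,2,1)], this model reports a scalar y for waves of 32 or 64 lanes and a non-scalar y for a 128-lane wave. It says nothing about a driver that packs threads in some other order, such as 16x2x1.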

Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.

#gpu #directx #vulkan #optimization

# System Value Semantics in Compute Shaders - Cheat Sheet

Sep 2020

After compute shaders appeared, programmers no longer need to pretend they do graphics and render pixels when they want to do some general-purpose computations on a GPU (GPGPU). They can just dispatch a shader that reads and writes memory in a custom way. Such a shader is a short (or not so short) program to be invoked thousands or millions of times to process a piece of data. To work correctly, it needs to know which thread it is currently running as. Threads (invocations) of a compute shader are not just indexed linearly as 0, 1, 2, ... It's more complex than that. Their indexing can use up to 3 dimensions, which simplifies operation on some data like images or matrices. They also come in groups, with the number of threads in one group declared statically as part of the shader code and the number of groups to execute passed dynamically in CPU code when dispatching the shader.

This raises the question of how to identify the current thread. HLSL offers a number of system-value semantics for this purpose and so does GLSL by defining equivalent built-in variables. For a long time I couldn't remember their names, as the ones in HLSL are quite misleading. If GroupID is an ID of the entire group, and GroupThreadID is an ID of the thread within a group, GroupIndex should be a flattened index of the entire group, right? Wrong! It's actually an index of a single thread within a group. GLSL is more consistent in this regard, clearly stating "WorkGroup" versus "Invocation" and "Local" versus "Global". So, although Microsoft provides a great explanation of their SVs with a picture on pages like SV_DispatchThreadID, I thought it would be nice to gather all this in the form of a table, a small cheat sheet:

| HLSL Semantics | GLSL Variable | Type (Dimension) | Unit | Reference |
|---|---|---|---|---|
| SV_GroupID | gl_WorkGroupID | uint3 (3D) | Entire group | Global in dispatch |
| SV_GroupThreadID | gl_LocalInvocationID | uint3 (3D) | Single thread | Local in group |
| SV_DispatchThreadID | gl_GlobalInvocationID | uint3 (3D) | Single thread | Global in dispatch |
| SV_GroupIndex | gl_LocalInvocationIndex | uint (flattened) | Single thread | Local in group |
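
The relationships between these values can also be written down as plain arithmetic. The C++ below is a small sketch of how the two derived values follow from the group ID, the thread ID within the group, and the numthreads declaration, as described in Microsoft's documentation of these semantics. The `UInt3` struct and the function names are made up for this illustration.

```cpp
#include <cstdint>

// Minimal stand-in for HLSL uint3 (hypothetical helper type).
struct UInt3 {
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per component.
UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numThreads)
{
    return {groupId.x * numThreads.x + groupThreadId.x,
            groupId.y * numThreads.y + groupThreadId.y,
            groupId.z * numThreads.z + groupThreadId.z};
}

// SV_GroupIndex = z * X * Y + y * X + x, where (X, Y, Z) is numthreads
// and (x, y, z) is SV_GroupThreadID.
std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numThreads)
{
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x +
           groupThreadId.x;
}
```

For example, with numthreads(10, 8, 3), group (2, 1, 0) and thread (7, 5, 0) within that group, the dispatch thread ID is (27, 13, 0) and the group index is 57.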

#gpu #directx #opengl #vulkan

# How Do Graphics Cards Execute Vector Instructions?

Jan 2020

Intel announced that together with their new graphics architecture they will provide a new API, called oneAPI, that will allow programming GPUs, CPUs, and even FPGAs in a unified way, and will support SIMD as well as SIMT mode. If you are not sure what that means but you want to be prepared for it, read this article. Here I try to explain concepts like SIMD, SIMT, AoS, SoA, and vector instruction execution on CPU and GPU. I think it may be interesting to you as a programmer even if you don't write shaders or GPU computations. Also, don't worry if you don't know any assembly language - the examples below are simple and should be understandable to you anyway. Below I will show three examples:

1. CPU, scalar

Let's say we write a program that operates on a numerical value. The value comes from somewhere and before we pass it on for further processing, we want to execute the following logic: if it's negative (less than zero), increase it by 1. In C++ it may look like this:

float number = ...;
bool needsIncrease = number < 0.0f;
if(needsIncrease)
  number += 1.0f;

If you compile this code in Visual Studio 2019 for the 64-bit x86 architecture, you may get the following assembly (with comments after the semicolons added by me):

00007FF6474C1086 movss  xmm1,dword ptr [number]   ; xmm1 = number
00007FF6474C108C xorps  xmm0,xmm0                 ; xmm0 = 0
00007FF6474C108F comiss xmm0,xmm1                 ; compare xmm0 with xmm1, set flags
00007FF6474C1092 jbe    main+32h (07FF6474C10A2h) ; jump to 07FF6474C10A2 depending on flags
00007FF6474C1094 addss  xmm1,dword ptr [__real@3f800000 (07FF6474C2244h)]  ; xmm1 += 1
00007FF6474C109C movss  dword ptr [number],xmm1   ; number = xmm1
00007FF6474C10A2 ...

There is nothing special here, just normal CPU code. Each instruction operates on a single value.

2. CPU, vector

Some time ago vector instructions were introduced to CPUs. They allow operating on many values at a time, not just a single one. For example, the CPU vector extension called Streaming SIMD Extensions (SSE) is accessible in Visual C++ using data types like __m128 (which can store a 128-bit value representing e.g. 4x 32-bit floating-point numbers) and intrinsic functions like _mm_add_ps (which can add two such variables per-component, outputting a new vector of 4 floats as a result). We call this approach Single Instruction Multiple Data (SIMD), because one instruction operates not on a single numerical value, but on a whole vector of such values in parallel.

Let's say we want to implement the following logic: given some vector (x, y, z, w) of 4x 32-bit floating-point numbers, if its first component (x) is less than zero, increase the whole vector per-component by (1, 2, 3, 4). In Visual C++ we can implement it like this:

const float constant[] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 number = ...;
float x; _mm_store_ss(&x, number); // extract the first component
bool needsIncrease = x < 0.0f;
if(needsIncrease)
  number = _mm_add_ps(number, _mm_loadu_ps(constant));

This gives the following assembly:

00007FF7318C10CA  comiss xmm0,xmm1  ; compare xmm0 with xmm1, set flags
00007FF7318C10CD  jbe    main+69h (07FF7318C10D9h)  ; jump to 07FF7318C10D9 depending on flags
00007FF7318C10CF  movaps xmm5,xmmword ptr [__xmm@(...) (07FF7318C2250h)]  ; xmm5 = (1, 2, 3, 4)
00007FF7318C10D6  addps  xmm5,xmm1  ; xmm5 = xmm5 + xmm1
00007FF7318C10D9  movaps xmm0,xmm5  ; xmm0 = xmm5

This time xmm registers are used to store not just single numbers, but vectors of 4 floats. A single instruction - addps (as opposed to addss used in the previous example) adds 4 numbers from xmm1 to 4 numbers in xmm5.

It may seem obvious, but it's important for future considerations to note that the condition here and the boolean variable driving it (needsIncrease) is not a vector, but a single value, calculated based on the first component of vector number. Such a single value in the SIMD world is also called a "scalar". Based on it, the condition is true or false and the branch is taken or not, so either the whole vector is increased by (1, 2, 3, 4), or nothing happens. This is how CPUs work, because we execute just one program, with one thread, which has one instruction pointer to execute its instructions sequentially.

3. GPU

Now let's move on from the CPU world to the world of the graphics processor (GPU). GPUs are programmed in different languages. One of them is GLSL, used in the OpenGL and Vulkan graphics APIs. In this language there is also a data type that holds 4x 32-bit floating-point numbers, called vec4. You can add a vector to a vector per-component using just the '+' operator.

The same logic as in section 2, implemented in GLSL, looks like this:

vec4 number = ...;
bool needsIncrease = number.x < 0.0;
if(needsIncrease)
  number += vec4(1.0, 2.0, 3.0, 4.0);

When you compile a shader with such code for an AMD GPU, you may see the following GPU assembly. (For offline shader compilation I used Radeon GPU Analyzer (RGA) - a free tool from AMD.)

v_add_f32      v5, 1.0, v2      ; v5 = v2 + 1
v_add_f32      v1, 2.0, v3      ; v1 = v3 + 2
v_cmp_gt_f32   vcc, 0, v2       ; compare v2 with 0, set flags
v_cndmask_b32  v2, v2, v5, vcc  ; override v2 with v5 depending on flags
v_add_f32      v5, lit(0x40400000), v4  ; v5 = v4 + 3
v_cndmask_b32  v1, v3, v1, vcc  ; override v1 with v3 depending on flags
v_add_f32      v3, 4.0, v0      ; v3 = v0 + 4
v_cndmask_b32  v4, v4, v5, vcc  ; override v4 with v5 depending on flags
v_cndmask_b32  v3, v0, v3, vcc  ; override v3 with v0 depending on flags

You can see something interesting here: Although the high-level shader language is vector-based, the actual GPU assembly operates on individual vector components (x, y, z, w) using separate instructions and stores their values in separate registers like (v2, v3, v4, v0). Does it mean GPUs don't support vector instructions?!

Actually, they do, but differently. First GPUs from decades ago (right after they became programmable with shaders) really operated on those vectors in the way we see them. Nowadays, it's true that what we treat as vector components (x, y, z, w) or color components (R, G, B, A) in the shaders we write, becomes separate values. But GPU instructions are still vector, as denoted by their prefix "v_". The SIMD in GPUs is used to process not a single vertex or pixel, but many of them (e.g. 64) at once. It means that a single register like v2 stores 64x 32-bit numbers and a single instruction like v_add_f32 adds per-component 64 of such numbers - just Xs or Ys or Zs or Ws, one for each pixel calculated in a separate SIMD lane.

Some people call it Structure of Arrays (SoA) as opposed to Array of Structures (AoS). These terms come from imagining how the data structure, as stored in memory, could be defined. If we were to define such a data structure in C, the way we see it when programming in GLSL is an array of structures:

struct {
  float x, y, z, w;
} number[64];

While the way the GPU actually operates is kind of a transpose of this - a structure of arrays:

struct {
  float x[64], y[64], z[64], w[64];
} number;

It comes with an interesting implication if you consider the condition we do before the addition. Please note that we write our shader as if we calculated just a single vertex or pixel, without even having to know that 64 of them will execute together in a vector manner. It means we have 64 Xs, Ys, Zs, and Ws. The X component of each pixel can be less or not less than 0, meaning that for each SIMD lane the condition may be fulfilled or not. So boolean variable needsIncrease inside the GPU is not a scalar, but also a vector, having 64 individual boolean values - one for each pixel! Each pixel may want to enter the if clause or skip it. That's what we call Single Instruction Multiple Threads (SIMT), and that's how real modern GPUs operate. How is it implemented if some threads want to do if and others want to do else? That's a different story...
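
The per-lane condition can be imitated on the CPU. The following C++ is only a sketch of the idea, not of how any particular GPU or driver implements it: the lane count of 64, the `NumberSoA` struct, and the function name are assumptions made for illustration. It stores the data as a structure of arrays, computes needsIncrease as one boolean per lane, and then selects between the old and the incremented value per lane, similar in spirit to the v_cmp / v_cndmask pairs in the assembly shown earlier.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLaneCount = 64; // assumed wave width for this sketch

// Structure of arrays: one "register" of 64 values per vector component.
struct NumberSoA {
    std::array<float, kLaneCount> x{}, y{}, z{}, w{};
};

// Applies "if (number.x < 0) number += (1, 2, 3, 4)" to all lanes at once.
void conditionalIncrease(NumberSoA& number)
{
    // The condition is a vector too: one boolean per lane.
    std::array<bool, kLaneCount> needsIncrease{};
    for (std::size_t lane = 0; lane < kLaneCount; ++lane)
        needsIncrease[lane] = number.x[lane] < 0.0f;

    // Each component is processed separately; the mask selects, lane by lane,
    // between the incremented and the original value.
    for (std::size_t lane = 0; lane < kLaneCount; ++lane) {
        number.x[lane] = needsIncrease[lane] ? number.x[lane] + 1.0f : number.x[lane];
        number.y[lane] = needsIncrease[lane] ? number.y[lane] + 2.0f : number.y[lane];
        number.z[lane] = needsIncrease[lane] ? number.z[lane] + 3.0f : number.z[lane];
        number.w[lane] = needsIncrease[lane] ? number.w[lane] + 4.0f : number.w[lane];
    }
}
```

The inner loops over lanes stand in for what the hardware does in a single vector instruction; only lanes whose x is negative end up with changed values.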

#gpu #rendering

# Differences in memory management between Direct3D 12 and Vulkan

Jul 2019

Since July 2017 I have been developing Vulkan Memory Allocator (VMA) – a C++ library that helps with memory management in games and other applications using Vulkan. But because I deal with both Vulkan and DirectX 12 in my everyday work, I think it’s a good idea to compare them.

This is an article about a very specific topic. It may be useful to you if you are a programmer working with both graphics APIs – Direct3D 12 and Vulkan. These two APIs offer a similar set of features and performance. Both are new-generation, explicit, low-level interfaces to modern graphics hardware (GPUs), so we could compare them back-to-back to show similarities and differences, e.g. in naming things. For example, the ID3D12CommandQueue::ExecuteCommandLists function has a Vulkan equivalent in the form of the vkQueueSubmit function. However, this article focuses on just one aspect – memory management, which means the rules and limitations of GPU memory allocation and the creation of resources – images (textures, render targets, depth-stencil surfaces etc.) and buffers (vertex buffers, index buffers, constant/uniform buffers etc.). The chapters below describe pretty much all the aspects of memory management that differ between the two APIs.

#vulkan #directx #gpu

# Programming FreeSync 2 support in Direct3D

Mar 2019

AMD just showed the Oasis demo, presenting usage of its FreeSync 2 HDR technology. If you wonder how you could implement the same features in your Windows DirectX program or game (it doesn’t matter if you use D3D11 or D3D12), here is an article for you.

But first, a disclaimer: Although I already put it on my “About” page, I’d like to stress that this is my personal blog, so all opinions presented here are my own and do not reflect those of my employer.

Radeon FreeSync (its new, official web page is here: Radeon™ FreeSync™ Technology | FreeSync™ 2 HDR Games) is an AMD technology that covers two different things, which may cause some confusion. First is variable refresh rate, second is HDR. Both of them need to be supported by a monitor. The database of FreeSync compatible monitors and their parameters is: Freesync Monitors.

#gpu #directx #windows #graphics

Copyright © 2004-2021