Tag: gpu

Entries for tag "gpu", ordered from most recent. Entry count: 24.

Pages: > 1 2 3 >

# How Do Graphics Cards Execute Vector Instructions?

Sun, 19 Jan 2020

Intel announced that, together with their new graphics architecture, they will provide a new API called oneAPI that will allow programming GPUs, CPUs, and even FPGAs in a unified way, and will support both SIMD and SIMT modes. If you are not sure what that means but want to be prepared for it, read this article. Here I try to explain concepts like SIMD, SIMT, AoS, SoA, and vector instruction execution on the CPU and GPU. I think it may interest you as a programmer even if you don't write shaders or GPU computations. Also, don't worry if you don't know any assembly language - the examples below are simple and should be understandable anyway. Below I will show three examples:

1. CPU, scalar

Let's say we write a program that operates on a numerical value. The value comes from somewhere, and before we pass it on for further processing, we want to execute the following logic: if it's negative (less than zero), increase it by 1. In C++ it may look like this:

float number = ...;
bool needsIncrease = number < 0.0f;
if(needsIncrease)
 number += 1.0f;

If you compile this code in Visual Studio 2019 for the 64-bit x86 architecture, you may get the following assembly (with comments after the semicolon added by me):

00007FF6474C1086 movss  xmm1,dword ptr [number]   ; xmm1 = number
00007FF6474C108C xorps  xmm0,xmm0                 ; xmm0 = 0
00007FF6474C108F comiss xmm0,xmm1                 ; compare xmm0 with xmm1, set flags
00007FF6474C1092 jbe    main+32h (07FF6474C10A2h) ; jump to 07FF6474C10A2 depending on flags
00007FF6474C1094 addss  xmm1,dword ptr [__real@3f800000 (07FF6474C2244h)]  ; xmm1 += 1
00007FF6474C109C movss  dword ptr [number],xmm1   ; number = xmm1
00007FF6474C10A2 ...

There is nothing special here, just normal CPU code. Each instruction operates on a single value.

2. CPU, vector

Some time ago vector instructions were introduced to CPUs. They make it possible to operate on many values at a time, not just a single one. For example, the CPU vector extension called Streaming SIMD Extensions (SSE) is accessible in Visual C++ using data types like __m128 (which can store a 128-bit value representing e.g. 4x 32-bit floating-point numbers) and intrinsic functions like _mm_add_ps (which can add two such variables per-component, outputting a new vector of 4 floats as a result). We call this approach Single Instruction Multiple Data (SIMD), because one instruction operates not on a single numerical value, but on a whole vector of such values in parallel.

Let's say we want to implement the following logic: given some vector (x, y, z, w) of 4x 32-bit floating-point numbers, if its first component (x) is less than zero, increase the whole vector per-component by (1, 2, 3, 4). In Visual C++ we can implement it like this:

const float constant[] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 number = ...;
float x; _mm_store_ss(&x, number);
bool needsIncrease = x < 0.0f;
if(needsIncrease)
 number = _mm_add_ps(number, _mm_loadu_ps(constant));

This gives the following assembly:

00007FF7318C10CA  comiss xmm0,xmm1  ; compare xmm0 with xmm1, set flags
00007FF7318C10CD  jbe    main+69h (07FF7318C10D9h)  ; jump to 07FF7318C10D9 depending on flags
00007FF7318C10CF  movaps xmm5,xmmword ptr [__xmm@(...) (07FF7318C2250h)]  ; xmm5 = (1, 2, 3, 4)
00007FF7318C10D6  addps  xmm5,xmm1  ; xmm5 = xmm5 + xmm1
00007FF7318C10D9  movaps xmm0,xmm5  ; xmm0 = xmm5

This time xmm registers are used to store not just single numbers, but vectors of 4 floats. A single instruction, addps (as opposed to addss used in the previous example), adds the 4 numbers in xmm1 to the 4 numbers in xmm5.

It may seem obvious, but it's important for future considerations to note that the boolean variable driving the condition here (needsIncrease) is not a vector, but a single value, calculated from the first component of the vector number. Such a single value in the SIMD world is also called a "scalar". Based on it, the condition is true or false and the branch is taken or not, so either the whole vector is increased by (1, 2, 3, 4), or nothing happens. This is how CPUs work, because we execute just one program, with one thread, which has one instruction pointer to execute its instructions sequentially.

3. GPU

Now let's move on from the CPU world to the world of graphics processors (GPUs). Those are programmed in different languages. One of them is GLSL, used in the OpenGL and Vulkan graphics APIs. In this language there is also a data type that holds 4x 32-bit floating-point numbers, called vec4. You can add a vector to a vector per-component using just the '+' operator.

The same logic as in section 2, implemented in GLSL, looks like this:

vec4 number = ...;
bool needsIncrease = number.x < 0.0;
if(needsIncrease)
 number += vec4(1.0, 2.0, 3.0, 4.0);

When you compile a shader with such code for an AMD GPU, you may see the following GPU assembly. (For offline shader compilation I used Radeon GPU Analyzer (RGA) - a free tool from AMD.)

v_add_f32      v5, 1.0, v2      ; v5 = v2 + 1
v_add_f32      v1, 2.0, v3      ; v1 = v3 + 2
v_cmp_gt_f32   vcc, 0, v2       ; compare v2 with 0, set flags
v_cndmask_b32  v2, v2, v5, vcc  ; override v2 with v5 depending on flags
v_add_f32      v5, lit(0x40400000), v4  ; v5 = v4 + 3
v_cndmask_b32  v1, v3, v1, vcc  ; override v1 with v3 depending on flags
v_add_f32      v3, 4.0, v0      ; v3 = v0 + 4
v_cndmask_b32  v4, v4, v5, vcc  ; override v4 with v5 depending on flags
v_cndmask_b32  v3, v0, v3, vcc  ; override v3 with v0 depending on flags

You can see something interesting here: although the high-level shader language is vector-based, the actual GPU assembly operates on individual vector components (x, y, z, w) using separate instructions and stores their values in separate registers like (v2, v3, v4, v0). Does it mean GPUs don't support vector instructions?!

Actually, they do, but differently. The first programmable GPUs, from decades ago (right after shaders were introduced), really operated on those vectors the way we see them. Nowadays, it's true that what we treat as vector components (x, y, z, w) or color components (R, G, B, A) in the shaders we write becomes separate values. But GPU instructions are still vector, as denoted by their prefix "v_". The SIMD in GPUs is used to process not a single vertex or pixel, but many of them (e.g. 64) at once. It means that a single register like v2 stores 64x 32-bit numbers and a single instruction like v_add_f32 adds 64 such numbers at once - just Xs or Ys or Zs or Ws, one for each pixel calculated in a separate SIMD lane.

Some people call it Structure of Arrays (SoA) as opposed to Array of Structures (AoS). The term comes from imagining how the data structure, as stored in memory, could be defined. If we were to define such a data structure in C, the way we see it when programming in GLSL is an array of structures:

struct {
  float x, y, z, w;
} number[64];

While the way the GPU actually operates is kind of a transpose of this - a structure of arrays:

struct {
  float x[64], y[64], z[64], w[64];
} number;
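
To make the "transpose" concrete, here is a minimal CPU-side sketch in C++ that converts one layout into the other. It only illustrates the two layouts; the names Vec4, NumberAoS, NumberSoA, and toSoA are hypothetical and chosen for this example, and a real GPU does not perform such a conversion at run time, since the compiler simply assigns each component to its own vector register.

```cpp
#include <array>
#include <cstddef>

// Number of SIMD lanes, matching the "e.g. 64" used in the text.
constexpr std::size_t kLaneCount = 64;

// Array of structures: one 4-component value per pixel, as the shader author sees it.
struct Vec4 {
    float x, y, z, w;
};
using NumberAoS = std::array<Vec4, kLaneCount>;

// Structure of arrays: one array per component, similar to how
// each vector register holds one component for all lanes.
struct NumberSoA {
    std::array<float, kLaneCount> x, y, z, w;
};

// Hypothetical helper: transposes the AoS layout into the SoA layout.
NumberSoA toSoA(const NumberAoS& aos) {
    NumberSoA soa{};
    for (std::size_t lane = 0; lane < kLaneCount; ++lane) {
        soa.x[lane] = aos[lane].x;
        soa.y[lane] = aos[lane].y;
        soa.z[lane] = aos[lane].z;
        soa.w[lane] = aos[lane].w;
    }
    return soa;
}
```

After the conversion, soa.x plays the role of a register like v2 in the listing above: 64 X values, one per lane.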

It comes with an interesting implication if you consider the condition we check before the addition. Please note that we write our shader as if we calculated just a single vertex or pixel, without even having to know that 64 of them will execute together in a vector manner. It means we have 64 Xs, Ys, Zs, and Ws. The X component of each pixel can be less or not less than 0, meaning that for each SIMD lane the condition may be fulfilled or not. So the boolean variable needsIncrease inside the GPU is not a scalar, but also a vector, having 64 individual boolean values - one for each pixel! Each pixel may want to enter the if clause or skip it. That's what we call Single Instruction Multiple Threads (SIMT), and that's how real modern GPUs operate. How is it implemented if some threads want to do if and others want to do else? That's a different story...
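
The per-lane condition can be emulated in plain C++ to show what the compare-and-select pattern from the GPU listing does for the X component. This is only a sketch under stated assumptions: the lanes run in a sequential loop here instead of in parallel, the wave size of 64 is taken from the text, and all names (VectorRegister, LaneMask, compareLessThanZero, addConstant, selectPerLane, conditionalIncreaseX) are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kWaveSize = 64; // lanes per vector register, as in the text

using VectorRegister = std::array<float, kWaveSize>; // e.g. v2: one X per pixel
using LaneMask = std::bitset<kWaveSize>;             // e.g. vcc: one bit per pixel

// Rough analogue of "v_cmp_gt_f32 vcc, 0, v": sets a bit for every lane where 0 > v[lane].
LaneMask compareLessThanZero(const VectorRegister& v) {
    LaneMask mask;
    for (std::size_t lane = 0; lane < kWaveSize; ++lane)
        mask[lane] = 0.0f > v[lane];
    return mask;
}

// Rough analogue of "v_add_f32 dst, constant, v": adds the same constant in every lane.
VectorRegister addConstant(const VectorRegister& v, float constant) {
    VectorRegister result{};
    for (std::size_t lane = 0; lane < kWaveSize; ++lane)
        result[lane] = v[lane] + constant;
    return result;
}

// Rough analogue of "v_cndmask_b32": per lane, picks ifSet where the mask bit is 1,
// and ifClear where it is 0.
VectorRegister selectPerLane(const LaneMask& mask,
                             const VectorRegister& ifClear,
                             const VectorRegister& ifSet) {
    VectorRegister result{};
    for (std::size_t lane = 0; lane < kWaveSize; ++lane)
        result[lane] = mask[lane] ? ifSet[lane] : ifClear[lane];
    return result;
}

// The shader's "if(number.x < 0.0) number.x += 1.0" applied to 64 pixels at once.
VectorRegister conditionalIncreaseX(const VectorRegister& x) {
    const VectorRegister increased = addConstant(x, 1.0f);  // computed for all lanes
    const LaneMask needsIncrease = compareLessThanZero(x);  // 64 booleans, not one
    return selectPerLane(needsIncrease, x, increased);      // keep or replace per lane
}
```

Note that the addition is computed for every lane and the mask only decides which result is kept, which matches the instruction order in the listing: the adds are unconditional and the v_cndmask_b32 instructions apply the condition afterwards.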

Comments | #gpu #rendering Share

# Differences in memory management between Direct3D 12 and Vulkan

Fri, 26 Jul 2019

Since July 2017 I have been developing Vulkan Memory Allocator (VMA) – a C++ library that helps with memory management in games and other applications using Vulkan. But because I deal with both Vulkan and DirectX 12 in my everyday work, I think it’s a good idea to compare them.

This is an article about a very specific topic. It may be useful to you if you are a programmer working with both graphics APIs – Direct3D 12 and Vulkan. These two APIs offer a similar set of features and performance. Both are new-generation, explicit, low-level interfaces to modern graphics hardware (GPUs), so we could compare them side by side to show similarities and differences, e.g. in naming things. For example, the ID3D12CommandQueue::ExecuteCommandLists function has a Vulkan equivalent in the form of the vkQueueSubmit function. However, this article focuses on just one aspect – memory management, which means the rules and limitations of GPU memory allocation and the creation of resources – images (textures, render targets, depth-stencil surfaces etc.) and buffers (vertex buffers, index buffers, constant/uniform buffers etc.). The chapters below describe pretty much all the aspects of memory management that differ between the two APIs.

Read full article »

Comments | #vulkan #directx #gpu Share

# Programming FreeSync 2 support in Direct3D

Sat, 02 Mar 2019

AMD just showed the Oasis demo, presenting the usage of its FreeSync 2 HDR technology. If you wonder how you could implement the same features in your Windows DirectX program or game (it doesn’t matter if you use D3D11 or D3D12), here is an article for you.

But first, a disclaimer: Although I already put it on my “About” page, I’d like to stress that this is my personal blog, so all opinions presented here are my own and do not reflect those of my employer.

Radeon FreeSync (its new, official web page is here: Radeon™ FreeSync™ Technology | FreeSync™ 2 HDR Games) is an AMD technology that covers two different things, which may cause some confusion. First is variable refresh rate, second is HDR. Both of them need to be supported by a monitor. The database of FreeSync compatible monitors and their parameters is: Freesync Monitors.

Read full entry > | Comments | #gpu #directx #windows #graphics Share

# Programming HDR monitor support in Direct3D

Wed, 27 Feb 2019

I got an HDR-capable monitor (LG 32GK850F), so I started learning how I can use its capabilities programmatically. I still have much to learn, as there is a lot of theory to be ingested about color spaces etc., but in this blog post I’d like to go straight to the point: How to enable HDR in your C++ DirectX program? To test this, I used 3 graphics chips from 3 different PC GPU vendors. Below you can see the results of my experiments.

Read full entry > | Comments | #graphics #windows #directx #gpu Share

# How to design API of a library for Vulkan?

Fri, 08 Feb 2019

In my previous blog post yesterday, I shared my thoughts on graphics APIs and libraries. Another problem that brought me to these thoughts is a question: How do you design an API for a library that implements a single algorithm, pass, or graphics effect, using Vulkan or DX12? It may seem trivial at first, like a task that just needs to be designed and implemented, but if you think about it more, it turns out to be a difficult issue. There are few software libraries like this in existence. I don’t mean here a complex library/framework/engine that “horizontally” wraps the entire graphics API and takes it to a higher level, like V-EZ, Nvidia Falcor, or Google Filament. I mean just a small, “vertical”, plug-in library doing one thing, e.g. implementing an ambient occlusion effect, efficient texture mipmap down-sampling, rendering UI, or simulating particle physics on the GPU. Such a library needs to interact efficiently with the rest of the user’s code to be part of a large program or game. Vulkan Memory Allocator is also not a good example of this, because it only manages memory, implements no render passes, involves no shaders, and interacts with a command buffer only in its part related to memory defragmentation.

I met this problem at my work. Later I also discussed it in detail with my colleague. There are multiple questions to consider:

This is a problem similar to what we have with any C++ library. There is no consensus about the implementation of various basic facilities, like strings, containers, asserts, mutexes etc., so every major framework or game engine implements its own. Even something as simple as a min/max function is defined in multiple places. It is defined once in the <algorithm> header, but some developers don’t use STL. <Windows.h> provides its own, but these are defined as macros, so they break any other, unless you #define NOMINMAX before the include… A typical C++ nightmare. Smaller libraries had better be configurable or define everything on their own, like the Vulkan Memory Allocator having its own assert, vector (can be switched to the standard STL one), and 3 versions of read-write mutex.
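
As a small illustration of the min/max clash, here is a sketch of one common way a library can stay robust when it cannot control whether the user defined NOMINMAX. It is not taken from Vulkan Memory Allocator or any other library. The macro below is only a hand-written stand-in for the kind of function-like min macro that <Windows.h> provides, defined here so the example compiles on any platform, and smallerOf is a hypothetical helper name.

```cpp
#include <algorithm>

// Stand-in for the function-like "min" macro from <Windows.h> (an assumption for
// this sketch; on Windows it would come from the header when NOMINMAX is not defined).
#define min(a, b) (((a) < (b)) ? (a) : (b))

// Wrapping the name in parentheses means the token "min" is not directly followed
// by "(", so the function-like macro is not expanded and std::min is called.
inline int smallerOf(int a, int b) {
    return (std::min)(a, b);
}

// Remove the stand-in macro so it does not leak into code that follows.
#undef min
```

Writing std::min(a, b) without the extra parentheses would not compile while the macro is active, because the preprocessor would expand min(a, b) right after the std:: qualifier.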

All these issues make it easier for developers to just write a paper, describe their algorithm, possibly share a piece of code, pseudo-code or a shader, rather than provide a ready-to-use library. This is a very bad situation. I hope that over time patterns will emerge of what the API of a library implementing a single pass or effect using Vulkan/DX12 should look like. Recently my colleague shared an idea with me that if there was some higher-level API that would implement all these interactions between various parts (like resource allocation, image barriers) and we all commonly agreed on using it, then authoring libraries and stitching them together on top of it would be way easier. That’s another argument for the need for such a new, higher-level graphics API.

Comments | #gpu #vulkan #directx #libraries #graphics #c++ Share

# Thoughts on graphics APIs and libraries

Thu, 07 Feb 2019

Warning: This is a long rant. I’d like to share my personal thoughts and opinions on graphics APIs like Vulkan and Direct3D 12.

Some time ago I came up with a diagram showing how graphics software technologies evolved over the last decades – see my blog post “Lower-Level Graphics API - What Does It Mean?”. The new graphics APIs (Direct3D 12, Vulkan, Metal) are not only a clean start, abandoning all the legacy garbage going back to the ‘90s (like glVertex), but they also take graphics programming to a new level. It is a lower level – they are more explicit, closer to the hardware, and better match how modern GPUs work. At least that’s the idea. It means simpler, more efficient, and less error-prone drivers. But they don’t make game or engine programming simpler. Quite the opposite – more responsibilities are now moved to engine developers (e.g. memory management/allocation). Overall, it is commonly considered a good thing though, because the engine has higher-level knowledge of its use cases (e.g. which textures are critically important and which can be unloaded when GPU memory is full), so it can get better performance by doing it properly. All this is hidden in the engines anyway, so developers making their games don’t notice the difference.

Those of you who – just like me – deal with those low-level graphics APIs in their everyday work may wonder if these APIs provide the right level of abstraction. I know it will sound controversial, but sometimes I get a feeling they are at exactly the worst possible level – so low they are difficult to learn and use properly, while so high they still hide some implementation details important for getting good performance. Let’s take image/texture barriers as an example. They were non-existent in previous APIs. Now we have to do them, which is a major pain point when porting old code to a new API. Do too few of them and you get graphical corruption on some GPUs and not on others. Do too many and your performance can be worse than it was on DX11 or OGL. At the same time, they are an abstract concept that still hides multiple things happening under the hood. You can never be sure which barrier will flush some caches, stall the whole graphics pipeline, or convert your texture between internal compression formats on a specific GPU, unless you use some specialized, vendor-specific profiling tool, like Radeon GPU Profiler (RGP).

It’s the same with memory. In DX11 you could just specify the intended resource usage (D3D11_USAGE_IMMUTABLE, D3D11_USAGE_DYNAMIC) and the driver chose the preferred place for it. In Vulkan you have to query for memory heaps available on the current GPU and explicitly choose the one you decide is best for your resource, based on low-level flags like VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT etc. AMD exposes 4 memory types and 3 memory heaps. Nvidia has 11 types and 2 heaps. Intel integrated graphics exposes just 1 heap and 2 types, showing the memory is really unified, while an AMD APU, also integrated, has the same memory model as a discrete card. If you try to match these to what you know about physically existing video RAM and system RAM, it doesn’t make any sense. You could just pick the first DEVICE_LOCAL memory for the fastest GPU access, but even then, you cannot be sure your resource will stay in video RAM. It may be silently migrated to system RAM without your knowledge and consent (e.g. if you run out of memory), which will degrade performance. What is more, there is no way to query for the amount of free GPU memory in Vulkan, unless you do hacks like using DXGI.

Hardware queues are no better. Vulkan claims to give explicit access to the pieces of GPU hardware, so you need to query for the queues that are available. For example, Intel exposes only a single graphics queue. AMD lets you create up to 3 additional compute-only queues and 2 transfer queues. Nvidia has 8 compute queues and 1 transfer queue. Do they all really map to silicon that can work in parallel? I doubt it. So how many of them should you use to get the best performance? There is no way to tell by just using the Vulkan API. AMD promotes doing compute work in parallel with 3D rendering while Nvidia diplomatically advises being “conscious” with it.

It's the same with presentation modes. You have to enumerate the VkPresentModeKHR values available on the machine and choose the right one, along with the number of images in the swapchain. These don't map intuitively to a typical user-facing setting of V-sync = on/off, as they are intended to be low level. Still, you have no control over, and no way to check, whether the driver does "blit" or "flip".

One could say the new APIs don’t deliver on their promise of being low level, explicit, and having predictable performance. It is impossible to deliver, unless the API is specific to one GPU, as is the case on consoles. A common API over different GPUs is always high level, things happen under the hood, and there are still fast and slow paths. Isn’t all this complexity just for nothing? It may be true that compared to previous-generation APIs, drivers for the new ones need not launch additional threads in the background or perform shader compilation on the first draw call, which greatly reduces the chances of major hitching. (We will see how long this state will persist as the APIs and drivers evolve.) * Still, there is no way to predict or ensure minimum FPS/maximum frame time. We are talking about systems where multiple processes compete for resources. On modern PCs there is not even a way to know how many cycles a single instruction will take! Cache memory, branch prediction, out-of-order execution – all of these mechanisms are there in the CPU to speed up average cases, but there can always be cases when it works slowly (e.g. a cache miss). It’s the same with graphics. I think we should abandon the false hope of predictable performance as a thing of the past, just like rendering graphics pixel-perfect. We can optimize for the average, but we cannot ensure the minimum. After all, games are “soft real-time systems”.

Based on that, I wonder whether there is room for a new graphics API on top of DX12 or Vulkan. I don’t mean a whole game engine with physical simulation, sound handling, input controllers and all, like Unity or UE4. I mean an API just like DX11 or OGL, on a similar or higher abstraction level (if higher level, maybe the concept of a persistent “frame graph” with explicit pass and resource dependencies is the way to go?). I also don’t think it’s enough to just reimplement any of those old APIs. The new one should take advantage of features of the explicit APIs (like parallel command buffer recording), while hiding the difficult parts (e.g. queues, memory types, descriptors, barriers), so it’s easier to use and harder to misuse. (An existing library similar to this concept is V-EZ from AMD.) I think it may still have good performance. The key thing needed for the creation of such a library is abandoning the assumption that the developer must define everything up-front, with nothing allocated, created, or transferred on first use.

See also next post: "How to design API of a library for Vulkan?"

Update 2019-02-12: I want to thank all of you for the amazing feedback I received after publishing this post, especially on Twitter. Many projects have been mentioned that try to provide an API better than Vulkan or DX12 - e.g. Apple Metal, WebGPU, The Forge by Confetti.

* Update 2019-04-16: Microsoft just announced they are adding background shader optimizations to D3D12, so the driver can recompile and optimize shaders in the background on its own threads. Congratulations! We are back at D3D11 :P

Update 2021-04-01: Same with pipeline states. In the old days, settings used to be independent, enabled using glEnable or IDirect3DDevice9::SetRenderState. New APIs promised to avoid "non-orthogonal states" - having to recompile shaders on a new draw call (which caused a major hitch) - by enclosing most of the states in a Pipeline (State Object). But they went too far and require a new PSO every time we want to change something simple which almost certainly doesn't go to shader code, like the stencil write mask. That created a new class of problems - having to create thousands of PSOs during loading (which can take minutes), the necessity for shader caches, pipeline caches etc. Vulkan loosened these restrictions by offering "dynamic state" and later extended that with the VK_EXT_extended_dynamic_state extension. So we are back, with just a more complex API to handle :P

Comments | #gpu #optimization #graphics #directx #libraries #vulkan Share

# Vulkan API - my talk at Warsaw University of Technology

Mon, 16 Apr 2018

On Wednesday 16 April, around 8 PM, at Warsaw University of Technology, during the weekly meeting of KNTG Polygon, I will give a talk about "Vulkan API" (in Polish). Come if you want to hear about the new generation of graphics APIs, see what the Vulkan API looks like, what tools are there to support it, what the advantages and disadvantages of using such an API are, and finally decide whether learning Vulkan is a good idea for you.

Event on Facebook: https://www.facebook.com/events/185314825611839/

Slides:
Vulkan API.pdf
Vulkan API.pptx

Comments | #graphics #gpu #vulkan #teaching Share

# Switchable graphics versus D3D11 adapters

Sat, 24 Feb 2018

When you have a laptop with so-called "switchable graphics" (like I do in my Lenovo IdeaPad G50-80), you effectively have two GPUs. In my case, these are: integrated Intel i7-5500U and AMD Radeon R5 M330. While programming in DirectX 11, you can enumerate these two adapters and choose either of them while creating an ID3D11Device object. For quite some time I was wondering how the various settings of this "switchable graphics" affect my app. Today I finally figured it out. Long story short: They just change the order of these adapters as visible to my program, so that the appropriate one is visible as adapter 0. Here is the full story:

It looks like the base setting is the one that can be found in Windows Settings > Power options > edit your power plan > Switchable Dynamic Graphics. (Not to be confused with "AMD Graphics Power Settings"!) When you set it to "Optimize power savings" or "Optimize performance", the application sees the Intel GPU as the first adapter:

When you choose "Maximize performance", the application sees the AMD GPU as the first adapter:

I also found that Radeon Settings (the app that comes with the AMD graphics driver) overrides this system setting. If you go to System > Switchable Graphics and make a configuration for your specific executable, then again: choosing "Power Saving" makes your app see the Intel GPU as the first adapter, while choosing "High Performance" makes the AMD GPU first.

It's as simple as that. Basically, if you always use the first adapter you find, then you follow the recommended settings of the system. You are still free to use the other adapter while creating your D3D11 device. I checked that - it works and it really uses that one.

It's especially important if you meet a strange bug where your app hangs on one of these GPUs.

Update 2018-05-02: Microsoft plans to add an API for enumerating adapters based on a given GPU preference (minimum power or high performance). See IDXGIFactory6::EnumAdapterByGpuPreference.

Update 2018-08-23: See also related article: Selecting the Best Graphics Device to Run a 3D Intensive Application - GPUOpen.

Update 2020-07-09: I've heard that on desktop PCs the behavior of adapter enumeration may be different than on laptops - the first one may be the one which has the monitor connected to it.

Comments | #gpu #directx Share

Copyright © 2004-2022