Tag: optimization

Entries for tag "optimization", ordered from most recent. Entry count: 9.

Warning! Some information on this page is older than 5 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Pages: 1 2 >

# A Better Way to Scalarize a Shader

Tue
20
Oct 2020

This will be an advanced article. It assumes you not only know how to write shaders but also how they work on a low level (like vector versus scalar registers) and how to optimize them using scalarization. It all starts from a need to index into an array of texture or buffer descriptors, where the index is dynamic – it may vary from pixel to pixel. This is useful e.g. when doing bindless-style rendering or blending various layers of textures e.g. on a terrain. To make it working properly in a HLSL shader, you need to surround the indexing operation with a pseudo-function NonUniformResourceIndex. See also my old blog post “Direct3D 12 - Watch out for non-uniform resource index!”.

Texture2D g_Textures[] : register(t1);
...
return g_Textures[NonUniformResourceIndex(textureIndex)].Load(pos);

In many cases, it is enough. The driver will do its magic to make things working properly. But if your logic dependent on textureIndex is more complex than a single Load or SampleGrad, e.g. you sample multiple textures or do some calculations (let's call it MyDynamicTextureIndexing), then it might be beneficial to scalarize the shader manually using a loop and wave functions from HLSL Shader Model 6.0.

I learned how to do scalarization from the 2-part article “Intro to GPU Scalarization” by Francesco Cifariello Ciardi and the presentation “Improved Culling for Tiled and Clustered Rendering” by MichaƂ Drobot, linked from it. Both sources propose an implementation like the following HLSL snippet:

// WORKING, TRADITIONAL
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
while(any(currThreadMask & activeThreadsMask) != 0)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   uint2 scalarTextureIndexThreadMask = WaveActiveBallot(scalarTextureIndex == textureIndex).xy;
   activeThreadsMask &= ~scalarTextureIndexThreadMask;
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
   }
}
return color;

It involves a bit mask of active threads. From the moment I first saw this code, I started wondering: Why is it needed? A mask of threads that still want to continue spinning the loop is already maintained implicitly by the shader compiler. Couldn't we just break; from the loop when done with the textureIndex of the current thread?! So I wrote this short piece of code:

// BAD, CRASHES
float4 color = float4(0.0, 0.0, 0.0, 0.0);
while(true)
{
   uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
   [branch]
   if(scalarTextureIndex == textureIndex)
   {
       color = MyDynamicTextureIndexing(textureIndex);
       break;
   }
}
return color;

…and it crashed my GPU. At first I thought it may be a bug in the shader compiler, but then I recalled footnote [2] in part 2 of the scalarization tutorial, which mentions an issue with helper lanes. Let me elaborate on this. When a shader is executed in SIMT fashion, individual threads (lanes) may be active or inactive. Active lanes are these that do their job. Inactive lanes may be inactive from the very beginning because we are at the edge of a triangle so there are not enough pixels to make use of all the lanes or may be disabled temporarily because e.g. we are executing an if section that some threads didn't want to enter. But in pixel shaders there is a third kind of lanes – helper lanes. These are used instead of inactive lanes to make sure full 2x2 quads always execute the code, which is needed to calculate derivatives ddx/ddy, also done explicitly when sampling a texture to calculate the correct mip level. A helper lane executes the code (like an active lane), but doesn't export its result to the render target (like an inactive lane).

As it turns out, helper lanes also don't contribute to wave functions – they work like inactive lanes. Can you already see the problem here? In the loop shown above, it may happen than a helper lane has its textureIndex different from any active lanes within a wave. It will then never get its turn to process it in a scalar fashion, so it will fall into an infinite loop, causing GPU crash (TDR)!

Then I thought: What if I disable helper lanes just once, before the whole loop? So I came up with the following shader. It seems to work fine. I also think it is better than the first solution, as it operates on the thread bit mask only once at the beginning and so uses fewer variables to be stored in GPU registers and does fewer calculations in every loop iteration. Now I'm thinking whether there is something wrong with my idea that I can't see now? Or did I just invent a better way to scalarize shaders?

// WORKING, NEW
float4 color = float4(0.0, 0.0, 0.0, 0.0);
uint currThreadIndex = WaveGetLaneIndex();
uint2 currThreadMask = uint2(
   currThreadIndex < 32 ? 1u << currThreadIndex : 0,
   currThreadIndex < 32 ? 0 : 1u << (currThreadIndex - 32));
uint2 activeThreadsMask = WaveActiveBallot(true).xy;
[branch]
if(any((currThreadMask & activeThreadsMask) != 0))
{
   while(true)
   {
       uint scalarTextureIndex = WaveReadLaneFirst(textureIndex);
       [branch]
       if(scalarTextureIndex == textureIndex)
       {
           color = MyDynamicTextureIndexing(textureIndex);
           break;
       }
   }
}
return color;

UPDATE 2020-10-28: There are some valuable comments under my tweet about this topic that I recommend to check out.

Comments | #optimization #directx #gpu Share

# Which Values Are Scalar in a Shader?

Wed
14
Oct 2020

GPUs are highly parallel processors. Within one draw call or a compute dispatch there might be thousands or millions of invocations of your shader. Some variables in such a shader have constant value for all invocations in the draw call / dispatch. We can call them constant or uniform. A literal constant like 23.0 is surely such a value and so is a variable read from a constant (uniform) buffer, let’s call it cbScaleFactor, or any calculation on such data, like (cbScaleFactor.x + cbScaleFactor.y) * 2.0 - 1.0.

Other values may vary from thread to thread. These will surely be vertex attributes, as well as system value semantics like SV_Position in a pixel shader (denoting the position of the current pixel on the screen), SV_GroupThreadID in a compute shader (identifier of the current thread within a thread group), and any calculations based on them. For example, sampling a texture using non-constant UV coordinates will result in a non-constant color value.

But there is another level of grouping of threads. GPU cores (Compute Units, Execution Units, CUDA Cores, however we call them) execute a number of threads at once in a SIMD fashion. It would be more correctly to say SIMT. For the explanation of the difference see my old post: “How Do Graphics Cards Execute Vector Instructions?” It’s usually something like 8, 16, 32, 64 threads executing on one core, together called a wave in HLSL and a subgroup in GLSL.

Normally you don’t need to care about this fact. However, recent versions of HLSL and GLSL added intrinsic functions that allow to exchange data between lanes (threads/invocations within a wave/subgroup) - see “HLSL Shader Model 6.0” or “Vulkan Subgroup Tutorial”. Using them may allow to optimize shader performance.

This another level of grouping yields a possibility for a variable to be or not to be uniform (to have the same value) across a single wave, even if it’s not constant across the entire draw call or dispatch. We can also call it scalar, as it tends to go to scalar registers (SGPRs) rather than vector registers (VGPRs) on AMD architecture, which is overall good for performance. Simple cases like the ones I mentioned above still apply. What’s constant across the entire draw call is also scalar within a wave. What varies from thread to thread is not scalar. Some wave functions like WaveReadLaneFirst, WaveActiveMax, WaveActiveAllTrue return the same value for all threads, so it’s always scalar.

Knowing which values are scalar and which ones may not be is necessary in some cases. For example, indexing buffer or texture array requires special keyword NonUniformResourceIndex if the index is not uniform across the wave. I warned about it in my blog post “Direct3D 12 - Watch out for non-uniform resource index!”. Back then I was working on shader compiler at Intel, helping to finish DX12 implementation before the release of Windows 10. Now, 5 years later, it is still a tricky thing to get right.

Another such case is a function WaveReadLaneAt which “returns the value of the expression for the given lane index within the specified wave”. The index of the lane to fetch was required to be scalar, but developers discovered it actually works fine to use a dynamically varying value for it, like Ken Hu in his blog post “HLSL pitfalls”. Now Microsoft formally admitted that it is working and allowed LaneIndex to be any value by making this GitHub commit to their documentation.

If this is so important to know where an argument needs to be scalar and which values are scalar, you should also know about some less obvious, tricky ones.

SV_GroupID in compute shader – identifier of the group within a compute dispatch. This one surely is uniform across the wave. I didn’t search specifications for this topic, but it seems obvious that if a groupshared memory is private to a thread group and a synchronization barrier can be issued across a thread group, threads from different groups cannot be assigned to a single wave. Otherwise everything would break.

SV_InstanceID in vertex shader – index of an instance within an instanced draw call. It looks similar, but the answer is actually opposite. I’ve seen discussions about it many times. It is not guaranteed anywhere that threads in one wave will calculate vertices of the same instance. While inconvenient for those who would like to optimize their vertex shader using wave functions, it actually gives a graphics driver an opportunity to increase utilization by packing vertices from multiple instances into one wave.

SV_GroupThreadID.xyz in compute shader – identifier of the thread within a thread group in a particular dimension. Article “Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2” on GPUOpen.com suggests that by using [numthreads(64,2,1)], you can be sure that waves will be dispatched as 32x1x1 or 64x1x1, so that SV_GroupThreadID.y will be scalar across a wave. It may be true for AMD architecture and other GPUs currently on the market, so relying on this may be a good optimization opportunity on consoles with a known fixed hardware, but it is not formally correct to assume this on any PC. Neither D3D nor Vulkan specification says that threads from a compute thread group are assigned to waves in row-major order. The order is undefined, so theoretically a driver in a new version may decide to spawn waves of 16x2x1. It is also not guaranteed that some mysterious new GPU couldn’t appear in the future that is 128-lane wide. WaveGetLaneCount function says “the result will be between 4 and 128”. Such GPU would execute entire 64x2x1 group as a single wave. In both cases, SV_GroupThreadID.y wouldn’t be scalar.

Long story short: Unless you can prove otherwise, always assume that what is not uniform (constant) across the entire draw call or dispatch is also not uniform (scalar) across the wave.

Comments | #gpu #directx #vulkan #optimization Share

# Thoughts on graphics APIs and libraries

Thu
07
Feb 2019

Warning: This is a long rant. I’d like to share my personal thoughts and opinions on graphics APIs like Vulkan, Direct3D 12.

Some time ago I came up with a diagram showing how the graphics software technologies evolved over last decades – see my blog post “Lower-Level Graphics API - What Does It Mean?”. The new graphics APIs (Direct3D 12, Vulkan, Metal) are not only a clean start, so they abandon all the legacy garbage going back to ‘90s (like glVertex), but they also take graphics programming to a new level. It is a lower level – they are more explicit, closer to the hardware, and better match how modern GPUs work. At least that’s the idea. It means simpler, more efficient, and less error-prone drivers. But they don’t make the game or engine programming simpler. Quite the opposite – more responsibilities are now moved to engine developers (e.g. memory management/allocation). Overall, it is commonly considered a good thing though, because the engine has higher-level knowledge of its use cases (e.g. which textures are critically important and which can be unloaded when GPU memory is full), so it can get better performance by doing it properly. All this is hidden in the engines anyway, so developers making their games don’t notice the difference.

Those of you, who – just like me – deal with those low-level graphics APIs in their everyday work, may wonder if these APIs provide the right level of abstraction. I know it will sound controversial, but sometimes I get a feeling they are at the exactly worst possible level – so low they are difficult to learn and use properly, while so high they still hide some implementation details important for getting a good performance. Let’s take image/texture barriers as an example. They were non-existent in previous APIs. Now we have to do them, which is a major pain point when porting old code to a new API. Do too few of them and you get graphical corruptions on some GPUs and not on the others. Do too many and your performance can be worse than it has been on DX11 or OGL. At the same time, they are an abstract concept that still hides multiple things happening under the hood. You can never be sure which barrier will flush some caches, stall the whole graphics pipeline, or convert your texture between internal compression formats on a specific GPU, unless you use some specialized, vendor-specific profiling tool, like Radeon GPU Profiler (RGP).

It’s the same with memory. In DX11 you could just specify intended resource usage (D3D11_USAGE_IMMUTABLE, D3D11_USAGE_DYNAMIC) and the driver chose preferred place for it. In Vulkan you have to query for memory heaps available on the current GPU and explicitly choose the one you decide best for your resource, based on low-level flags like VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT etc. AMD exposes 4 memory types and 3 memory heaps. Nvidia has 11 types and 2 heaps. Intel integrated graphics exposes just 1 heap and 2 types, showing the memory is really unified, while AMD APU, also integrated, has same memory model as the discrete card. If you try to match these to what you know about physically existing video RAM and system RAM, it doesn’t make any sense. You could just pick the first DEVICE_LOCAL memory for the fastest GPU access, but even then, you cannot be sure your resource will stay in video RAM. It may be silently migrated to system RAM without your knowledge and consent (e.g. if you go out of memory), which will degrade performance. What is more, there is no way to query for the amount of free GPU memory in Vulkan, unless you do hacks like using DXGI.

Hardware queues are no better. Vulkan claims to give explicit access to the pieces of GPU hardware, so you need to query for queues that are available. For example, Intel exposes only a single graphics queue. AMD lets you create up to 3 additional compute-only queues and 2 transfer queues. Nvidia has 8 compute queues and 1 transfer queue. Do they all really map to silicon that can work in parallel? I doubt it. So how many of them to use to get the best performance? There is no way to tell by just using Vulkan API. AMD promotes doing compute work in parallel with 3D rendering while Nvidia diplomatically advises to be “conscious” with it.

It's the same with presentation modes. You have to enumerate VkPresentModeKHR-s available on the machine and choose the right one, along with number of images in the swapchain. These don't map intuitively to a typical user-facing setting of V-sync = on/off, as they are intended to be low level. Still you have no control and no way to check whether the driver does "blit" or "flip".

One could say the new APIs don’t deliver to their promise of being low level, explicit, and having predictable performance. It is impossible to deliver, unless the API is specific to one GPU, like there is on consoles. A common API over different GPUs is always high level, things happen under the hood, and there are still fast and slow paths. Isn’t all this complexity just for nothing? It may be true that comparing to previous generation APIs, drivers for the new ones need not launch additional threads in the background or perform shader compilation on first draw call, which greatly reduces chances of major hitching. (We will see how long this state will persist as the APIs and drivers evolve.) * Still there is no way to predict or ensure minimum FPS/maximum frame time. We are talking about systems where multiple processes compete for resources. On modern PCs there is even no way to know how many cycles will a single instruction take! Cache memory, branch prediction, out-of-order execution – all of these mechanisms are there in the CPU to speed up average cases, but there can always be cases when it works slowly (e.g. cache miss). It’s the same with graphics. I think we should abandon the false hope of predictable performance as a thing of the past, just like rendering graphics pixel-perfect. We can optimize for the average, but we cannot ensure the minimum. After all, games are “soft real-time systems”.

Based on that, I am thinking if there is a room for a new graphics API or top of DX12 or Vulkan. I don’t mean whole game engine with physical simulation, handling sound, input controllers and all, like Unity or UE4. I mean an API just like DX11 or OGL, on a similar or higher abstraction level (if higher level, maybe the concept of persistent “frame graph” with explicit pass and resource dependencies is the way to go?). I also don’t think it’s enough to just reimplement any of those old APIs. The new one should take advantage of features of the explicit APIs (like parallel command buffer recording), while hiding the difficult parts (e.g. queues, memory types, descriptors, barriers), so it’s easier to use and harder to misuse. (An existing library similar to this concept is V-EZ from AMD.) I think it may still have good performance. The key thing needed for creation of such library is abandoning the assumption that developer must define everything up-front, with nothing allocated, created, or transferred on first use.

See also next post: "How to design API of a library for Vulkan?"

Update 2019-02-12: I want to thank all of you for the amazing feedback I received after publishing this post, especially on Twitter. Many projects have been mentioned that try to provide an API better than Vulkan or DX12 - e.g. Apple Metal, WebGPU, The Forge by Confetti.

* Update 2019-04-16: Microsoft just announced they are adding background shader optimizations to D3D12, so driver can recompile and optimize shaders in the background on its own threads. Congratulations! We are back at D3D11 :P

Update 2021-04-01: Same with pipeline states. In the old days, settings used to be independent, enabled using glEnable or ID3D9Device::SetRenderState. New APIs promised to avoid "non-orthogonal states" - having to recompile shaders on a new draw call (which caused a major hitch) by enclosing most of the states in a Pipeline (State Object). But they went too far and require a new PSO every time we want to change something simple which almost certainly doesn't go to shader code, like stencil write mask. That created new class of problems - having to create thousands of PSOs during loading (which can take minutes), necessity for shader caches, pipeline caches etc. Vulkan loosened these restrictions by offering "dynamic state" and later extended that with VK_EXT_extended_dynamic_state extension. So we are back, with just more complex API to handle :P

Comments | #gpu #optimization #graphics #directx #libraries #vulkan Share

# Efficient way of using std::vector

Sat
22
Sep 2018

Some people say that C++ STL is slow. I would rather say it's the way we use it. Sure, there are many potential sources of slowness when using STL. For example, std::list or std::map tends to allocate many small objects and dynamic allocation is a time-consuming operation. Making many copies of your objects like std::string is also costly - that's why I created str_view project. But std::vector is just a wrapper over a dynamically allocated array that also remembers its size and can reallocate when you add new elements. Without STL, you would need to implement the same functionality yourself.

When it comes to traversing elements of the vector (e.g. to sum the numerical values contained in it), there are many ways to do it. STL is notoriously known for working very slow in Debug project configuration, but as it turns out, this heavily depends on what method do you choose for using it.

Here is a small experiment that I've just made. In this code, I create a vector of 100,000,000 integers, then sum its elements using 5 different methods, calculating how much time does it take for each of them. Results (averaged over 5 iterations for each method) are as follows. Notice logarithmic scale on horizontal axis.

Here is the full source code of my testing program:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <chrono>
#include <numeric>

typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::high_resolution_clock::duration duration;
inline time_point now() { return std::chrono::high_resolution_clock::now(); }
inline double durationToMilliseconds(duration d) { return std::chrono::duration<double, std::milli>(d).count(); }

int main()
{
    printf("Iteration,Method,Sum,Time (ms)\n");
    
    for(uint32_t iter = 0; iter < 5; ++iter)
    {
        std::vector<int> numbers(100000000ull);
        numbers[0] = 1; numbers[1] = 2; numbers.back() = 3;

        {
            time_point timeBeg = now();

            // Method 1: Use STL algorithm std::accumulate.
            int sum = std::accumulate(numbers.begin(), numbers.end(), 0);

            printf("%u,accumulate,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 2: Use the new C++11 range-based for loop.
            int sum = 0;
            for(auto value : numbers)
                sum += value;

            printf("%u,Range-based for loop,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 3: Use traditional loop, traverse vector using its iterator.
            int sum = 0;
            for(auto it = numbers.begin(); it != numbers.end(); ++it)
                sum += *it;

            printf("%u,Loop with iterator,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 4: Use traditional loop, traverse using index.
            int sum = 0;
            for(size_t i = 0; i < numbers.size(); ++i)
                sum += numbers[i];

            printf("%u,Loop with indexing,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 5: Get pointer to raw array and its size, then use a loop to traverse it.
            int sum = 0;
            int* dataPtr = numbers.data();
            size_t count = numbers.size();
            for(size_t i = 0; i < count; ++i)
                sum += dataPtr[i];

            printf("%u,Loop with pointer,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }
    }
}

As you can see, some methods are slower than the others in Debug configurations by more than 3 orders of magnitude! The difference is so big that if you write your program or game like this, it may not be possible to use its Debug version with any reasonably-sized input data. But if you look at disassembly, it should be no surprise. For example, method 4 calls vector methods size() and operator[] in every iteration of the loop. We know that in Debug configuration functions are not inilined and optimized, so these are real function calls:

On the other hand, method 5 that operates on raw pointer to the vector's underlying data is not that much slower in Debug configuration comparing to Release. Disassembly from Debug version:

So my conclusion is: Using std::vector to handle memory management and reallocation and using raw pointer to access its data is the best way to go.

My testing environment was:

CPU: Intel Core i7-6700K 4.00 GHz
RAM: DDR4, Dual-Channel, current memory clock 1066 MHz
OS: Windows 10 Version 1803 (OS Build 17134.285)
Compiler: Microsoft Visual Studio Community 2017 Version 15.4.8
Configuration options: x64 Debug/Release
Windows SDK Version 10.0.16299.0

Comments | #stl #c++ #optimization Share

# Rendering Optimization - My Talk at Warsaw University of Technology

Tue
12
Dec 2017

If you happen to be in Warsaw tomorrow (2017-12-13), I'd like to invite you to my talk at Warsaw University of Technology. On the weekly meeting of Polygon group, this time the first talk will be about about artificial intelligence in action games (by Kacpi), followed by mine about rendering optimization. It will be technical, but I think it should be quite easy to understand. I won't show a single line of code. I will just give some tips for getting good performance when rendering 3D graphics on modern GPUs. I will also show some tools that can help with performance profiling. It will be all in Polish. The event starts at 7 p.m. Entrance is free. See also Facebook event. Traditionally after the talks we all go for a beer :)

Comments | #teaching #graphics #gpu #optimization Share

# When QueryPerformanceCounter call takes long time

Sun
03
Dec 2017

QueryPerformanceCounter function is all about measuring time and profiling performance, so I wasn't able to formulate right Google query to find a solution to the problem I had - call to QueryPerformanceCounter function itself taking too much time. Below I describe what I eventually found out.

It all started from hardware failure. My motherboard stopped working, so I needed to buy a new one (ASRock X370 Killer SLI). I know that normally changing motherboard requires reinstalling Windows, but I tried not to do it. The system didn't want to boot, so I booted the PC using pendrive with Windows installer and launched the repair function. It helped - after that Windows was able to start and everything seemed to work... until I launched the program that I develop on that machine. It was running painfully slow.

I tried different things to find out what is happening. Input/output to hard drive or network was not an issue. GPU performance was also OK. It seemed that the app is just doing its calculations slowly, like the CPU was very slow. I double-checked actual CPU and RAM frequency, but it was OK. Finally I launched sampling profiler (the one embedded in Visual Studio - command: Analyze > Performance Profiler). This way I found that most of the time is spent in function QueryPerformanceCounter.

This WinAPI function is the recommended way to obtain a timestamp in Windows. It's very precise, monotonic, safe to use on multiple cores and threads, it has stable frequency independent of CPU power management or Turbo Boost... It's just great, but in order to meet all these requirements, Windows may use different methods to implement it, as described in article Acquiring high-resolution time stamps. Some of them are fast (just reading TSC register), others are slow (require system call - transition to kernel mode).

I wrote a simple C++ program that tests how long it takes to execute QueryPerformanceCounter function. You can see the code here: QueryPerformanceCounterTest.cpp and download 64-bit binary here: QueryPerformanceCounterTest.zip. Running this test on two different machines gave following results:

CPU: Intel Core i7-6700K, Motherboard: GIGABYTE Z170-HD3-CF:

> QueryPerformanceCounterTest.exe 1000000000
Executing QueryPerformanceCounter x 1000000000...
According to GetTickCount64 it took 0:00:11.312 (11.312 ns per call)
According to QueryPerformanceCounter it took 0:00:11.314 (11.314 ns per call)

CPU: AMD Ryzen 7 1700X, Motherboard: ASRock X370 Killer SLI (changed from different model without system reinstall):

> QueryPerformanceCounterTest.exe 10000000
Executing QueryPerformanceCounter x 10000000...
According to GetTickCount64 it took 0:00:24.906 (2490.6 ns per call)
According to QueryPerformanceCounter it took 0:00:24.911 (2491.1 ns per call)

As you can see, the function takes 11 nanoseconds on first platform and 2.49 microsenonds (220 times more!) on the second one. This was the cause of slowness of my program. The program calls this function many times.

I tried to fix it and somehow convince Windows to use the fast implementation. I uninstalled and reinstalled motherboard drivers - the latest ones downloaded from manufacturer website. I upgraded and downgraded BIOS to different versions. I booted the system from Windows installation media and "repaired" it again. I restored default settings in UEFI/BIOS and tried to change "ACPI HPET Table" option there to Disabled/Enabled/Auto. Nothing worked. Finally I restored Windows to factory settings (Settings > Update & Security > Recovery > Reset this PC). This solved my problem, but unfortunately it's like reinstalling Windows from scratch - now I need to install and configure all the apps again. After that the function takes 22 ns on this machine.

My conclusions from this "adventure" are twofold:

  1. It is valid for function QueryPerformanceCounter to execute slowly on some platforms, like for 2.5 microseconds. If you call it just once per rendering frame then it doesn't matter, but you shouldn't profile every small portion of your code with it, calling it millions of times.
  2. Windows 10 still requires reinstallation when changing motherboard. Otherwise, even if it seems to work, you may experience strange issues like this one.

Update 2017-12-11: A colleague told me that enabling/disabling HPET using "bcdedit" system command could possibly help for that issue.

Update 2018-12-17: Blog post "Ryzen Threadripper for Game Development – optimising UE4 build times" on GPUOpen.com, section "HPET timer woes", seems to be related to this topic.

Comments | #windows #optimization #hardware #winapi Share

# Experience with Game in Scheme

Thu
06
Nov 2014

Last week I talked with a colleague about optimizing performance of his game project. It's written in Scheme (functional programming language, dialect of Lisp), using SDL and OpenGL. The bottleneck was GUI rendering.

When analyzing code, we found that the function for rendering text renders TTF font to a new surface with function TTF_RenderUTF8_*, then creates another SDL_Surface to convert format, then creates and fills OpenGL texture with glTexImage2D. The texture is then used just once for rendering... to another freshly allocated SDL_Surface, to perform clipping. All this happens for every GUI control, in every frame! Finally the objects are freed by Scheme garbage collector.

He then asked me whether we should find a way not to render the GUI part of the screen every frame, but only when something changes. He also cited the famous and often misused quote of Donald Knuth: "Premature optimization is the root of all evil".

I explained to him that modern GPU-s are able to render millions of triangles every frame and redrawing whole screen every frame is a standard practice. It's resource creation (allocation and filling of surfaces and textures) what should be avoided and not done every frame. If using SDL_ttf for text rendering, a final texture/surface should be prepared once and used to do just rendering in every frame, until the text changes. Additionally, calls like glViewport or glDisable(GL_DEPTH_TEST) also don't have to be made for every object in every frame.

When thinking about it now, the general rule of game optimization could be:

About functional programming and stuff like that, I think such high level concepts are good for software development as long as they are accompanied with understanding of what's under the hood. Some principles of functional programming, like avoiding side effects and just transforming one list of items to the other, are similar to the idea of DOD (Data-Oriented Design), used in game development for efficiency and scalability. But I don't agree that thinking about efficiency should be avoided until necessary or that optimization makes code complex and unreadable. Quite contrary - I believe simple code is both readable and efficient.

As I have little experience with functional programming, for me it was also fascinating to learn basics of Scheme. It's based on simple principles yet it is so powerful. I'm now thinking about how a language like this could be applied everywhere, from program configuration and as a description language, to game scripting and procedural media generation :)

Comments | #functional #optimization Share

# Writing Efficient C++ Code - my Article in ProgramistaMag

Sat
27
Apr 2013

I started cooperation with Programista - Polish magazine for programmers and other IT professionals. You can find my first article in the latest issue 4/2013 (11).

My article - "Writing Efficient C++ Code" - is about achieving best possible efficiency of native C++ code. It describes data-oriented design as an alternative to pure object-oriented philosophy and shows its advantages, mostly related to consciously designed layout of data in memory, which makes good use of CPU cache. It also mentions operations that should be avoided if code is to be efficient, shows some language tricks and Visual C++ project options that help with generating efficient code and mentions parallelization.

Besides, in each issue of the magazine you can find interesting articles about different programming languages, libraries and technologies, as well as interviews, book reviews and other articles related to software development. The magazine is available for subscription in electronic and paper form.

Comments | #c++ #optimization #productions Share

Pages: 1 2 >

[Download] [Dropbox] [pub] [Mirror] [Privacy policy]
Copyright © 2004-2021