Tag: optimization

Entries for tag "optimization", ordered from most recent. Entry count: 7.

Warning! Some information on this page is older than 5 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.


# Thoughts on graphics APIs and libraries

Thu 07 Feb 2019

Warning: This is a long rant. I'd like to share my personal thoughts and opinions on graphics APIs like Vulkan and Direct3D 12.

Some time ago I came up with a diagram showing how graphics software technologies have evolved over the last decades – see my blog post "Lower-Level Graphics API - What Does It Mean?". The new graphics APIs (Direct3D 12, Vulkan, Metal) are not only a clean start that abandons all the legacy garbage going back to the '90s (like glVertex), but they also take graphics programming to a new level. It is a lower level – they are more explicit, closer to the hardware, and better match how modern GPUs work. At least that's the idea. It means simpler, more efficient, and less error-prone drivers. But they don't make game or engine programming simpler. Quite the opposite – more responsibilities are now moved to engine developers (e.g. memory management/allocation). Overall, this is commonly considered a good thing, because the engine has higher-level knowledge of its use cases (e.g. which textures are critically important and which can be unloaded when GPU memory is full), so it can get better performance by handling them properly. All this is hidden inside the engines anyway, so developers making their games don't notice the difference.

Those of you who – just like me – deal with these low-level graphics APIs in your everyday work may wonder whether they provide the right level of abstraction. I know it will sound controversial, but sometimes I get the feeling they sit at exactly the worst possible level – so low that they are difficult to learn and use properly, yet so high that they still hide implementation details important for getting good performance. Let's take image/texture barriers as an example. They were non-existent in previous APIs. Now we have to issue them, which is a major pain point when porting old code to a new API. Issue too few of them and you get graphical corruption on some GPUs and not on others. Issue too many and your performance can be worse than it was on DX11 or OGL. At the same time, they are an abstract concept that still hides multiple things happening under the hood. You can never be sure which barrier will flush some caches, stall the whole graphics pipeline, or convert your texture between internal compression formats on a specific GPU, unless you use a specialized, vendor-specific profiling tool like Radeon GPU Profiler (RGP).
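
For instance, a single transition of a render target to a shader-readable state already looks roughly like this (a minimal sketch; the helper name is mine, and the structure only names stages, access masks, and layouts – none of the actual hardware work they may trigger):

#include <vulkan/vulkan.h>

// Make a color attachment readable by fragment shaders. Assumes a command buffer
// in recording state and an image that was just rendered to.
void TransitionToShaderRead(VkCommandBuffer cmdBuf, VkImage image)
{
    VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = image;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmdBuf,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // after color writes finish
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // before fragment shader reads
        0, 0, nullptr, 0, nullptr, 1, &barrier);
}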

It's the same with memory. In DX11 you could just specify the intended resource usage (D3D11_USAGE_IMMUTABLE, D3D11_USAGE_DYNAMIC) and the driver chose the preferred place for it. In Vulkan you have to query for the memory heaps available on the current GPU and explicitly choose the one you decide is best for your resource, based on low-level flags like VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, etc. AMD exposes 4 memory types and 3 memory heaps. Nvidia has 11 types and 2 heaps. Intel integrated graphics exposes just 1 heap and 2 types, showing the memory is really unified, while an AMD APU, also integrated, has the same memory model as the discrete cards. If you try to match these to what you know about physically existing video RAM and system RAM, it doesn't make any sense. You could just pick the first DEVICE_LOCAL memory for the fastest GPU access, but even then you cannot be sure your resource will stay in video RAM. It may be silently migrated to system RAM without your knowledge and consent (e.g. if you run out of memory), which will degrade performance. What is more, there is no way to query for the amount of free GPU memory in Vulkan, unless you do hacks like using DXGI.
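
For illustration, the selection Vulkan requires boils down to something like this (a minimal sketch; the helper name is mine, and a real engine adds fallbacks, budget tracking, and sub-allocation on top):

#include <vulkan/vulkan.h>
#include <cstdint>

// Returns the index of the first memory type that is allowed by the resource's
// memoryTypeBits (from vkGet*MemoryRequirements) and has all requested flags,
// or UINT32_MAX if there is none.
uint32_t FindMemoryType(VkPhysicalDevice physicalDevice,
                        uint32_t memoryTypeBits,
                        VkMemoryPropertyFlags requiredFlags)
{
    VkPhysicalDeviceMemoryProperties memProps = {};
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);

    for(uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
    {
        const bool typeAllowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (memProps.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if(typeAllowed && hasFlags)
            return i;
    }
    return UINT32_MAX;
}

Typical usage is to ask for DEVICE_LOCAL first and fall back to HOST_VISIBLE | HOST_COHERENT – and even then, as described above, nothing guarantees the allocation stays where you asked.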

Hardware queues are no better. Vulkan claims to give explicit access to the pieces of GPU hardware, so you need to query for the queues that are available. For example, Intel exposes only a single graphics queue. AMD lets you create up to 3 additional compute-only queues and 2 transfer queues. Nvidia has 8 compute queues and 1 transfer queue. Do they all really map to silicon that can work in parallel? I doubt it. So how many of them should you use to get the best performance? There is no way to tell by just using the Vulkan API. AMD promotes doing compute work in parallel with 3D rendering, while Nvidia diplomatically advises to be "conscious" with it.
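
For the record, this is all the information the API gives you to base that decision on (a minimal sketch around vkGetPhysicalDeviceQueueFamilyProperties; the helper name is mine):

#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

// Lists the queue families of a GPU: how many queues each has and what they can do.
// The API tells you how many queues exist, not how many really run in parallel.
void PrintQueueFamilies(VkPhysicalDevice physicalDevice)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    for(uint32_t i = 0; i < count; ++i)
    {
        const VkQueueFlags flags = families[i].queueFlags;
        printf("Family %u: %u queue(s)%s%s%s\n",
            i, families[i].queueCount,
            (flags & VK_QUEUE_GRAPHICS_BIT) ? " GRAPHICS" : "",
            (flags & VK_QUEUE_COMPUTE_BIT)  ? " COMPUTE"  : "",
            (flags & VK_QUEUE_TRANSFER_BIT) ? " TRANSFER" : "");
    }
}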

It's the same with presentation modes. You have to enumerate the VkPresentModeKHR-s available on the machine and choose the right one, along with the number of images in the swapchain. These don't map intuitively to a typical user-facing setting of V-sync = on/off, as they are intended to be low level. Still, you have no control over, and no way to check, whether the driver does a "blit" or a "flip".
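
Mapping a user-facing V-sync switch onto those modes typically ends up looking something like this (a minimal sketch; the helper name is mine, and whether the driver then blits or flips is still out of your hands):

#include <vulkan/vulkan.h>
#include <vector>
#include <algorithm>

// Picks a present mode from what the surface supports. FIFO is the only mode
// guaranteed to exist; "V-sync off" is usually approximated with IMMEDIATE or MAILBOX.
VkPresentModeKHR ChoosePresentMode(VkPhysicalDevice physicalDevice,
                                   VkSurfaceKHR surface,
                                   bool vsync)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceSurfacePresentModesKHR(physicalDevice, surface, &count, nullptr);
    std::vector<VkPresentModeKHR> modes(count);
    vkGetPhysicalDeviceSurfacePresentModesKHR(physicalDevice, surface, &count, modes.data());

    auto available = [&](VkPresentModeKHR m) {
        return std::find(modes.begin(), modes.end(), m) != modes.end();
    };

    if(!vsync && available(VK_PRESENT_MODE_IMMEDIATE_KHR))
        return VK_PRESENT_MODE_IMMEDIATE_KHR;
    if(!vsync && available(VK_PRESENT_MODE_MAILBOX_KHR))
        return VK_PRESENT_MODE_MAILBOX_KHR;
    return VK_PRESENT_MODE_FIFO_KHR; // always supported
}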

One could say the new APIs don't deliver on their promise of being low level, explicit, and having predictable performance. That is impossible to deliver unless the API is specific to one GPU, as it is on consoles. A common API over different GPUs is always high level: things happen under the hood, and there are still fast and slow paths. Isn't all this complexity just for nothing? It may be true that, compared to previous-generation APIs, drivers for the new ones don't need to launch additional threads in the background or perform shader compilation on the first draw call, which greatly reduces the chances of major hitching. (We will see how long this state persists as the APIs and drivers evolve.) Still, there is no way to predict or ensure a minimum FPS/maximum frame time. We are talking about systems where multiple processes compete for resources. On modern PCs there is not even a way to know how many cycles a single instruction will take! Cache memory, branch prediction, out-of-order execution – all of these mechanisms are there in the CPU to speed up the average case, but there can always be cases where it works slowly (e.g. a cache miss). It's the same with graphics. I think we should treat predictable performance as a thing of the past and abandon that false hope, just like pixel-perfect rendering. We can optimize for the average, but we cannot ensure the minimum. After all, games are "soft real-time systems".

Based on that, I am wondering whether there is room for a new graphics API on top of DX12 or Vulkan. I don't mean a whole game engine with physics simulation, sound, input controllers and all, like Unity or UE4. I mean an API just like DX11 or OGL, on a similar or higher abstraction level (if higher level, maybe the concept of a persistent "frame graph" with explicit pass and resource dependencies is the way to go?). I also don't think it's enough to just reimplement any of those old APIs. The new one should take advantage of the features of the explicit APIs (like parallel command buffer recording) while hiding the difficult parts (e.g. queues, memory types, descriptors, barriers), so it's easier to use and harder to misuse. (An existing library similar to this concept is V-EZ from AMD.) I think it could still have good performance. The key thing needed to create such a library is abandoning the assumption that the developer must define everything up front, with nothing allocated, created, or transferred on first use.

See also next post: "How to design API of a library for Vulkan?"

Update 2019-02-12: I want to thank all of you for the amazing feedback I received after publishing this post, especially on Twitter. Many projects have been mentioned that try to provide an API better than Vulkan or DX12 - e.g. Apple Metal, WebGPU, The Forge by Confetti.

Comments | #gpu #optimization #graphics #libraries #directx #vulkan Share

# Efficient way of using std::vector

Sat 22 Sep 2018

Some people say that C++ STL is slow. I would rather say it's about the way we use it. Sure, there are many potential sources of slowness when using STL. For example, std::list or std::map tends to allocate many small objects, and dynamic allocation is a time-consuming operation. Making many copies of your objects, as with std::string, is also costly - that's why I created the str_view project. But std::vector is just a wrapper over a dynamically allocated array that also remembers its size and can reallocate when you add new elements. Without STL, you would need to implement the same functionality yourself.

When it comes to traversing the elements of a vector (e.g. to sum the numerical values contained in it), there are many ways to do it. STL is notorious for being very slow in the Debug project configuration, but as it turns out, this heavily depends on which method you choose for using it.

Here is a small experiment that I've just made. In this code, I create a vector of 100,000,000 integers, then sum its elements using 5 different methods, measuring how much time each of them takes. The results (averaged over 5 iterations for each method) are as follows - notice the logarithmic scale on the horizontal axis.

Here is the full source code of my testing program:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <chrono>
#include <numeric>

typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::high_resolution_clock::duration duration;
inline time_point now() { return std::chrono::high_resolution_clock::now(); }
inline double durationToMilliseconds(duration d) { return std::chrono::duration<double, std::milli>(d).count(); }

int main()
{
    printf("Iteration,Method,Sum,Time (ms)\n");
    
    for(uint32_t iter = 0; iter < 5; ++iter)
    {
        std::vector<int> numbers(100000000ull);
        numbers[0] = 1; numbers[1] = 2; numbers.back() = 3;

        {
            time_point timeBeg = now();

            // Method 1: Use STL algorithm std::accumulate.
            int sum = std::accumulate(numbers.begin(), numbers.end(), 0);

            printf("%u,accumulate,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 2: Use the new C++11 range-based for loop.
            int sum = 0;
            for(auto value : numbers)
                sum += value;

            printf("%u,Range-based for loop,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 3: Use traditional loop, traverse vector using its iterator.
            int sum = 0;
            for(auto it = numbers.begin(); it != numbers.end(); ++it)
                sum += *it;

            printf("%u,Loop with iterator,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 4: Use traditional loop, traverse using index.
            int sum = 0;
            for(size_t i = 0; i < numbers.size(); ++i)
                sum += numbers[i];

            printf("%u,Loop with indexing,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 5: Get pointer to raw array and its size, then use a loop to traverse it.
            int sum = 0;
            int* dataPtr = numbers.data();
            size_t count = numbers.size();
            for(size_t i = 0; i < count; ++i)
                sum += dataPtr[i];

            printf("%u,Loop with pointer,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }
    }
}

As you can see, some methods are slower than others in the Debug configuration by more than 3 orders of magnitude! The difference is so big that if you write your program or game like this, it may not be possible to use its Debug version with any reasonably-sized input data. But if you look at the disassembly, it should be no surprise. For example, method 4 calls the vector methods size() and operator[] in every iteration of the loop. We know that in the Debug configuration functions are not inlined or optimized, so these are real function calls.

On the other hand, method 5, which operates on a raw pointer to the vector's underlying data, is not that much slower in the Debug configuration compared to Release, as its disassembly from the Debug version shows.

So my conclusion is: using std::vector to handle memory management and reallocation, while using a raw pointer to access its data, is the best way to go.

My testing environment was:

CPU: Intel Core i7-6700K 4.00 GHz
RAM: DDR4, Dual-Channel, current memory clock 1066 MHz
OS: Windows 10 Version 1803 (OS Build 17134.285)
Compiler: Microsoft Visual Studio Community 2017 Version 15.4.8
Configuration options: x64 Debug/Release
Windows SDK Version 10.0.16299.0

Comments | #stl #c++ #optimization Share

# Rendering Optimization - My Talk at Warsaw University of Technology

Tue 12 Dec 2017

If you happen to be in Warsaw tomorrow (2017-12-13), I'd like to invite you to my talk at the Warsaw University of Technology. At the weekly meeting of the Polygon group, the first talk this time will be about artificial intelligence in action games (by Kacpi), followed by mine about rendering optimization. It will be technical, but I think it should be quite easy to understand. I won't show a single line of code. I will just give some tips for getting good performance when rendering 3D graphics on modern GPUs. I will also show some tools that can help with performance profiling. It will all be in Polish. The event starts at 7 p.m. Entrance is free. See also the Facebook event. Traditionally, after the talks we all go for a beer :)

Comments | #teaching #graphics #gpu #optimization Share

# When QueryPerformanceCounter call takes long time

Sun 03 Dec 2017

The QueryPerformanceCounter function is all about measuring time and profiling performance, so I wasn't able to formulate the right Google query to find a solution to the problem I had - a call to the QueryPerformanceCounter function itself taking too much time. Below I describe what I eventually found out.

It all started with a hardware failure. My motherboard stopped working, so I needed to buy a new one (ASRock X370 Killer SLI). I know that normally changing the motherboard requires reinstalling Windows, but I tried not to do it. The system didn't want to boot, so I booted the PC using a pendrive with the Windows installer and launched the repair function. It helped - after that, Windows was able to start and everything seemed to work... until I launched the program that I develop on that machine. It was running painfully slowly.

I tried different things to find out what was happening. Input/output to the hard drive or network was not an issue. GPU performance was also OK. It seemed that the app was just doing its calculations slowly, as if the CPU were very slow. I double-checked the actual CPU and RAM frequency, but it was OK. Finally, I launched a sampling profiler (the one embedded in Visual Studio - command: Analyze > Performance Profiler). This way I found that most of the time was spent in the QueryPerformanceCounter function.

This WinAPI function is the recommended way to obtain a timestamp in Windows. It's very precise, monotonic, safe to use on multiple cores and threads, and it has a stable frequency independent of CPU power management or Turbo Boost... It's just great, but in order to meet all these requirements, Windows may use different methods to implement it, as described in the article Acquiring high-resolution time stamps. Some of them are fast (just reading the TSC register), others are slow (they require a system call - a transition to kernel mode).
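
Measuring which case a given machine falls into is straightforward - for example, something along these lines (a minimal sketch, not the original test program linked below):

#include <windows.h>
#include <cstdio>
#include <cstdint>

int main()
{
    // Call QueryPerformanceCounter many times and compute the average cost per call.
    const uint64_t callCount = 10000000;
    LARGE_INTEGER freq, begin, end, dummy;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&begin);
    for(uint64_t i = 0; i < callCount; ++i)
        QueryPerformanceCounter(&dummy); // the call being measured
    QueryPerformanceCounter(&end);

    const double seconds = double(end.QuadPart - begin.QuadPart) / double(freq.QuadPart);
    printf("%.1f ns per call\n", seconds * 1e9 / double(callCount));
}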

I wrote a simple C++ program that tests how long it takes to execute the QueryPerformanceCounter function. You can see the code here: QueryPerformanceCounterTest.cpp and download a 64-bit binary here: QueryPerformanceCounterTest.zip. Running this test on two different machines gave the following results:

CPU: Intel Core i7-6700K, Motherboard: GIGABYTE Z170-HD3-CF:

> QueryPerformanceCounterTest.exe 1000000000
Executing QueryPerformanceCounter x 1000000000...
According to GetTickCount64 it took 0:00:11.312 (11.312 ns per call)
According to QueryPerformanceCounter it took 0:00:11.314 (11.314 ns per call)

CPU: AMD Ryzen 7 1700X, Motherboard: ASRock X370 Killer SLI (changed from a different model without system reinstall):

> QueryPerformanceCounterTest.exe 10000000
Executing QueryPerformanceCounter x 10000000...
According to GetTickCount64 it took 0:00:24.906 (2490.6 ns per call)
According to QueryPerformanceCounter it took 0:00:24.911 (2491.1 ns per call)

As you can see, the function takes 11 nanoseconds on the first platform and 2.49 microseconds (220 times more!) on the second one. This was the cause of my program's slowness - it calls this function many times.

I tried to fix it and somehow convince Windows to use the fast implementation. I uninstalled and reinstalled the motherboard drivers - the latest ones downloaded from the manufacturer's website. I upgraded and downgraded the BIOS to different versions. I booted the system from the Windows installation media and "repaired" it again. I restored the default settings in UEFI/BIOS and tried changing the "ACPI HPET Table" option there to Disabled/Enabled/Auto. Nothing worked. Finally, I restored Windows to factory settings (Settings > Update & Security > Recovery > Reset this PC). This solved my problem, but unfortunately it's like reinstalling Windows from scratch - now I need to install and configure all my apps again. After that, the function takes 22 ns on this machine.

My conclusions from this "adventure" are twofold:

  1. It is valid for the QueryPerformanceCounter function to execute slowly on some platforms - e.g. taking 2.5 microseconds per call. If you call it just once per rendering frame, it doesn't matter, but you shouldn't profile every small portion of your code with it, calling it millions of times.
  2. Windows 10 still requires reinstallation when changing the motherboard. Otherwise, even if it seems to work, you may experience strange issues like this one.

Update 2017-12-11: A colleague told me that enabling/disabling HPET using the "bcdedit" system command could possibly help with that issue.

Update 2018-12-17: Blog post "Ryzen Threadripper for Game Development – optimising UE4 build times" on GPUOpen.com, section "HPET timer woes", seems to be related to this topic.

Comments | #windows #optimization #hardware #winapi Share

# Experience with Game in Scheme

Thu 06 Nov 2014

Last week I talked with a colleague about optimizing the performance of his game project. It's written in Scheme (a functional programming language, a dialect of Lisp), using SDL and OpenGL. The bottleneck was GUI rendering.

When analyzing the code, we found that the function for rendering text renders a TTF font to a new surface with a TTF_RenderUTF8_* function, then creates another SDL_Surface to convert the format, then creates and fills an OpenGL texture with glTexImage2D. The texture is then used just once for rendering... to another freshly allocated SDL_Surface, to perform clipping. All this happens for every GUI control, in every frame! Finally, the objects are freed by the Scheme garbage collector.

He then asked me whether we should find a way not to render the GUI part of the screen every frame, but only when something changes. He also cited the famous and often misused quote of Donald Knuth: "Premature optimization is the root of all evil".

I explained to him that modern GPUs are able to render millions of triangles every frame, and redrawing the whole screen every frame is standard practice. It's resource creation (allocation and filling of surfaces and textures) that should be avoided, not done every frame. If using SDL_ttf for text rendering, a final texture/surface should be prepared once and then only rendered every frame, until the text changes. Additionally, calls like glViewport or glDisable(GL_DEPTH_TEST) also don't have to be made for every object in every frame.
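
As a C++ illustration of that pattern (not the actual Scheme code - just a minimal sketch of the idea, with my own helper names), caching the rendered text could look like this:

#include <SDL_ttf.h>
#include <SDL_opengl.h>
#include <string>

struct TextLabel
{
    std::string text;
    GLuint texture = 0;
    int width = 0, height = 0;
};

// Called only when the text changes, NOT every frame.
void UpdateTextLabel(TextLabel& label, TTF_Font* font, const std::string& newText)
{
    if(newText == label.text && label.texture != 0)
        return; // nothing changed - keep the cached texture

    label.text = newText;
    SDL_Color white = {255, 255, 255, 255};
    SDL_Surface* surface = TTF_RenderUTF8_Blended(font, newText.c_str(), white);

    if(label.texture == 0)
        glGenTextures(1, &label.texture);
    glBindTexture(GL_TEXTURE_2D, label.texture);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    // Assumes a 32-bit surface; a real implementation should check surface->format
    // and convert if necessary.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, surface->w, surface->h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, surface->pixels);

    label.width = surface->w;
    label.height = surface->h;
    SDL_FreeSurface(surface);
}

Every frame the GUI code then just binds label.texture and draws a quad - no allocation, no glTexImage2D, nothing for the garbage collector to clean up.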

When thinking about it now, the general rule of game optimization could be: prepare and create resources up-front, so that the per-frame code does nothing but use them for rendering.

As for functional programming and similar high-level concepts, I think they are good for software development as long as they are accompanied by an understanding of what's under the hood. Some principles of functional programming, like avoiding side effects and just transforming one list of items into another, are similar to the idea of DOD (Data-Oriented Design), used in game development for efficiency and scalability. But I don't agree that thinking about efficiency should be avoided until necessary, or that optimization makes code complex and unreadable. Quite the contrary - I believe simple code is both readable and efficient.

As I have little experience with functional programming, it was also fascinating for me to learn the basics of Scheme. It's based on simple principles, yet it is so powerful. I'm now thinking about how a language like this could be applied everywhere, from program configuration and as a description language, to game scripting and procedural media generation :)

Comments | #functional #optimization Share

# Writing Efficient C++ Code - my Article in ProgramistaMag

Sat 27 Apr 2013

I started cooperating with Programista - a Polish magazine for programmers and other IT professionals. You can find my first article in the latest issue, 4/2013 (11).

My article - "Writing Efficient C++ Code" - is about achieving the best possible efficiency of native C++ code. It describes data-oriented design as an alternative to the pure object-oriented philosophy and shows its advantages, mostly related to a consciously designed layout of data in memory, which makes good use of the CPU cache. It also mentions operations that should be avoided if code is to be efficient, shows some language tricks and Visual C++ project options that help generate efficient code, and mentions parallelization.

Besides, in each issue of the magazine you can find interesting articles about different programming languages, libraries and technologies, as well as interviews, book reviews and other articles related to software development. The magazine is available for subscription in electronic and paper form.

Comments | #c++ #optimization #productions Share

# IGK Conference - Our Slides

Mon 28 Mar 2011

The 8th Polish Game Engineering Conference (VIII Ogólnopolska Konferencja Inżynierii Gier Komputerowych), IGK-8'2011, is over, so now we can publish the slides from our presentation:

Tworzenie wydajnego kodu c++ w podejściu zorientowanym na dane (PDF)

It's in Polish. Here is the abstract of the paper:

(Polish) Artykuł opisuje wady programowania obiektowego – zarówno od strony projektowej, jak i ze względu na wydajność kodu. Porusza problem opóźnienia w dostępie do pamięci RAM we współczesnych architekturach komputerowych. Przedstawia programowanie zorientowane na dane (ang. DOD –­ Data-Oriented Design) jako alternatywne podejście do projektowania i implementowania silnika gry kładące nacisk na optymalizację struktur danych pod kątem szybkości. Porusza także problem wydajności poszczególnych konstrukcji języka C++. 

(English) The paper describes the pitfalls of object-oriented programming – from the design perspective as well as in terms of code performance. It discusses the problem of latency in accessing RAM on today's computer architectures. It presents DOD (Data-Oriented Design) as an alternative approach to designing and implementing a game engine, focused on optimizing data structures for speed. It also covers the efficiency of particular C++ language constructs.

Comments | #events #teaching #warsztat #igk #c++ #optimization Share
