Tag: optimization

Entries for tag "optimization", ordered from most recent. Entry count: 6.

Warning! Some information on this page is older than 5 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Pages: 1

# Efficient way of using std::vector

Sat
22
Sep 2018

Some people say that C++ STL is slow. I would rather say it's the way we use it. Sure, there are many potential sources of slowness when using STL. For example, std::list or std::map tends to allocate many small objects and dynamic allocation is a time-consuming operation. Making many copies of your objects like std::string is also costly - that's why I created str_view project. But std::vector is just a wrapper over a dynamically allocated array that also remembers its size and can reallocate when you add new elements. Without STL, you would need to implement the same functionality yourself.

When it comes to traversing elements of the vector (e.g. to sum the numerical values contained in it), there are many ways to do it. STL is notoriously known for working very slow in Debug project configuration, but as it turns out, this heavily depends on what method do you choose for using it.

Here is a small experiment that I've just made. In this code, I create a vector of 100,000,000 integers, then sum its elements using 5 different methods, calculating how much time does it take for each of them. Results (averaged over 5 iterations for each method) are as follows. Notice logarithmic scale on horizontal axis.

Here is the full source code of my testing program:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <chrono>
#include <numeric>

typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::high_resolution_clock::duration duration;
inline time_point now() { return std::chrono::high_resolution_clock::now(); }
inline double durationToMilliseconds(duration d) { return std::chrono::duration<double, std::milli>(d).count(); }

int main()
{
    printf("Iteration,Method,Sum,Time (ms)\n");
    
    for(uint32_t iter = 0; iter < 5; ++iter)
    {
        std::vector<int> numbers(100000000ull);
        numbers[0] = 1; numbers[1] = 2; numbers.back() = 3;

        {
            time_point timeBeg = now();

            // Method 1: Use STL algorithm std::accumulate.
            int sum = std::accumulate(numbers.begin(), numbers.end(), 0);

            printf("%u,accumulate,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 2: Use the new C++11 range-based for loop.
            int sum = 0;
            for(auto value : numbers)
                sum += value;

            printf("%u,Range-based for loop,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 3: Use traditional loop, traverse vector using its iterator.
            int sum = 0;
            for(auto it = numbers.begin(); it != numbers.end(); ++it)
                sum += *it;

            printf("%u,Loop with iterator,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 4: Use traditional loop, traverse using index.
            int sum = 0;
            for(size_t i = 0; i < numbers.size(); ++i)
                sum += numbers[i];

            printf("%u,Loop with indexing,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 5: Get pointer to raw array and its size, then use a loop to traverse it.
            int sum = 0;
            int* dataPtr = numbers.data();
            size_t count = numbers.size();
            for(size_t i = 0; i < count; ++i)
                sum += dataPtr[i];

            printf("%u,Loop with pointer,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }
    }
}

As you can see, some methods are slower than the others in Debug configurations by more than 3 orders of magnitude! The difference is so big that if you write your program or game like this, it may not be possible to use its Debug version with any reasonably-sized input data. But if you look at disassembly, it should be no surprise. For example, method 4 calls vector methods size() and operator[] in every iteration of the loop. We know that in Debug configuration functions are not inilined and optimized, so these are real function calls:

On the other hand, method 5 that operates on raw pointer to the vector's underlying data is not that much slower in Debug configuration comparing to Release. Disassembly from Debug version:

So my conclusion is: Using std::vector to handle memory management and reallocation and using raw pointer to access its data is the best way to go.

My testing environment was:

CPU: Intel Core i7-6700K 4.00 GHz
RAM: DDR4, Dual-Channel, current memory clock 1066 MHz
OS: Windows 10 Version 1803 (OS Build 17134.285)
Compiler: Microsoft Visual Studio Community 2017 Version 15.4.8
Configuration options: x64 Debug/Release
Windows SDK Version 10.0.16299.0

Comments | #stl #c++ #optimization Share

# Rendering Optimization - My Talk at Warsaw University of Technology

Tue
12
Dec 2017

If you happen to be in Warsaw tomorrow (2017-12-13), I'd like to invite you to my talk at Warsaw University of Technology. On the weekly meeting of Polygon group, this time the first talk will be about about artificial intelligence in action games (by Kacpi), followed by mine about rendering optimization. It will be technical, but I think it should be quite easy to understand. I won't show a single line of code. I will just give some tips for getting good performance when rendering 3D graphics on modern GPUs. I will also show some tools that can help with performance profiling. It will be all in Polish. The event starts at 7 p.m. Entrance is free. See also Facebook event. Traditionally after the talks we all go for a beer :)

Comments | #teaching #graphics #gpu #optimization Share

# When QueryPerformanceCounter call takes long time

Sun
03
Dec 2017

QueryPerformanceCounter function is all about measuring time and profiling performance, so I wasn't able to formulate right Google query to find a solution to the problem I had - call to QueryPerformanceCounter function itself taking too much time. Below I describe what I eventually found out.

It all started from hardware failure. My motherboard stopped working, so I needed to buy a new one (ASRock X370 Killer SLI). I know that normally changing motherboard requires reinstalling Windows, but I tried not to do it. The system didn't want to boot, so I booted the PC using pendrive with Windows installer and launched the repair function. It helped - after that Windows was able to start and everything seemed to work... until I launched the program that I develop on that machine. It was running painfully slow.

I tried different things to find out what is happening. Input/output to hard drive or network was not an issue. GPU performance was also OK. It seemed that the app is just doing its calculations slowly, like the CPU was very slow. I double-checked actual CPU and RAM frequency, but it was OK. Finally I launched sampling profiler (the one embedded in Visual Studio - command: Analyze > Performance Profiler). This way I found that most of the time is spent in function QueryPerformanceCounter.

This WinAPI function is the recommended way to obtain a timestamp in Windows. It's very precise, monotonic, safe to use on multiple cores and threads, it has stable frequency independent of CPU power management or Turbo Boost... It's just great, but in order to meet all these requirements, Windows may use different methods to implement it, as described in article Acquiring high-resolution time stamps. Some of them are fast (just reading TSC register), others are slow (require system call - transition to kernel mode).

I wrote a simple C++ program that tests how long it takes to execute QueryPerformanceCounter function. You can see the code here: QueryPerformanceCounterTest.cpp and download 64-bit binary here: QueryPerformanceCounterTest.zip. Running this test on two different machines gave following results:

CPU: Intel Core i7-6700K, Motherboard: GIGABYTE Z170-HD3-CF:

> QueryPerformanceCounterTest.exe 1000000000
Executing QueryPerformanceCounter x 1000000000...
According to GetTickCount64 it took 0:00:11.312 (11.312 ns per call)
According to QueryPerformanceCounter it took 0:00:11.314 (11.314 ns per call)

CPU: AMD Ryzen 7 1700X, Motherboard: ASRock X370 Killer SLI (changed from different model without system reinstall):

> QueryPerformanceCounterTest.exe 10000000
Executing QueryPerformanceCounter x 10000000...
According to GetTickCount64 it took 0:00:24.906 (2490.6 ns per call)
According to QueryPerformanceCounter it took 0:00:24.911 (2491.1 ns per call)

As you can see, the function takes 11 nanoseconds on first platform and 2.49 microsenonds (220 times more!) on the second one. This was the cause of slowness of my program. The program calls this function many times.

I tried to fix it and somehow convince Windows to use the fast implementation. I uninstalled and reinstalled motherboard drivers - the latest ones downloaded from manufacturer website. I upgraded and downgraded BIOS to different versions. I booted the system from Windows installation media and "repaired" it again. I restored default settings in UEFI/BIOS and tried to change "ACPI HPET Table" option there to Disabled/Enabled/Auto. Nothing worked. Finally I restored Windows to factory settings (Settings > Update & Security > Recovery > Reset this PC). This solved my problem, but unfortunately it's like reinstalling Windows from scratch - now I need to install and configure all the apps again. After that the function takes 22 ns on this machine.

My conclusions from this "adventure" are twofold:

  1. It is valid for function QueryPerformanceCounter to execute slowly on some platforms, like for 2.5 microseconds. If you call it just once per rendering frame then it doesn't matter, but you shouldn't profile every small portion of your code with it, calling it millions of times.
  2. Windows 10 still requires reinstallation when changing motherboard. Otherwise, even if it seems to work, you may experience strange issues like this one.

Update 2017-12-11: A colleague told me that enabling/disabling HPET using "bcdedit" system command could possibly help for that issue.

Comments | #winapi #optimization #hardware #windows Share

# Experience with Game in Scheme

Thu
06
Nov 2014

Last week I talked with a colleague about optimizing performance of his game project. It's written in Scheme (functional programming language, dialect of Lisp), using SDL and OpenGL. The bottleneck was GUI rendering.

When analyzing code, we found that the function for rendering text renders TTF font to a new surface with function TTF_RenderUTF8_*, then creates another SDL_Surface to convert format, then creates and fills OpenGL texture with glTexImage2D. The texture is then used just once for rendering... to another freshly allocated SDL_Surface, to perform clipping. All this happens for every GUI control, in every frame! Finally the objects are freed by Scheme garbage collector.

He then asked me whether we should find a way not to render the GUI part of the screen every frame, but only when something changes. He also cited the famous and often misused quote of Donald Knuth: "Premature optimization is the root of all evil".

I explained to him that modern GPU-s are able to render millions of triangles every frame and redrawing whole screen every frame is a standard practice. It's resource creation (allocation and filling of surfaces and textures) what should be avoided and not done every frame. If using SDL_ttf for text rendering, a final texture/surface should be prepared once and used to do just rendering in every frame, until the text changes. Additionally, calls like glViewport or glDisable(GL_DEPTH_TEST) also don't have to be made for every object in every frame.

When thinking about it now, the general rule of game optimization could be:

About functional programming and stuff like that, I think such high level concepts are good for software development as long as they are accompanied with understanding of what's under the hood. Some principles of functional programming, like avoiding side effects and just transforming one list of items to the other, are similar to the idea of DOD (Data-Oriented Design), used in game development for efficiency and scalability. But I don't agree that thinking about efficiency should be avoided until necessary or that optimization makes code complex and unreadable. Quite contrary - I believe simple code is both readable and efficient.

As I have little experience with functional programming, for me it was also fascinating to learn basics of Scheme. It's based on simple principles yet it is so powerful. I'm now thinking about how a language like this could be applied everywhere, from program configuration and as a description language, to game scripting and procedural media generation :)

Comments | #functional #optimization Share

# Writing Efficient C++ Code - my Article in ProgramistaMag

Sat
27
Apr 2013

I started cooperation with Programista - Polish magazine for programmers and other IT professionals. You can find my first article in the latest issue 4/2013 (11).

My article - "Writing Efficient C++ Code" - is about achieving best possible efficiency of native C++ code. It describes data-oriented design as an alternative to pure object-oriented philosophy and shows its advantages, mostly related to consciously designed layout of data in memory, which makes good use of CPU cache. It also mentions operations that should be avoided if code is to be efficient, shows some language tricks and Visual C++ project options that help with generating efficient code and mentions parallelization.

Besides, in each issue of the magazine you can find interesting articles about different programming languages, libraries and technologies, as well as interviews, book reviews and other articles related to software development. The magazine is available for subscription in electronic and paper form.

Comments | #c++ #optimization #productions Share

# IGK Conference - Our Slides

Mon
28
Mar 2011

8th Polish Game Engineering Conference (VIII Ogólnopolska Konferencja Inżynierii Gier Komputerowych) IGK-8'2011 is over so now we can publish slides from our presentation:

Tworzenie wydajnego kodu c++ w podejściu zorientowanym na dane (PDF)

It's in Polish. Here is abstract of the paper:

(Polish) Artykuł opisuje wady programowania obiektowego – zarówno od strony projektowej, jak i ze względu na wydajność kodu. Porusza problem opóźnienia w dostępie do pamięci RAM we współczesnych architekturach komputerowych. Przedstawia programowanie zorientowane na dane (ang. DOD –­ Data-Oriented Design) jako alternatywne podejście do projektowania i implementowania silnika gry kładące nacisk na optymalizację struktur danych pod kątem szybkości. Porusza także problem wydajności poszczególnych konstrukcji języka C++. 

(English) Paper describes pitfalls of object-oriented programming – from the design perspective, as well as regarding code performance. It mentions problem of latency in accessing data in RAM memory on today computer architectures. It shows DOD (Data-Oriented Design) as an alternative approach to design and implementation of a game engine focused on optimizing data structures in terms of performance. It also describes efficiency of different C++ language constructs. 

Comments | #events #teaching #warsztat #igk #c++ #optimization Share

Pages: 1

STAT NO AD
[Stat] [STAT NO AD] [Download] [Dropbox] [pub] [Mirror] [Privacy policy]
Copyright © 2004-2018