I recently broke my rule of posting on my blog at least once a month as I had some other topics and problems to handle in my life, but I'm still alive, still doing graphics programming for a living, so I hope to get back to blogging now. This post is more like a question rather than an answer. It is about creative use of GPU fixed-function hardware. Warning: It may be pretty difficult for beginners, full of graphics programming terms you should already know to understand it. But first, here is some background:
I remember the times when graphics cards were only configurable, not programmable. There were no shaders, only a set of parameters that could control pre-defined operations - transform of vertices, texturing and lighting of pixels. Then, shaders appeared. They evolved by supporting more instructions to be executed and a wider variety of instructions available. At some point, even before the invention of compute shaders, the term “general-purpose computing on GPU” (GPGPU) appeared. Developers started encoding some data as RGBA colors of texture pixels and drawing full-screen quads just to launch calculation of some non-graphical tasks, implemented as pixel shaders. Soon after, compute shaders appeared, so they no longer need to pretend anything - they can now spawn a set of threads that can just read and write memory freely through Direct3D unordered access views aka Vulkan storage images and buffers.
GPUs seem to become more universal over time, with more and more workloads done as compute shaders these days. Will we end up with some generic, highly parallel compute machines with no fixed-function hardware? I don’t know. But Nanite technology from the new Unreal Engine 5 makes a step in this direction by implementing its own rasterizer for some of its triangles, in form of a compute shader. I recommend a good article about it: “A Macro View of Nanite – The Code Corsair” (it seems the link is broken already - here is a copy on Wayback Machine Internet Archive). Apparently, for tiny triangles of around single pixel size, custom rasterization is faster than what GPUs provide by default.
But in the same article we can read that Epic also does something opposite in Nanite: they use some fixed-function parts of the graphics pipeline very creatively. When applying materials in screen space, they render a full-screen pass per each material, but instead of drawing just a full-screen triangle, they do a regular triangle grid with quads covering tiles of NxN pixels. They then perform a coarse-grained culling of these tiles in a vertex shader. In order to reject one, they output vertex position = NaN, which makes a triangle incorrect and not spawning any pixels. Then, a more fine-grained culling is performed using Z-test. Per-pixel material identifier is encoded as depth in a depth buffer! This can be fast, as modern GPUs apply “HiZ” - an internal optimization to reject whole groups of pixels that fail Z-test even before their pixel shaders are launched.
This reminded me of another creative use of the graphics pipeline I observed in one game a few years ago. That pass was calculating luminance histogram of a scene. They also rendered a regular grid of geometry in screen space, but with “point list” topology. Each vertex was sampling and calculating average luminance of its region. On the other end, the histogram texture of Nx1 pixels was bound as a render target. Measured luminance of a region was returned as vertex position, while incrementation of the specific place on the histogram was ensured using additive blending. I suspect this is not the most optimal way of doing this, a compute shader using atomics could probably do it faster, but it surely was very creative and took me some time to figure out what that pass is really doing and how is it doing it.
After all, GPUs have many fixed-function elements next to their shader cores. Vertex fetch, texture sampling (with mip level calculation, trilinear and anisotropic filtering), tessellation, rasterization, blending, all kinds of primitive culling and pixel testing, even vertex homogeneous divide... Although not included in the calculation of TFLOPS power, these are real transistors with compute capabilities, just very specialized. Do you know any other smart, creative uses of them?