Principles of Compute (Part 2)


In the last post, we talked about the motivation to write data-parallel code in order to scale the performance of our programs with hardware resources. We also saw some basic designs for parallel algorithms, mostly in the abstract.

In this post, we’ll go into more detail about what aspects of hardware design exist to increase the performance of data-parallel code. There is a lot of overlap between both CPU and GPU design in this area, so this is quite generally applicable knowledge when writing parallel code. The goal is to identify aspects of hardware design that we can rely on without knowing too much about the underlying architecture, since this allows us to write code that stands the test of time. Naturally, there are still differences between CPU and GPU hardware, so these differences will be highlighted too.

Instruction Parallelism and SIMD

For the purpose of discussion, let’s consider the following parallel loop:

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0f;
    }
}

Since this is a trivially parallel loop, we can straightforwardly apply parallelism techniques to accelerate it. However, before we actually modify the code ourselves, let’s consider what the processor could do to run this code in parallel for us automatically.

Automatic Instruction Parallelism

In theory, the underlying processor could automatically determine that this code can run in parallel by observing the reads and writes being made. At each iteration, it would see a read from data[i] into a temporary register, some math on that temporary register, then a write back to data[i]. In theory, the processor could internally build dependency graphs that represent the dependencies between all reads and writes, defer the evaluation of this graph until a lot of work has been accumulated, then evaluate the work in the dependency graph by executing different code paths in parallel.

The above can sound a little bit like science fiction, but it does happen to a limited extent in processors today. Processors like CPUs and GPUs can automatically execute instructions in parallel if there does not exist a data hazard between their inputs and outputs. For example, if one instruction reads from memory and a subsequent instruction writes that memory, the processor will wait for the read to finish before executing the write, perhaps using a technique like scoreboarding. If there does not exist such a data hazard, the processor might execute the two instructions in parallel. Additionally, some processors may be able to automatically remove superfluous data dependencies using register renaming, or by making guesses on the future state of the data using speculative execution to avoid having to wait for the results to arrive.

Of course, relying on processor smarts comes at a cost. The hardware becomes more expensive, these features come with their own overhead, and it’s hard to trust that these optimizations are really happening unless you understand the processor at a very deep level (and perhaps have professional tools for verifying it, like Intel VTune). Two instructions that can execute in parallel might also be separated by enough code in between them that the processor is not able to see that they can be executed in parallel, and that the compiler isn’t allowed to perform the optimization safely either.

For example, the following addition of a print statement to “times_two” might make it too complicated for the compiler and processor to safely execute iterations of the loop in parallel, since it can’t know if the implementation of “printf” might somehow affect the contents of the “data” array.

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0f;
        printf("data[%d] = %f\n", i, data[i]);
    }
}

Manual Instruction Parallelism

In theory, we might be able to help the compiler and processor identify instructions that can run in parallel by explicitly writing them out. For example, the following code attempts (possibly in vain) to help the compiler and processor to recognize that the operations on the data array can be done in parallel, by iterating over it in steps of 4.

void times_two(float* data, int n)
{
    assert(n % 4 == 0);

    for (int i = 0; i < n; i += 4) {
        for (int j = 0; j < 4; j++) {
            data[i + j] = data[i + j] * 2.0f;
        }

        for (int j = 0; j < 4; j++) {
            printf("data[%d] = %f\n", i + j, data[i + j]);
        }
    }
}

Assuming the processor can actually understand the intention of this code, the situation is still not great. This setup produces much more machine code, which may hurt the efficiency of the instruction cache, and it still relies on the processor’s ability to dynamically identify instructions that can execute in parallel.

In the face of the difficulties in automatically executing instructions in parallel, hardware designers have created instructions that allow explicitly declaring operations that run in parallel. These instructions are known as SIMD instructions, meaning “Single Instruction Multiple Data”. As hinted by “multiple data”, these instructions are very well suited to exploit data-parallelism, allowing either the compiler or the programmer to assist the processor in recognizing work that can be done in parallel.

SIMD instruction sets include parallel versions of typical arithmetic instructions. For example, the ADDPS instruction on Intel processors computes 4 additions in one instruction, which explicitly indicates to the processor that these 4 additions can be executed in parallel. Since ADDPS needs quadruple the inputs and outputs, it is defined on 128-bit registers, as opposed to the 32-bit registers used by ordinary scalar float instructions. One 128-bit register is big enough to store four different 32-bit floats, so that’s how the quadrupled inputs and outputs are stored. You can experiment with SIMD instruction sets using so-called compiler intrinsics, which allow you to use these SIMD instructions from within your C/C++ code as if they were ordinary C functions. You can use these by including a compiler-supplied header like xmmintrin.h.
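To make this concrete, here is a minimal sketch of using ADDPS through its _mm_add_ps intrinsic (the function name add4 is just for illustration; this requires an x86 processor with SSE support):

```c
#include <xmmintrin.h>

// Computes out[i] = a[i] + b[i] for i = 0..3 using one packed add.
void add4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);      // load four floats into a 128-bit register
    __m128 vb = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb); // one instruction, four additions
    _mm_storeu_ps(out, vsum);
}
```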

As an example of applying SIMD, consider the reworked “times_two” example:

void times_two(float* data, int n)
{
    assert(n % 4 == 0);

    // two = { 2, 2, 2, 2 }
    __m128 two = _mm_set1_ps(2.0f);

    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&data[i]);
        x = _mm_mul_ps(x, two);
        _mm_store_ps(&data[i], x);
    }
}

This code implements the times_two example at the start of this section, but computes 4 multiplications in every iteration. There are a few complications:

  • The syntax is ugly and verbose.
  • It requires the input array’s size to be a multiple of 4 (unless we add a special case).
  • It requires the input array to be aligned to 16 bytes (for best performance).
  • The code is less portable.
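
For what it’s worth, the first two complications can be worked around. Here is a sketch of one way to do it, using unaligned loads and a scalar “tail” loop for the leftover elements (the name times_two_any_n is mine; this still assumes x86 with SSE):

```c
#include <xmmintrin.h>

// Like the SIMD times_two above, but accepts any n and any alignment.
void times_two_any_n(float* data, int n)
{
    __m128 two = _mm_set1_ps(2.0f);

    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(&data[i]); // unaligned load: no 16-byte requirement
        x = _mm_mul_ps(x, two);
        _mm_storeu_ps(&data[i], x);
    }
    for (; i < n; i++) {                   // scalar tail: handles the n % 4 leftovers
        data[i] = data[i] * 2.0f;
    }
}
```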

Writing code that assumes a SIMD width of 4 is also non-ideal when you consider that newer Intel processors can do 8-wide or 16-wide vector operations. Different GPUs also execute SIMD code in a variety of widths. Clearly, there are many good reasons to want to abstract the specifics of the instruction set in order to write more portable code more easily, which can be done automatically through the design of the programming language. This is something we’ll talk about later.

I hope we can agree that SIMD adds another level to our data-parallel decomposition. Earlier, we talked about splitting work into data-parallel pieces and distributing them among cores, and now we know that we can get more parallelism by running SIMD code on each core. SIMD compounds with multi-core in a multiplicative way, meaning that a 4-core processor with 8-wide SIMD has a maximum theoretical performance improvement of 4 * 8 = 32.

Memory Access Latency Hiding

With the compute power of multi-core and SIMD, our execution time quickly becomes bound by the speed of memory access rather than the speed of computation. This means that the performance of our algorithm is mostly dictated by how fast we can access the memory of the dataset, rather than the time it takes to perform the actual computation.

In principle, we can use our newly found computation power of multi-core and SIMD to lower the overhead of memory access by investing our spare computation in compression and decompression. For example, it may be highly advantageous to store your data set using 16-bit floating point numbers rather than 32-bit floating point numbers. You might convert these floats back to 32-bit for the purpose of running your math, but it might still be a win if the bottleneck of the operation is memory access. This is an example of lossy compression.
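As a sketch of the idea (using 16-bit fixed-point rather than true half-floats, to keep the example portable; the function names and the assumption that values lie in [0, 1] are mine, for illustration):

```c
#include <stdint.h>

// Hypothetical lossy compression: values known to lie in [0, 1] are
// stored as 16-bit fixed-point numbers, halving the bytes read from
// memory at the cost of some precision.
uint16_t compress01(float x)
{
    return (uint16_t)(x * 65535.0f + 0.5f);
}

float decompress01(uint16_t q)
{
    return q / 65535.0f;
}
```

The round-trip error is at most about half a step of 1/65535, which is often acceptable for data like colors or audio samples.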

Another way to deal with the overhead of memory access is to create hardware that hides the latency of memory access by doing compute work while waiting for it. On CPUs, this is done through Hyper-Threading (Intel’s implementation of simultaneous multithreading), which allows you to efficiently create more threads than there are physical cores: the core works on a second thread of work where it would normally sit idle waiting for the first thread’s memory access to complete.

A significant factor in improving the latency of memory access lies in the design of your algorithm and data structures. For example, if your algorithm accesses memory in predictable patterns, the processor is more likely to guess what you’re doing and start fetching memory ahead of your requests, making it more likely to be ready when the time comes to access it. Furthermore, if successive memory operations access addresses that are close to each other, it’s more likely that the memory you want is already in cache. Needless to say, cache locality is very important.
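A classic illustration of this is traversing a 2D array in row-major versus column-major order. Both functions below compute the same sum, but the first walks memory sequentially (cache- and prefetcher-friendly), while the second jumps a whole row ahead on every access (a sketch; the 256x256 size is arbitrary):

```c
#define ROWS 256
#define COLS 256

// Sequential walk: successive accesses share cache lines and form a
// predictable pattern the prefetcher can follow.
float sum_row_major(const float m[ROWS][COLS])
{
    float sum = 0.0f;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

// Strided walk: each access jumps COLS floats ahead, touching a new
// cache line almost every time.
float sum_col_major(const float m[ROWS][COLS])
{
    float sum = 0.0f;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```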

Beyond cache effects, you may also be able to hide the latency of memory access by starting many memory operations as early as possible. If you have a loop that runs 10 times and does a load from memory at the beginning of every iteration, you might improve the performance of your code by doing the loads “up front”. For example:

void foo()
{
    int loads[10];

    for (int i = 0; i < 10; i++) {
        loads[i] = load_thing(i);
    }

    for (int i = 0; i < 10; i++) {
        // ... use loads[i] here ...
    }
}
This optimization is more likely to be beneficial on an in-order core rather than an out-of-order core. This is because in-order cores execute instructions in the order in which they are read by the program, while out-of-order cores execute instructions based on the order of data-dependencies between instructions. Since out-of-order cores execute instructions in order of data-dependency, the loads that were pulled out of the loop might not execute early despite our best efforts. Instead, they might only execute at the time where the result of the load is first used.

Optimizations based on playing with the order of instructions are generally less useful on out-of-order cores. In my experience, such optimizations on out-of-order cores often have no noticeable effect, and can even be a detriment to performance. CPU cores these days are typically out-of-order, but GPU cores are typically in-order, which makes these kinds of optimizations more interesting there. An optimization like the one above might already be done by the GPU shader compiler without your help, but it can be worth experimenting with yourself, either by manually changing the code or by using compiler hints like HLSL’s [unroll] attribute.

That said, I’ve seen cases where manually tweaking memory access has been extremely beneficial even on modern Intel processors. For example, recent Intel processors support a SIMD instruction called “gather”, which can take as input a pointer to an array and four 32-bit indices into the array, and performs these indexed loads in a single instruction (by the way, the equivalent for parallel indexed stores to an array is called “scatter”). As expected from its memory access, the gather instruction has relatively high latency. This can be a problem in the implementation of, for example, Perlin noise. Perlin noise implementations use a triple indirection into an array to produce seemingly random numbers, in the style of “data[data[data[i] + j] + k]”. Since each gather depends on the result of the previous one, the three gathers must happen completely sequentially, which means the processor basically idles for three times the gather’s memory latency. Manually factoring out redundant gathers or tweaking the Perlin noise algorithm can get you a long way.
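To see the dependency chain, here is a sketch of that triple indirection in scalar form. The permutation table contents below are a made-up stand-in, not a real Perlin table:

```c
// Hypothetical stand-in for a Perlin-style permutation table.
static unsigned char perm[256];

static void init_perm(void)
{
    for (int i = 0; i < 256; i++)
        perm[i] = (unsigned char)((i * 31 + 7) & 255);
}

// Three dependent lookups: the processor cannot start the second load
// until the first has returned, nor the third until the second has.
static int hash3(int i, int j, int k)
{
    return perm[(perm[(perm[i & 255] + j) & 255] + k) & 255];
}
```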

On the topic of dependent loads, a common target of criticism for its memory access patterns is the linked list. If you’re traversing a linked list whose nodes are randomly scattered in memory, you can certainly expect the access of each successive node to cause a cache miss, and this is generally true. However, there’s another problem with linked lists, and it relates to memory access latency: when iterating over a linked list, there’s no way for the processor to go to the next node until the address of the next node has been loaded. This means you bear the full brunt of memory access latency at every step of the list. As a mitigation, you might want to consider keeping more than one pointer per node, for example by using an octree rather than a binary tree.
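A related mitigation for lists specifically is the “unrolled” linked list, which stores several elements per node so that each latency-bound pointer chase is amortized over many elements. A sketch (the names and chunk size are mine):

```c
#include <stddef.h>

#define CHUNK 8

// Several values per node: one dependent next-pointer load pays for
// up to CHUNK elements of useful work.
struct unrolled_node {
    int count;                   // how many entries of values[] are in use
    int values[CHUNK];
    struct unrolled_node* next;
};

int sum_unrolled(const struct unrolled_node* n)
{
    int sum = 0;
    for (; n != NULL; n = n->next)       // one pointer chase per CHUNK elements
        for (int i = 0; i < n->count; i++)
            sum += n->values[i];
    return sum;
}
```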

CPUs vs GPUs?

I’ve been mostly talking about CPUs so far, since they’re generally more familiar and easier to work with in example programs. However, pretty much everything I’ve said applies to GPUs as well. GPUs are also multi-core processors, they just have smaller cores. The cores on GPUs run SIMD code similarly to CPUs, with SIMD widths like 8, 16, or 32 depending on the architecture. GPUs have much smaller caches, which makes sense, since synchronizing caches between multiple processors is inherently a very ugly problem that works against data-parallelism. By having smaller caches, GPUs pay a much higher cost for memory access, which they amortize by heavy use of spare computation while waiting for their memory accesses to complete, very similar in principle to Hyper-Threading.

Conclusion/To Be Continued

In the next post, we’ll talk about how the data-parallel uses of multi-core and SIMD are abstracted by current compute languages, so we can finally understand the idiosyncrasies of compute languages.


Principles of Compute (Part 1)


The free lunch is over, and all software engineers need to understand how to write parallel code if they want to continue seeing performance improvements with future CPUs. Similarly, we have to think parallel if we want to take advantage of massively parallel computation resources like GPUs or FPGAs.

To fully exploit these hardware resources, people started designing so-called compute languages. For example, the CUDA language allows you to write massively parallel programs to run on NVIDIA GPUs, and the OpenCL language allows you to write massively parallel programs that can efficiently target CPUs, GPUs, and FPGAs. Graphics APIs like OpenGL and DirectX also allow you to write massively parallel geometry and pixel processing programs that run on GPUs, and on CPUs too with software renderers like WARP.

In recent years, OpenGL and DirectX have embraced so-called “compute shaders”, which allow general-purpose compute programs to run alongside the traditional geometry and pixel processing programs. Compute shaders brought significant innovation to renderer design, and are now an integral part of advanced rendering engines.

So Why Do We Care?

Let’s take a step back and think. Everything we’re talking about here is strictly for the purpose of optimization. Compute programs don’t let you implement logic that wasn’t possible to implement previously. At best, compute languages let you implement the same logic more efficiently, at the likely cost of more development time. Of course, performance is still very important. If we didn’t care about performance, we wouldn’t be using GPUs at all, but if you’re a gamer or a game developer, you probably know how important high-performance GPUs are for 3D gaming graphics to work at enjoyable frame rates.

The reason I’m emphasizing the importance of optimization is that writing efficient programs is highly dependent on your understanding of the relevant computer hardware. We’ve entered the domain of embedded systems, meaning that we are designing software systems with the intention of running them on a specific kind of hardware. As a software engineer, you are solving a problem (“I want to make such-and-such video game”), and you must solve this problem with a hybrid hardware-software solution. Hardware details are more than just a good conversation starter at software development conferences… Understanding the hardware is your job, and it’s the reason employers will pay top dollar to borrow your brain. It’s the argument of kings: This is embedded systems. We care.

Why Read This Article

The goal of this article is to gain an intuitive understanding of how compute works. Instead of memorizing arbitrary OpenGL/OpenCL/DirectX/CUDA terminology, we’ll put ourselves in the shoes of somebody who wants to write programs that fully exploit hardware resources, and we’ll see how this leads to the design of the compute languages we know and love today.

By understanding the principles behind the API, you can mostly spare yourself from having to memorize any API details at all. You can make your design on paper independently from whether you’re using OpenGL or DirectX or whatever, then you can translate it to whatever API-specific syntax when the time comes to implement your design.

Furthermore, if you do try to understand compute through only (eg) OpenGL’s documentation, you might find it very difficult. The specifications of compute languages are vague by nature, since they give generous flexibility to be implemented on widely different hardware. This vagueness makes it hard to understand these specifications… Unless you already have a lot of background knowledge about the underlying hardware, in which case you probably barely need to read the specification in the first place. To deal with this vagueness, this article tries to teach this required background knowledge that specifications otherwise omit.

Finally, it’s important to realize that contemporary compute APIs like OpenGL and DirectX are evolving at a fast pace. In 5 years, compute won’t look the same. In 10 years, compute capabilities will have evolved significantly in flexibility and performance. In 20 years, what we’re doing now will look primitive and uncivilized. If you’re reading this now, you’re probably going to see this evolution first-hand. Maybe you’ll be a user, and maybe you’ll be an implementer too. Either way, you have the potential to shape the future, so let’s think about how things should be rather than how things are.

Scaling with Hardware

Previously, developers could write non-parallel (scalar) code, and see their programs run significantly faster with every new generation of hardware. Unfortunately, as CPU designers run up against the physical limits of the universe, it has become more cost-efficient to, for example, create a CPU with two cores at the same (or lower) frequency, rather than trying to create a CPU with a single core at double the frequency.

There is an obvious trend of hardware designers deciding to replicate hardware resources rather than trying to speed up existing hardware resources. As before, we would like to write programs that improve in performance over time with new generations of hardware. To do this, we have to write programs that improve in performance as hardware resources become more and more replicated in the future.

Replicated hardware resources allow us to improve the performance of our programs, but only in a specific kind of way. Seeing this requires a shift in mindset. Consider, for example, using a binary search tree or a hash table to optimize a linear search. Thinking about data structure improvements to improve search time is certainly important, but we also have to consider if and how these data structures would speed up the program if the user ran it with additional hardware resources.

In contrast to single-threaded optimizations, scaling with additional hardware requires distributing the data among available hardware resources. For example, with replicated comparison circuits, we can accelerate comparisons while searching by doing more than one comparison in parallel. With replicated arithmetic hardware, we can compute the hash code for more than one search query simultaneously. With replicated hardware for memory access, we can accelerate the access of the data structure we’re searching in. What these optimizations have in common is that they distribute the data of the input domain among hardware resources, an approach formally known as data-parallelism. Identifying ways to run your code in a data-parallel way is the key to writing code that scales with hardware.
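As a small sketch of what “distributing the data” means in practice, here is one common way to split n items into contiguous chunks, one per hardware resource (the function name is mine):

```c
// Divide n items among `workers` resources. Worker w gets the
// half-open range [*begin, *end); the first n % workers chunks are
// one element larger, and every item lands in exactly one chunk.
void chunk_range(int n, int workers, int w, int* begin, int* end)
{
    int base = n / workers;   // minimum chunk size
    int extra = n % workers;  // leftover items, given to the first workers
    *begin = w * base + (w < extra ? w : extra);
    *end = *begin + base + (w < extra ? 1 : 0);
}
```

Each worker can then process its own range independently, with no memory shared between ranges.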

The Red Herring: System-on-a-Thread

In the earlier days of game engine programming, developers were thinking about scaling with hardware by associating software systems to cores. For example, a game programmer might have decided to exploit a dual core CPU by running their graphics system on one core and their audio system on the second core, which is done by creating one operating system thread for each of these systems. This so-called “system-on-a-thread” solution has some serious scalability problems. If the audio thread finishes processing one frame of audio before the graphics thread finishes processing one frame of video, then the audio thread has to become idle while waiting for the graphics thread to catch up. In that period of time, only one out of two CPU cores is active, which means that computation power is left on the table.

Idling cores isn’t the only problem with system-on-a-thread. What happens if you want to add a physics system to the game engine, and the physics system also wants to run on its own thread? In this case, we have 3 system threads but only 2 CPU cores. This is called over-subscription: there are more threads than cores. This is undesirable, since the CPU cores have now become contended resources between the threads.

One core can only run one thread at a time, so the operating system will need to share each core’s running time among the threads assigned to it. When the operating system switches a core from running one thread to running another thread, an expensive context-switch happens. In the worst case, your program might spend more time context-switching between threads than doing useful work.

There is an important lesson to learn from this: Additional parallelism in software can only be useful if it can be efficiently allocated to hardware resources. When it comes down to it, software systems can’t use more circuits than what physically exists on the chip. Only hardware designers have the power to add additional circuits, and as a software designer you have to work with what you’re given.

The problems with system-on-a-thread underline the importance of data-parallelism. If we identify general ways of splitting up the data we’re working on among hardware resources, then we can efficiently split up work between available hardware, independently of how much hardware we really have.

Designing Parallel Programs

To design data-parallel programs, we need to understand what can and can’t be made parallel. As implied by the term “data-parallelism”, the goal is to divide the work into pieces that can be executed independently on the data that needs to be processed, which allows us to execute these pieces of work in parallel. By designing work that can run independently from other work, we guarantee that the execution of one piece of work won’t affect the results of another piece of work. This is important, since work that executes in parallel will likely not execute in a consistent order.

The Problem with Threads

Let’s take a moment to appreciate the problems that arise when parallel code doesn’t run in a consistent order.

Consider the following example program:

/* Thread 1 */           |  /* Thread 2 */
a = 1;                   |  a = 0;
printf("a == %d\n", a);  |

This program is made up of two threads. The order in which the two threads execute their code is unpredictable, since it depends on the inner workings of the C compiler, the operating system and the hardware.

In one run of the program, the execution might happen in the following order:

a = 0;
a = 1;
printf("a == %d\n", a); // -> prints "a == 1"

In this case, Thread 2 ran its first line of code first, then Thread 1 ran to completion. As a result, the program finished with “a” set to 1.

In another run of the program, the execution might happen in this different order:

a = 1;
a = 0;
printf("a == %d\n", a); // -> prints "a == 0"

In this case, Thread 1 ran its first line of code, then Thread 2 ran its first line of code, then Thread 1 ran its second line of code. As a result, the program finished with “a” set to 0.

What this short example shows is that multi-threaded programs can produce different results based on the order in which the code executes on different threads. This unpredictability is bad news for programmers, since it makes the code much harder to understand. To quote Edward A. Lee’s “The Problem with Threads”:

a folk definition of insanity is to do the same thing over and over again and expect the results to be different. By this definition, we in fact require the programmers of multi-threaded systems be insane. Were they sane, they could not understand their programs.

When the order of execution of different threads causes the results of the program to change, we call that a “race condition”, meaning that the program’s results differ depending on whether one thread reaches a certain point of execution before another. If we want to resolve a race condition, we can identify synchronization points in the code, which can be implemented using an operating system synchronization object like a mutex, a semaphore, a condition variable, a critical section, an interlocked (atomic) operation, and more.
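As a sketch of what synchronization looks like in practice, here is a POSIX mutex protecting a shared counter; without the lock, two threads incrementing concurrently could lose updates (this assumes a POSIX platform):

```c
#include <pthread.h>

long counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

// Each thread performs 100000 increments. The mutex makes each
// read-modify-write of `counter` atomic with respect to the other thread.
void* increment_worker(void* arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                       // the critical section
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}
```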

The synchronization objects I just described are very interesting to study and extremely useful, but they also all have relatively high execution costs. Instead, we should aim to avoid needing synchronization at all, and this can be achieved if we design algorithms that avoid race conditions entirely. To do this, we first need to know how to formally identify race conditions.

Identifying Race Conditions

As we saw in the previous section, running two functions in parallel may result in race conditions, which can lead to programs with indeterminate results. There are two main culprits behind such race conditions:

  • Shared memory
  • Side-effects

Shared memory was a problem in the previous example, since the race condition we observed was partly caused by the fact that the memory for the variable “a” was shared between the two threads. If the code was re-designed to make the variable “a” only exist in one thread (as opposed to being shared between the two threads), then it would be impossible for the variable to change due to the execution of a different thread, since those other threads simply don’t have access to it.

Along with shared memory, the second cause of race conditions in the example we saw was that each thread had side-effects. For example, consider the following program:

/* Thread 1 */           |  /* Thread 2 */
b = a;                   |  c = a;
b += 1;                  |  c -= 1;

In this program, both threads have access to the variable “a” through shared memory. However, unlike the earlier program with the race condition, the two threads only read the value of “a” and never write to it. Since the shared memory is now read-only, there is no race condition: regardless of the order in which the lines of code of the two threads above are executed, the final values of “a”, “b”, and “c” are always the same.

Based on these observations, we can make a powerful statement: Threads may share memory without race conditions and without synchronization if the memory they share is read-only.

The fact that memory can be safely shared between threads if it is read-only is very interesting from the perspective of a programming language designer. For example, the Haskell programming language is designed so every variable is read-only after being written for the first time. Since this prevents functions from having side-effects, memory can be shared arbitrarily and Haskell programs can thus be highly automatically parallelized. Haskell has other problems that make it arguably hard to use in real-life code, but it’s interesting and eye-opening to study for programmers of all walks of life.

Despite the elegance of languages like Haskell, our day-to-day life using programming languages like C requires us to both read and write shared memory. If we want to avoid race conditions in our programs, then we should ensure that these reads and writes happen in a consistent order. To do this, we must identify and handle the data hazards that occur when reads and writes are made to shared memory. For example, if one thread writes to shared memory and another thread then reads that memory, the subsequent read should see the value that was just written. In other words, before reading from memory after a write, the reader must wait for the previous writer to finish writing. If the reader doesn’t wait for the writer to finish, it might read the previous value of that memory (as opposed to the new one it should have read). Even worse, the reader might read data that is a Frankenstein-like mix of the previous value and the new value, also known as a “torn read”.

Since data hazards are concerned with the ordering of reads and writes, we can exhaustively analyze every possible such scenario:

  1. Read-After-Read (RAR)
  2. Read-After-Write (RAW)
  3. Write-After-Read (WAR)
  4. Write-After-Write (WAW)

The case of Read-After-Read is the one scenario where there is no data hazard. As described previously, there are no race conditions when shared memory is only read and never written.

In the case of Read-After-Write, it’s important for the reader to wait for the writer to finish, otherwise the reader might not see the updated value, or might get a torn read, as was explained earlier.

In the case of Write-After-Read, it’s important for the writer to wait for previous readers to finish, since otherwise the readers might see the new value instead of the old value they were supposed to see. If the writer doesn’t wait for readers to finish, it’s also possible for the writer to cause a torn read by overwriting a value while a reader is only halfway done reading it.

Finally, in the case of Write-After-Write, it’s important that the final value of the data reflects the second write instead of the first write. If this relationship isn’t enforced, successive reads will see the old value (from the first write) instead of the new value (from the second write). Therefore, a writer should wait for previous writers to finish before doing their own write.
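As a reference point, all four orderings are easy to see within a single thread, where the compiler and processor preserve them for us automatically; they only become our problem when the reads and writes come from different threads. A minimal single-threaded sketch:

```c
// Each comment names the hazard that must be respected for this
// single-threaded code to behave as written.
int hazards_demo(void)
{
    int a = 1;       // the initial write to a
    int b = a + 1;   // Read-After-Write: this read must see a == 1
    a = 5;           // Write-After-Read: must not happen before b's read above
    a = 7;           // Write-After-Write: the final value must be this second write
    return a + b;    // 7 + 2
}
```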

Now that we’ve identified all possible data hazards that can cause race conditions, we can more formally identify race conditions in multi-threaded code. Awareness of these data hazards will allow us to design algorithms that avoid them when possible. Furthermore, when we can’t avoid a data hazard, we’ll know that we need to use some kind of synchronization to make sure reads and writes happen in the right order.

Designing Data-Parallel Algorithms

With the knowledge of data hazards, we can now think about designing algorithms that avoid them. This section reviews a variety of methods useful for designing data-parallel algorithms.

Independent Uniform Work

As an initial example, let’s consider the following algorithm:

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] = data[i] * 2;
}

Running this code in parallel is fairly easy. Each iteration of the loop is independent from the other iterations, so we can simply distribute the “n” iterations equally among the hardware’s available resources, and that should give ideal hardware usage. This is possible because no memory is shared between the different iterations of the loop, which means that no synchronization is necessary and that the code can be trivially parallelized. We can parallelize the above code among the CPU’s cores trivially with an API like OpenMP, as follows:

void times_two(float* data, int n)
{
    // Runs the for loop below in parallel, distributed among CPU cores.
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] = data[i] * 2;
}

By the way, OpenMP is a C/C++ compiler extension that allows you to automatically parallelize code. OpenMP is built into most C++ compilers these days, including GCC, clang, and Visual C++. You can enable OpenMP by simply passing a compiler flag on the command line, or by setting a project property in Visual Studio, so you should feel free to experiment with it.
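For example, assuming the code above lives in a file called times_two.cpp (a name chosen purely for illustration), enabling OpenMP really is just one flag:

```shell
# GCC or Clang: enable OpenMP with a single flag
g++ -O2 -fopenmp times_two.cpp -o times_two

# MSVC (from a developer command prompt)
cl /O2 /openmp times_two.cpp
```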

Non-Independent Uniform Work

Let’s consider a slightly more challenging example: An algorithm that “blurs” an array. As a reminder, a basic box blur (e.g. on an image) is implemented by computing the average of a pixel and its neighbors. In this case, we’ll take the average of a float and its 2 neighbors in the 1D array (the float before and the float after).

If we wanted to blur a single float in the data array, we might do something as follows (ignoring bounds checking):

    data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;

The code above cannot simply be used as the body of a loop over the whole array, because the iterations are not independent from each other. The order in which the iterations are executed will change the output of the algorithm, since the “data[i]” written in one iteration will be read by its neighbors, which will change the result of its neighbors’ computations. If we write to data[i] only after the reads from data[] have been done for all iterations, then the answer will be correct, since the writes will not interfere with the reads. We could see this problem as a Write-After-Read data hazard, since we need to wait for all the reads from data[] to be complete before we do any writes to it.

The Write-After-Read data hazard happens in this example because we’re trying to read and write to the same memory location in a single step. To remove the data hazard, we can use an additional memory allocation. By creating a separate array to store the results, there are no more data hazards. The input array is read-only, and the output[i] is not shared between iterations, so the code can now be easily distributed among available hardware resources in parallel.

Note 1: It is assumed that the “input” and “output” arrays are not overlapping in memory. For example, if both pointers pointed to the same memory, we would have the exact same problem as before.

Note 2: For readability, the bounds-checking for the beginning and end of the array is omitted. If necessary, this could be fixed by handling the first and last iteration as a special case, or by making the data[] array bigger by adding one element before the beginning and one after the end.

void blur(const float* input, float* output, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        output[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0f;
    }
}

To take the blur example one step further, let’s consider how we can make a “blurrier” result by blurring the results of the blur. In other words, we want to run the blur algorithm multiple times on the array in a loop. We can easily implement this using the blur function defined previously, by alternating between which buffer is used as input and which buffer is used as output. This could be done as follows:

void blur_4_times(float* input, int n)
{
    // note: assuming borders are handled properly inside blur()
    float* tmp = new float[n];

    float* read_buffer = input;
    float* write_buffer = tmp;

    for (int blur_i = 0; blur_i < 4; blur_i++) {
        blur(read_buffer, write_buffer, n);
        std::swap(read_buffer, write_buffer);
    }

    delete[] tmp;
}

This code alternates between reading and writing between two buffers. After each iteration, the buffer being read from and the buffer being written to are swapped. This is known as “ping-ponging”, to mean that the direction of the copying between the two buffers is alternating like how a ping-pong ball alternates between players on a ping-pong table. 🙂


In the next example, we’ll be looking at how to implement a reduction. A reduction is an operation that turns a set of values into a single value, like for example computing the minimum element in an array. Computing the minimum of an array sequentially without any multi-threading is pretty darn easy. For reference, it might look something like this:

float minimum(const float* data, int n)
{
    // result is undefined if the array passed in is empty
    assert(n >= 1);

    float curr_minimum = data[0];
    for (int i = 1; i < n; i++) {
        if (data[i] < curr_minimum) {
            curr_minimum = data[i];
        }
    }

    return curr_minimum;
}

In the algorithm above, “curr_minimum” is potentially read and written by every iteration of the loop, meaning there exists a data hazard between each iteration of the loop. This data hazard prevents us from safely running this loop in parallel. In order to find the minimum in parallel, we can implement this algorithm in multiple passes. The big idea is to split the array into many smaller arrays, then find the minimum of each of these smaller arrays in parallel. Once the minimums of the smaller arrays are computed, we can find the minimum of these minimums. This subdivision of the work can be applied repeatedly in order to implement a highly parallel minimum algorithm, as shown below.

float minimum(const float* data, int n)
{
    assert(n >= 1);
    assert(is_power_of_two(n)); // for simplicity

    float* mins = new float[n];
    memcpy(mins, data, sizeof(float) * n);

    int curr_n = n;

    while (curr_n > 1) {
        // Pair element i with element i + curr_n / 2, so the writes to
        // mins[0 .. curr_n/2) never overlap the reads from the upper half,
        // and the iterations stay free of data hazards.
#pragma omp parallel for
        for (int i = 0; i < curr_n / 2; i++)
            mins[i] = std::min(mins[i], mins[i + curr_n / 2]);

        curr_n = curr_n / 2;
    }

    float result = mins[0];

    delete[] mins;

    return result;
}

In the example above, the computation shrinks by half with each pass, since each pass computes the minimum for a pair of inputs from the previous iteration. The number of passes required can be lowered if each iteration of the inner loop computes the minimum of more than 2 values, so you can make a tradeoff between the number of passes and the amount of serial work per pass. This is something you need to tweak based on the hardware you’re working with and the size of your input.

Conclusion/To Be Continued

In this post, the motivation for designing data-parallel algorithms was established, based on the idea that we need to write software that can scale in performance as more hardware (cores, ALUs, etc) is added to processors. From there, we explored the concept of data hazards, showing how they are an obstacle to the implementation of data-parallel algorithms. Finally, a few common ways to implement data-parallel algorithms have been introduced.

In the next post, I’ll talk about how hardware (CPUs, GPUs, etc) can be designed to enable the rapid execution of data-parallel algorithms. Once we understand how data-parallel programs work from the conceptual side and from the hardware side, we’ll be able to talk about designing programming languages that allow efficient implementation of data-parallel algorithms based on principles that can be applied to a wide variety of hardware.

A Task Is Like A Function

In order to effectively use a multi-core CPU, a decomposition of code into tasks must be made. By decomposing code into tasks, the tasks can be distributed among CPU cores by a scheduler. Tasks can have dependencies on other tasks, meaning that the tasks execute in a partial order. This partial order can be represented as a graph, where each task has predecessors and successors. A task can only execute when all its predecessors have completed.

To implement this, we can explicitly create a graph of tasks, with edges between tasks to identify predecessor/successor relationships. The problem with a general task graph system is that it’s not productive to work with. Maybe it’s okay for some purposes to have a GUI interface to build a task graph, but that’s too tedious for everyday code. Imagine if you had to explicitly create nodes and edges when implementing a single-threaded function call graph, it would be so tedious! So, how can a productive task graph interface be built?

Languages like C/C++ were originally designed for function decomposition. That means you split your code into functions, and call functions from other functions. eg:

void a() { /* ... */ }
void b() { a(); }
void main() { b(); }

This is the natural way to write C/C++ code, and this is how single-threaded game engines are written. In the task-parallel world, we want to enhance function decomposition by adding task decomposition. In other words, we structure the code as tasks that spawn other tasks.

There are benefits of matching task decomposition to function decomposition:

  • It matches the natural code structure of C/C++ (tasks spawn tasks vs functions call functions)
  • It provides a clear upgrade path for single-threaded code (child functions become child tasks)
  • We can leverage C/C++ programmers’ built-up intuition on how to write code.

So, what’s the difference between tasks and functions?

  • Functions are called, then return. Tasks are spawned, then complete.
  • Tasks are a superset of functions. A function is a task that is spawned and then immediately waited on for completion.
  • Task spawning is more relaxed than a function call.
    • A spawned task may not necessarily begin until its completion is explicitly waited on.
    • Note: “enqueued” tasks (vs. “spawned”) don’t need an explicit wait, but still start asynchronously.
  • Task completion is more relaxed than a function return.
    • A task may not necessarily complete until its completion is explicitly waited on.

The relaxations of tasks allow them to run in parallel, but this relaxation must be constrained to enforce a partial order on execution. The requirement of partial order comes in 3 main forms, each more general than the last:

  • Task Path: 1 predecessor, 1 successor.
  • Task Tree: N predecessors, 1 successor.
  • Task Directed Acyclic Graph (DAG): N predecessors, N successors.

The Task Path can be implemented as an ordinary sequence of function calls. This may also be integrated in the task system as a sequence of continuation tasks, where each task defines its own continuation.

The Task Tree is implemented as a task that spawns a group of child tasks, then waits for them to complete. This is good for recursive divide-and-conquer algorithms, allowing tasks to be created dynamically based on the distribution of work.

The Task DAG is implemented by spawning tasks only when all their predecessors have completed. This can be implemented by storing a counter of incomplete predecessors in the successor task. When a predecessor task completes, it decrements the counter in each successor. The predecessor that decrements the counter to zero then spawns the successor.

These three forms of task relationships can be implemented using Intel TBB’s task system.

Note: TBB’s task-parallelism enables multi-core scaling through a work-stealing scheduler. To better understand this, see:

The Task Path is implemented either as an ordinary sequence of function calls (independent of TBB), or using continuation tasks.

The Task Tree is implemented by spawning child tasks then waiting for them to finish. It’s the poster-child use case of TBB tasks.

The Task DAG is implemented by reusing the reference count built into TBB tasks as the number of incomplete predecessors. The predecessor tasks must decrement and check the reference count and possibly spawn each successor, as explained previously.

It seems like we have all the building blocks to build task graphs with ordinary TBB tasks, but the API is a bit too low level. It may be possible to build a simpler interface on top of TBB, that would also simplify common use cases. I would like this API to look closely like an ordinary call graph, for the reasons stated at the start of this writeup. Designing this API is a topic for another post.

PS: TBB is designed for non-blocking tasks, meaning it is not suitable for performing I/O or waiting for asynchronous operating system events. With Intel TBB’s task scheduler, a blocking task results in an idle core in the best case, and a deadlock in the worst case. To handle asynchronous I/O and events, I would recommend using the Windows Threadpool task system, since its integration in the operating system helps it make intelligent scheduling decisions by knowing about the state of I/O operations and events. TBB tasks can be executed asynchronously in response to OS events using enqueued tasks.

High DPI Rendering

In this article I’ll explain how to handle “High-DPI” in games, specifically referring to the relevant Windows APIs, some Apple APIs, and the SDL library. I’ll also explain what it means to write a “DPI aware” app in layperson’s terms, which might be helpful regardless of whether you’re using SDL or not. Finally, I’ll show how you can use DPI scale in typical graphics applications, and briefly look at some advanced DXGI features for compositing scaled layers of rendering.


With the increase of resolution of computer monitors, we’re seeing more and more so-called “high dpi” or “retina” displays. What this means is that the monitor has a high number of “dots per inch”, where “dot” effectively means pixel. For example, the screen of a Microsoft Surface Pro 4 is probably around the same size as an older laptop you’ve owned (12.3 inches), yet it has a resolution of 2736 by 1824, which is probably much more than that older laptop.

In other words, the number of pixels has increased, while the dimensions of the monitor have stayed the same. What does this mean for graphics development? Well, it means that your visuals have shrunk considerably. You might be assuming 96 pixels per inch, which is Windows’ default, meaning that 640 pixels are about 6.7 inches wide. On the other hand, with 267 pixels per inch (on Surface Pro 4), your 640 pixels are about 2.4 inches wide, and the text in your UI has probably become small enough to cause eyestrain just from reading it.

So, how do we address this issue? The answer is so-called “DPI awareness”. However, to understand what that means, we have to realize that the word “DPI” has become unfortunately overloaded, leading to much confusion.

High DPI: The Truths and the Lies

Let me give it to you straight: When people say “DPI”, they might mean one of two things:

  1. The number of pixels that physically exist per inch on your monitor.
  2. The display zoom set in the Windows control panel.

When it comes to software development, you really only have to care about the second meaning, which is mostly unrelated to the density of pixels on your specific monitor. You can tweak this display zoom by opening the control panel’s “Display Settings”, and playing with the slider shown below:


The purpose of this slider is to zoom in all apps on your computer, to make it easier to read text and click on buttons by countering the shrinking effect described earlier in this article. Now here’s the confusing part: This display zoom is a percentage, not a number of dots per inch. Again, this zoom factor isn’t about the number of pixels on your screen, it’s simply a display zoom as a percentage, and nothing more.

Reading and Interpreting the Display Zoom

Despite “DPI” being just a zoom factor, if you ask Windows for the DPI using eg. GetDpiForWindow(), you’ll get a number in dots per inch in return. You can get the display zoom that corresponds to this number by dividing it by “96”, which is the default DPI in Windows. For example, if I set the display zoom in control panel to 200%, then GetDpiForWindow() will return 192. If I divide 192 by 96, that gives me “2.0”, which is the display zoom (corresponding to 200%). As a result, I know that I should scale my UI by 2x.

Your App’s “DPI Awareness”

Every app in Windows has a DPI awareness level. This was done for backwards compatibility for apps that were written before DPI was considered. If an app is not DPI aware, that means Windows will automatically scale up the app, which it does by effectively lying to your app about its size in pixels on the display. That means you might ask for a window that is 1000 pixels wide, but get one that is actually 2000 pixels wide. Your app will happily render at 1000 pixels wide, then Windows will automatically scale up your rendering to 2000 pixels.

On the other hand, if your app is DPI-aware, Windows will not lie to it about its size in pixels. If Windows says your app is 1000 pixels wide, then it’s 1000 pixels wide, period. It’s business as usual, with the only added detail that you should be trying to scale up your rendering to match the display zoom that the user requested in the control panel.

To be fully DPI aware, you should also act in response to DPI changes at runtime, which might happen if the user tweaks the slider in control panel while your program is running, or if the user drags your window from one monitor to another monitor that has a different display zoom setting.

Handling DPI in SDL

There are two main things you need to do to handle DPI in SDL. First, you should mark your app as “DPI aware”, and second, you should read the current display zoom and interpret it in your renderer.

Setting DPI Awareness

There are 2 main ways to set DPI awareness: Using the manifest/Info.plist (for Windows and Apple, respectively), or by setting DPI awareness programmatically.

Setting the DPI awareness through a Visual Studio project is extremely easy. Open your project properties by right-clicking on your project in your solution and picking “properties”, then go to the menu shown below and set the “DPI Awareness” flag.


For Apple devices, you set the NSHighResolutionCapable flag in your app’s Info.plist file, as shown below. (Image from Apple’s documentation on the topic.)

If you want to set the DPI awareness programmatically, I have some bad news for you. It’s currently possible to set DPI awareness programmatically on Windows, but not for Apple. If that’s not a deal breaker for you, you can set DPI awareness using “SetProcessDpiAwareness()”, which you can call by including the header <ShellScalingAPI.h> and linking “Shcore.lib” to your project. Note that this call must be the first Window management-related call in your program, so you should probably call this at the very top of your main. The following code snippet might be helpful. You can wrap it in a function in a separate file if you want to contain the symbol pollution caused by Windows.h.

#include <Windows.h>
#include <ShellScalingAPI.h>
#include <comdef.h>
#include <SDL.h>

// SDL wants main to have this exact signature
extern "C" int main(int argc, char* argv[])
{
    // Must come before any other window management calls.
    HRESULT hr = SetProcessDpiAwareness(PROCESS_PER_MONITOR_DPI_AWARE);
    if (FAILED(hr))
    {
        _com_error err(hr);
        fwprintf(stderr, L"SetProcessDpiAwareness: %s\n", err.ErrorMessage());
    }

    if (SDL_Init(SDL_INIT_VIDEO) != 0)
    {
        fprintf(stderr, "SDL_Init: %s\n", SDL_GetError());
        return 1;
    }

    // TODO: Create your window and stuff

    return 0;
}

Furthermore, if you’re using SDL, you should pass the SDL_WINDOW_ALLOW_HIGHDPI flag to SDL_CreateWindow, which lets you use SDL’s APIs for managing DPI. (See documentation for SDL_CreateWindow).

It’s kind of unfortunate that DPI awareness can’t be set programmatically on Apple devices, since it makes it harder to write cross-platform apps with SDL, requiring instead that custom manifests are built per platform. You should probably create proper per-platform manifests if you’re shipping a production app, but this extra work hurts productivity for small cross-platform projects. If somebody from Apple is reading this, it would be sweet if you could do something about it. 😉

Reading the Display Zoom

SDL has a function to read the current display DPI (which again, is the display zoom and nothing more.) This function is SDL_GetDisplayDPI. It reports a horizontal, vertical, and diagonal DPI. Windows guarantees the horizontal and vertical DPI are always the same, so you can basically ignore everything but the horizontal DPI.

There’s still something missing though. Again, we don’t actually want the DPI itself, we want the display zoom as a percentage (or a ratio). To do this, we have to get the DPI using SDL, then divide it by the default DPI. Unfortunately, SDL has no function to get the default DPI, and the default DPI is actually different depending on your OS. Windows’ default DPI is 96, and Apple’s default DPI is 72. For this reason, I wrap SDL_GetDisplayDPI in my own function that also returns the default DPI. Also, SDL_GetDisplayDPI can fail if DPI information is not available, so I handle that by returning the default DPI. This is implemented as follows:

void MySDL_GetDisplayDPI(int displayIndex, float* dpi, float* defaultDpi)
{
    const float kSysDefaultDpi =
#ifdef __APPLE__
        72.0f;
#elif defined(_WIN32)
        96.0f;
#else
        static_assert(false, "No system default DPI set for this platform.");
#endif

    if (SDL_GetDisplayDPI(displayIndex, NULL, dpi, NULL) != 0)
    {
        // Failed to get DPI, so just return the default value.
        if (dpi) *dpi = kSysDefaultDpi;
    }

    if (defaultDpi) *defaultDpi = kSysDefaultDpi;
}

Display Zoom for Window Creation

If you ask DPI-aware SDL to create a window that is 640 by 480, you’ll still get a tiny window as I explained at the start of the article. Maybe that’s what you want, since it’s actually doing exactly what you asked. On the other hand, I personally like having an interface where I can ask for a 640 by 480 window, and it gives me a window that has the same size in inches that a 640 by 480 window would have on a screen with a default DPI.

This is mainly just convenient for me, since I’ve built up certain expectations about how big a 640 by 480 or a 1280 by 720 window should be on my screen. If you want to do this, you can get the display zoom of the monitor your window is on, then scale up your window’s size based on the monitor’s display zoom. If you assume the window will be created on the primary monitor, you might use something like the following to create your window:

// Scale window according to DPI zoom
int windowDpiScaledWidth, windowDpiScaledHeight;
{
    int windowDpiUnscaledWidth = 640;
    int windowDpiUnscaledHeight = 480;

    float dpi, defaultDpi;
    MySDL_GetDisplayDPI(0, &dpi, &defaultDpi);

    windowDpiScaledWidth = int(windowDpiUnscaledWidth * dpi / defaultDpi);
    windowDpiScaledHeight = int(windowDpiUnscaledHeight * dpi / defaultDpi);
}

SDL_Window* window = SDL_CreateWindow(
    "dpi test",
    SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED,
    windowDpiScaledWidth, windowDpiScaledHeight,
    SDL_WINDOW_ALLOW_HIGHDPI);
By the way, allow me to repeat that the DPI might change at runtime if the user plays with the display zoom in the control panel, or if the user drags the window from one monitor to another monitor that has a different display zoom setting. For this reason, you might want to poll the DisplayDPI at every frame and resize your window in response. This is kind of difficult to do properly though, since switching the size of the window might make it suddenly pop onto another monitor and cause some kind of infinite loop, who knows.

The “proper” way to handle DPI changes at runtime is by responding to the Windows event WM_DPICHANGED, which gives you both the new DPI and a suggested new position and size for the window. Last I checked, SDL doesn’t handle WM_DPICHANGED, which is unfortunate, and would be a good thing to patch in.

Using Display Zoom in a Renderer

After marking your app as DPI aware, you need to scale up your rendering according to the display zoom. This section explains how to apply this display zoom to typical graphics scenarios.

Display Zoom for UI

If you have some kind of UI in your app, you probably want to scale that up based on the display zoom, since the user will have trouble reading any text otherwise. You might also want to create multiple versions of the artwork in your UI at different scales, to avoid the artifacts that come from automatic scaling algorithms. The goal here is to give the UI a sharp and crisp look at any display zoom, since blurry UIs tend to look pretty bad.

If your UI is designed to have a “pixely” look, like for example ImGui, then scaling will generally look bad unless you round to an integer scale, so keep that in mind as you decide how to interpret the display zoom.

On the topic of ImGui, it sorta has DPI awareness built-in through the “io.DisplayFramebufferScale” value. This value isn’t actually interpreted at all by ImGui, it’s basically just a way for you to pass your desired DPI scale from your window management code to your rendering code. ImGui itself (and its mouse handling code) assumes it lives in a world where scaling doesn’t exist, so even if you scale up the UI in your rendering code, you still need to “unscale” mouse coordinates before passing them to ImGui. Therefore, your code to pass mouse coordinates to ImGui might look as follows:

io.MousePos.x = actualMouseX;
io.MousePos.y = actualMouseY;
// Unscale from actual pixel coordinates to ImGui's scale-free coordinates
io.MousePos.x /= io.DisplayFramebufferScale.x;
io.MousePos.y /= io.DisplayFramebufferScale.y;

Display Zoom for Scene Rendering

For the best look possible, you can ignore that the monitor has a high DPI and just render your scene at the same pixel resolution as the window it’s being displayed in. This only works if your app is DPI aware, since otherwise the OS will be automatically scaling up your app and you’ll see scaling artifacts anyways. If rendering at a 1-to-1 resolution is too expensive, as is likely with any heavy rendering work on a 4K display, you might want to render your scene at a low resolution and do the scaling yourself. This might improve your frame rate, or save battery life. You might do this framebuffer scaling using glBlitFramebuffer, as shown in a side-tip in my article on Reversed-Z in OpenGL.

While it can be important to lower the number of pixels rendered on high DPI displays for performance reasons, recall that’s not what the “display zoom” slider is about. The display zoom slider is supposed to scale the visuals in the program to make them easier to see. If your program responds to display zoom changes by increasing the size of its window, that’s enough to cause your scene to occupy a greater number of square inches on the user’s screen, so mission accomplished. This is independent from whether or not you decide to do your rendering at a lower resolution than the number of pixels on the screen. If you want to scale your rendering without resizing the window, then you might want a camera zoom feature in your renderer that is independent of the display zoom slider. You might think of ways to incorporate the display zoom into your 3D camera’s logic, but that’s a topic of user interface design, and should probably be handled on a case-by-case basis.

Composing Scaled Layers in Hardware

Direct3D 11.2 allows you to compose layers at different resolutions in hardware, as documented here: Swap chain scaling and overlays (MSDN). For example, you can render a high resolution UI and a low resolution scene in separate layers, then rely on the hardware to compose these two layers together rather than doing it yourself. This might be a more efficient way to handle DPI scaling in a rendering engine, although it relies on hardware features.

I’m not sure if there’s a way to use hardware overlays in OpenGL, at least on desktop. wgl has functions for overlay layers, but they’re old and busted and have been deprecated for a long time. This might be another good reason to use WGL_NV_DX_interop2 to do your OpenGL rendering within a DXGI swap chain on Windows, since that allows you to use modern Windows presentation features without having to rewrite your OpenGL code. Although that extension is marked “NV”, it is now supported by all desktop GPU vendors.

Unfortunately, it’s not clear how to use WGL_NV_DX_interop2 properly from its documentation. I have some quick tests here if you’re interested: OpenGL-on-DXGI (GitHub). If you’re part of Microsoft or Khronos or one of the IHVs, your help might be needed here. 🙂


Hopefully now you know more about writing DPI aware apps, especially using Windows and/or SDL. Having a sharp resolution on high DPI screens tends to look really good, so I hope this article helps get you there. Thank you for reading!

Reversed-Z in OpenGL

Maybe you’ve heard of Reversed-Z:

It’s a pretty good way to get more precision out of your depth buffer. Very useful for games with long view distances, like for example Just Cause 2 (as shown in a link above).


Reversed-Z visualized: Near objects have a higher Z value than far objects.

So, how to use it in OpenGL? Here’s a no-nonsense step-by-step guide.

Step One: Set Clip Space Conventions

Reversed-Z is designed for clip-space Z values in the range [0,1], not [-1,+1]. OpenGL’s default convention is [-1,+1], but you can override that using glClipControl:

glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);
I recommend sticking this line of code at the start of your program, and never changing the clip conventions after that. Give up entirely on the [-1,+1] convention; it’s hands-down objectively worse than the [0,1] convention when it comes to precision. It’s a good decision even if you’re not using Reversed-Z. Anyways, when you sample from a depth texture in OpenGL, you already get a value between 0 and 1… so switching to the [0,1] convention for clip coordinates will make everything more consistent.

glClipControl is an OpenGL 4.5 feature. If you don’t have OpenGL 4.5, it might still be available as an extension (see: hardware supporting GL_ARB_clip_control). Therefore, you could use something like the following code snippet:

GLint major, minor;
glGetIntegerv(GL_MAJOR_VERSION, &major);
glGetIntegerv(GL_MINOR_VERSION, &minor);
if ((major > 4 || (major == 4 && minor >= 5)) ||
    SDL_GL_ExtensionSupported("GL_ARB_clip_control"))
{
    glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);
}
else
{
    fprintf(stderr, "glClipControl required, sorry.\n");
    exit(1);
}

Step Two: Create a Floating Point Depth Buffer

The whole Reversed-Z thing is designed for floating point depth, not fixed point. That means you should be using a floating point depth buffer.

You can follow the FBO setup code below as an example:

int width = 640, height = 480;
GLuint color, depth, fbo;

glGenTextures(1, &color);
glBindTexture(GL_TEXTURE_2D, color);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_SRGB8_ALPHA8, width, height);
glBindTexture(GL_TEXTURE_2D, 0);

glGenTextures(1, &depth);
glBindTexture(GL_TEXTURE_2D, depth);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT32F, width, height);
glBindTexture(GL_TEXTURE_2D, 0);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, color, 0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depth, 0);
GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);
if (status != GL_FRAMEBUFFER_COMPLETE)
    fprintf(stderr, "glCheckFramebufferStatus: %x\n", status);
glBindFramebuffer(GL_FRAMEBUFFER, 0);

Side tip: Copying a Framebuffer to a Window

If you didn’t know, you don’t need to create a depth buffer when you create your window. You can render your scene (and depth buffer) into an offscreen FBO, then copy to your window only at the end. For example:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// TODO: Render scene
glBindFramebuffer(GL_FRAMEBUFFER, 0);

glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0); // default FBO
glBlitFramebuffer(
    0, 0, fboWidth, fboHeight,
    0, 0, windowWidth, windowHeight,
    GL_COLOR_BUFFER_BIT, GL_LINEAR);
glBindFramebuffer(GL_FRAMEBUFFER, 0);

With this approach, you can also be rendering your scene at a lower resolution than your window, since the call to glBlitFramebuffer will scale up your rendering. You might find this useful if you’re on a 4K display, but still want to render at a lower resolution for improved frame rate or battery power savings. This also saves you the trouble of creating a window in SRGB mode, which takes away further complexity from your window management code.

You can also use glBlitFramebuffer to convert a multi-sampled framebuffer into a single-sampled framebuffer. However, you’re not allowed to do scaling simultaneously with resolving multi-samples in a single call to glBlitFramebuffer, so you might have to call glBlitFramebuffer twice: once to resolve the multisamples, and once more to copy and scale up to your window. At this point, it might become interesting to do all this through a single fragment shader instead. Alternatively, you can simultaneously resolve and scale if you use the GL_EXT_framebuffer_multisample_blit_scaled extension (see: hardware supporting GL_EXT_framebuffer_multisample_blit_scaled).

Anyways, back to the whole Reversed-Z thing…

Step Three: Clear your Depth Buffer to Zero

With a normal depth test, you might have been clearing the depth buffer to 1, since that was the “far” value. When you call glClear(GL_DEPTH_BUFFER_BIT), it clears the depth buffer using the last set value of glClearDepth, which is 1 by default. On the other hand, with Reversed-Z, the “far” value is now 0, so you have to use the proper clear depth value:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glClearDepth(0.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glBindFramebuffer(GL_FRAMEBUFFER, 0);

Step Four: Flip your Depth Comparison to GREATER

As implied by its name, using Reversed-Z means that far depth values are represented by smaller numbers. That means you need to switch your glDepthFunc from GL_LESS to GL_GREATER, as shown in the code below. In this example code, I also reset the comparison and depth state back to OpenGL defaults, so the state doesn’t leak into other code that might not be doing Reversed-Z, or that might not be using depth testing.

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_GREATER);

// TODO: Draw your scene

glDepthFunc(GL_LESS);
glDisable(GL_DEPTH_TEST);

Step Five: Update your Projection Matrix

Reversed-Z requires a slightly different projection matrix than what (for example) gluPerspective creates. What follows is some code you can use to create this new projection matrix. It uses the “right-handed” coordinate system convention, meaning that the Z axis points out of your computer screen. Alternatively, that means the objects in front of your camera have a negative Z value in view space. This is the OpenGL convention.

Note that this matrix doesn’t only reverse the Z, it also sets the far plane to infinity, which works well for extremely large view distances. You can find the derivation of this matrix in the following article:

glm::mat4 MakeInfReversedZProjRH(float fovY_radians, float aspectWbyH, float zNear)
{
    float f = 1.0f / tan(fovY_radians / 2.0f);
    return glm::mat4(
        f / aspectWbyH, 0.0f,  0.0f,  0.0f,
                  0.0f,    f,  0.0f,  0.0f,
                  0.0f, 0.0f,  0.0f, -1.0f,
                  0.0f, 0.0f, zNear,  0.0f);
}

You can call this code as follows:

glm::mat4 proj = MakeInfReversedZProjRH(glm::radians(70.0f), (float)width/height, 0.01f);

glUniformMatrix4fv(PROJECTION_MATRIX_LOCATION, 1, GL_FALSE, value_ptr(proj));
// TODO: glDraw*()

Note that I’m using the OpenGL Mathematics library “glm”, which follows OpenGL’s column-major convention for the arguments of its constructor. That means the first 4 inputs to the constructor are actually the first column of the matrix. This might be important if you’re translating to a different matrix library.
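To make the layout concrete, here’s a sketch of the same matrix written out as a plain column-major float array (the function name and array form are mine, not from the project). Element (row r, column c) lives at index c*4 + r, which is exactly the order glUniformMatrix4fv consumes when transpose is GL_FALSE:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical plain-array version of the same matrix, to make the memory
// layout explicit. OpenGL consumes matrices as 16 floats in column-major
// order: element (row r, column c) lives at index c*4 + r.
void MakeInfReversedZProjRH_ColumnMajor(float fovY_radians, float aspectWbyH,
                                        float zNear, float m[16])
{
    float f = 1.0f / std::tan(fovY_radians / 2.0f);
    for (int i = 0; i < 16; i++) m[i] = 0.0f;
    m[0]  = f / aspectWbyH; // row 0, column 0
    m[5]  = f;              // row 1, column 1
    m[11] = -1.0f;          // row 3, column 2
    m[14] = zNear;          // row 2, column 3
}
```

Note how m[14] (row 2 of the last column) holds zNear: that corresponds to the fourth group of four arguments in the glm constructor above.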

That’s it?

That’s it!

In Summary

  1. Set your depth clip conventions to [0,1] using glClipControl.
  2. Create and use a floating point depth buffer.
  3. Clear your depth buffer to 0 instead of the default of 1.
  4. Use GL_GREATER instead of GL_LESS for your depth test.
  5. Use a projection matrix that flips the depth.

What about DirectX?

There’s almost no difference in implementing this in DirectX. First, you don’t need glClipControl because the [0,1] convention is already the default in DirectX. Second, if you’re following DirectX conventions and using a “left-handed” convention for the view, then just turn the “-1” into “+1” in the projection matrix. You also need to do this in OpenGL if you’re using left-handed conventions, meaning that the camera’s Z axis goes into your screen, or equivalently that you consider objects to be in front of the camera if their Z value in view space is positive.
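As a sketch of that sign flip (plain column-major array form, names mine): in the left-handed version only the w-row entry changes from -1 to +1, so clip-space w becomes +z instead of -z, and depth still lands at 1 at the near plane and approaches 0 at infinity:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical left-handed variant of the projection, written as a plain
// column-major float[16]: the only change from the right-handed version
// is m[11] = +1 instead of -1.
void MakeInfReversedZProjLH(float fovY_radians, float aspectWbyH,
                            float zNear, float m[16])
{
    float f = 1.0f / std::tan(fovY_radians / 2.0f);
    for (int i = 0; i < 16; i++) m[i] = 0.0f;
    m[0]  = f / aspectWbyH;
    m[5]  = f;
    m[11] = 1.0f;  // the right-handed version has -1.0f here
    m[14] = zNear;
}

// Depth after the perspective divide for a point at view-space z
// (left-handed: positive z is in front of the camera): zNear / z.
float DepthAt(const float m[16], float z)
{
    float zClip = m[10] * z + m[14];
    float wClip = m[11] * z + m[15];
    return zClip / wClip;
}
```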

OpenGL Renderer Design

Lately I’ve been writing lots of OpenGL programs for course projects and for-fun rendering side-projects. I’ve started to notice an overall design that works pretty well for these kinds of projects, and I’m going to document it here.

This design matches closely the idea of “Model-view-viewmodel” aka. “MVVM”. If you’re unfamiliar with this term, be aware that the words “model” and “view” here aren’t talking about 3D models or cameras or transformation matrices. MVVM is an object-oriented architectural pattern, similar to MVC (“Model view controller”). MVVM maps to the design of a renderer as follows:

  • Model: The scene which the renderer should render.
  • View: The renderer that traverses the scene and displays it.
  • View model: The data the renderer uses internally to render the scene.

Another important inspiration of the design in this article is the relational database. While I don’t use an actual relational database like SQL in my projects, I think the principles of relational databases are useful to reason formally about designing the model in the MVVM, so I’ll talk a bit about that in this article too.

In my projects, the three parts of MVVM are split into two C++ classes: “Scene”, and “Renderer”. In short, the Scene stores the data to render, and the Renderer reads the Scene to render it. In this article I’ll first talk about how I design the Scene, and then I’ll talk about how I design the Renderer.


The Scene

The “Scene” class is very bare-bones: it contains no member functions whatsoever. Treat this class like a database: it has a bunch of tables in it, the tables contain data that may refer to other tables in the database, you can read/write the objects in the tables, and you can insert or erase elements from tables.

In a typical renderer, you might have a table of all the materials, a table of all the meshes, a table of all the instances of meshes, a table of all the cameras, and so on. The specifics of these tables depend on your project: if you’re doing a skeletal animation project, you would also have a table of skeletons, a table of instances of skeletons, and so forth.

I found that designing the schema of the database is a big part of the requirements gathering phase for a project. Once you nail the design of the schema, the code you have to write to manage your data becomes more obvious (“I just need to hook up all this information”). Later modifications to your code can be done by figuring out how your schema needs to change, which is made easier since the relationships between your objects lie bare all together in one place.

In practice, I implement the tables of the database as C++ containers. This allows me to use containers with the access and update patterns I need for the project I’m working on. For example, I might want the objects in a table to be contiguous in memory, or do insertion/deletion of objects in arbitrary order, or be able to efficiently lookup an object by its name.

The design of the Scene class involves many concepts from relational database design. Let’s take some time to consider how primary keys, foreign keys, and indexing apply to the design of the Scene.

Primary Keys and Foreign Keys

Tables usually have a “primary key”, which uniquely identifies each object in the table and can also be used to efficiently look up the object. When an object in one table wants to refer to an object in another table, it can do so by keeping a copy of the primary key of the referred object, a concept also known as a “foreign key”.

An example of this situation is a mesh instance keeping a reference to the mesh associated to it, and keeping a reference to the transform used to place the instance in the world. Each mesh instance would then have a foreign key to the mesh table, and a foreign key to the transform table.

Primary keys and foreign keys can be as simple as integer IDs, though some care is required if you want to avoid foreign keys becoming invalid when modifications are done on the table they refer to. For example, if you decide to use a std::vector as the container for a table, erasing an element from the vector might cause objects to get shifted around, which will invalidate foreign keys referring to them.
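Here’s a sketch of that invalidation hazard (ExMesh and ExInstance are illustrative stand-ins, not from the article’s code). The instance’s foreign key is a raw index into the mesh table; erasing an earlier mesh shifts everything after it, so the key silently stops pointing at the object it was meant to name:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in tables: the foreign key is a raw vector index.
struct ExMesh { std::string Name; };
struct ExInstance { size_t MeshID; }; // foreign key = index into the table

// Returns true if the foreign key still resolves to the same mesh after
// erasing the first element of the table.
bool ForeignKeySurvivesErase()
{
    std::vector<ExMesh> meshes = { {"crate"}, {"barrel"}, {"lamp"} };
    ExInstance inst = { 2 }; // refers to "lamp"

    meshes.erase(meshes.begin()); // erase "crate": later elements shift left

    return inst.MeshID < meshes.size() && meshes[inst.MeshID].Name == "lamp";
}
```

ForeignKeySurvivesErase() returns false here: after the erase, index 2 is past the end of the table.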

In order to support insertion and deletion, I use a container I call “packed_freelist”, based on bitsquid’s ID lookup table. This container gives each object an ID that won’t change as elements are inserted and deleted, and allows efficient lookup of objects from their ID. The packed_freelist also automatically packs all objects in it to be contiguous in memory, which is desirable for efficient traversal.

Another way to implement foreign keys is to update the foreign keys whenever the table they point to is modified, which might be automated by a more sophisticated scene database system that is aware of foreign key relationships. On the simpler side of the spectrum, if you know that a table will never change after initialization, or if you only support insertion through push_back(), you can be sure that your foreign keys will never be invalidated, with zero extra effort. You might also be able to use a C++ container that supports inserting/erasing elements without invalidating pointers, like std::set, and then use pointers to those elements to refer to other tables.

There exists a lot of formalism about primary keys and foreign keys in the world of relational databases, which you might want to explore further if you’re curious about the limits of this concept.


Indexes

It’s nice to have primary keys and all, but sometimes you also want to look up objects based on other attributes. For example, you might want to look up all meshes with the name “crate”. In the worst case, you can iterate through all the meshes looking for them, which is not a big deal if you do this search infrequently. However, if you do this search often, you might want to accelerate it.

In database-speak, accelerating lookups is done by building an index on a table. You can store the table itself in a way that supports efficient lookups, like sorting all meshes by their name, which makes binary searching possible. Storing the table itself in a sorted order is known as building a clustered index. Naturally, you can only have one clustered index per table, since you can’t keep a table sorted by two unrelated things at the same time. By default, your tables should support efficient lookup by their primary key.
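As a sketch of a clustered index (NamedMesh and the function name are stand-ins, not from the article’s code): if the table is kept sorted by name, a lookup is just std::lower_bound.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in mesh record.
struct NamedMesh { std::string Name; };

// Binary search in a table that is stored sorted by Name
// (i.e. the table's clustered index is on Name).
const NamedMesh* FindMeshByName(const std::vector<NamedMesh>& sortedMeshes,
                                const std::string& name)
{
    auto it = std::lower_bound(
        sortedMeshes.begin(), sortedMeshes.end(), name,
        [](const NamedMesh& m, const std::string& n) { return m.Name < n; });
    if (it != sortedMeshes.end() && it->Name == name)
        return &*it;
    return nullptr;
}
```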

If you want to have many indexes on a table, you can store an acceleration structure outside the table. For example, I could store a multimap alongside the table of meshes that lets me lookup mesh IDs by name in logarithmic time. The trouble is that every index needs to be kept in sync with the table, meaning that every additional non-clustered index adds further performance cost to insertions and deletions.
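A sketch of what keeping such an index in sync looks like (MeshTable is my own illustrative shape, with the “table” reduced to an id-to-name map for brevity): every insert and erase on the table must also touch the multimap, which is exactly the maintenance cost mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical table plus a non-clustered index on mesh name.
struct MeshTable
{
    std::unordered_map<uint32_t, std::string> Meshes; // the table
    std::multimap<std::string, uint32_t> NameIndex;   // name -> mesh ID

    void Insert(uint32_t id, const std::string& name)
    {
        Meshes[id] = name;
        NameIndex.emplace(name, id); // keep the index in sync
    }

    void Erase(uint32_t id)
    {
        // Remove only this id's entry from the index before erasing the row.
        auto range = NameIndex.equal_range(Meshes[id]);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == id) { NameIndex.erase(it); break; }
        }
        Meshes.erase(id);
    }
};
```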

The Code

To give you an idea of what this looks like, I’ll show you some examples of Scene setups in my projects, with varying amounts of adherence to the principles I explained so far in this article.

Exhibit A: “DoF”

I did a project recently where I wanted to implement real-time depth-of-field using summed area tables. The code for its scene.h is at the bottom of this section.

If you look at the code, you’ll see I was lazy and basically just hooked everything together using packed_freelists. I also defined two utility functions for adding stuff to the Scene: a function to add a mesh to the scene from an obj file, and a function to add a new instance of a mesh. The functions return the primary keys of the objects they added, which lets me make further modifications to an object after adding it, like setting the transform of a newly added mesh instance.

You can identify the foreign keys by the uint32_t “ID” variables, such as MeshID and TransformID. Beyond maybe overusing packed_freelist, the relationships between the tables are mostly obvious to navigate. You can also see I was lazy and didn’t implement removing elements, nor write a destructor that cleans up the OpenGL objects.

Not deleting OpenGL objects is something I often do out of laziness, and it barely matters because everything gets garbage collected when you destroy the OpenGL context anyways. I would actually recommend that approach for dealing with any OpenGL resources that stay around for the lifetime of your whole program. If you want to be a stickler for correctness, I’ve had success using shared_ptr with a custom deleter, as follows:

shared_ptr<GLuint> pBuffer(
    new GLuint(),
    [](GLuint* p){ glDeleteBuffers(1, p); delete p; });

glGenBuffers(1, pBuffer.get());

A small extension to this is to wrap the shared_ptr in a small class that makes get() return a GLuint, and adds a new member function “get_address_of()” that returns a GLuint*. You can then use get() for your normal OpenGL calls, and get_address_of() for functions like glGenBuffers and glDeleteBuffers. This is very similar to how ComPtr is used in D3D programs (with a very similar interface: Get() and GetAddressOf()).
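A sketch of that wrapper (the class name and details are my own invention; GLuint is typedef’d locally so the sketch compiles without GL headers):

```cpp
#include <cassert>
#include <memory>

// Stand-in typedef so this sketch is self-contained without GL headers.
typedef unsigned int GLuint;

// Hypothetical wrapper around shared_ptr<GLuint>: get() yields the raw
// handle for normal GL calls, get_address_of() yields the GLuint* that
// glGen* / glDelete*-style functions expect.
class GLHandle
{
public:
    template <typename Deleter>
    explicit GLHandle(Deleter d)
        : p_(new GLuint(0), [d](GLuint* p) { d(p); delete p; }) {}

    GLuint get() const { return *p_; }
    GLuint* get_address_of() { return p_.get(); }

private:
    std::shared_ptr<GLuint> p_;
};
```

Usage would look something like `GLHandle buf([](GLuint* p){ glDeleteBuffers(1, p); });` followed by `glGenBuffers(1, buf.get_address_of());` and `glBindBuffer(GL_ARRAY_BUFFER, buf.get());`.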

This shared_ptr thing is not the most efficient approach, but it’s straightforward standard C++, and it saves you the trouble of defining a bunch of RAII wrapper classes for OpenGL objects. I find such RAII wrapper classes are usually leaky abstractions on top of OpenGL anyways, and they also force me to figure out how somebody else’s wrapper classes work instead of being able to reuse my existing OpenGL knowledge to write plain reusable standard OpenGL code.

I should point out that since my Scene class is implemented only in terms of plain OpenGL objects, that makes it easy for anybody to write an OpenGL renderer that reads the contents of the tables and renders them in whatever way they see fit. One thing you might notice is that I also supply a ready-built Vertex Array Object with each mesh, which means that rendering a mesh is as simple as binding the VAO and making a DrawElements. Note you don’t have to bind any vertex buffers or the index buffer, since OpenGL’s Vertex Array Object automatically keeps a reference to the vertex buffers and index buffer that were attached to it when I loaded the model (It’s true!). For a better explanation about how vertex array objects work, check out my Vertex Specification Catastrophe slides.

glBindVertexArray(mesh.MeshVAO);
glDrawElements(GL_TRIANGLES, mesh.IndexCount, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);

The pinch is that OpenGL’s Vertex Array Object defines a mapping between vertex buffers and vertex shader attributes. I don’t use glGetAttribLocation to get the attribute locations, since I want to separate the mesh from the shaders used to render it. Instead, I use a convention for the attribute locations, and I establish this convention using the “preamble.glsl” file, as described in my article on GLSL Shader Live-Reloading, which allows me to establish an automatic convention for attribute locations between my C++ code and all my GLSL shaders.

You can find this source file (and the rest of the project it comes from) here:

#pragma once

#include "opengl.h"
#include "packed_freelist.h"

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

#include <vector>
#include <string>

struct DiffuseMap
{
    GLuint DiffuseMapTO;
};

struct Material
{
    std::string Name;

    float Ambient[3];
    float Diffuse[3];
    float Specular[3];
    float Shininess;
    uint32_t DiffuseMapID;
};

struct Mesh
{
    std::string Name;

    GLuint MeshVAO;
    GLuint PositionBO;
    GLuint TexCoordBO;
    GLuint NormalBO;
    GLuint IndexBO;

    GLuint IndexCount;
    GLuint VertexCount;

    std::vector<GLDrawElementsIndirectCommand> DrawCommands;
    std::vector<uint32_t> MaterialIDs;
};

struct Transform
{
    glm::vec3 Scale;
    glm::vec3 RotationOrigin;
    glm::quat Rotation;
    glm::vec3 Translation;
};

struct Instance
{
    uint32_t MeshID;
    uint32_t TransformID;
};

struct Camera
{
    // View
    glm::vec3 Eye;
    glm::vec3 Target;
    glm::vec3 Up;

    // Projection
    float FovY;
    float Aspect;
    float ZNear;
};

class Scene
{
public:
    packed_freelist<DiffuseMap> DiffuseMaps;
    packed_freelist<Material> Materials;
    packed_freelist<Mesh> Meshes;
    packed_freelist<Transform> Transforms;
    packed_freelist<Instance> Instances;
    packed_freelist<Camera> Cameras;

    uint32_t MainCameraID;

    // allocates memory for the packed_freelists,
    // since my packed_freelist doesn't have auto-resizing yet.
    void Init();
};

void LoadMeshes(
    Scene& scene,
    const std::string& filename,
    std::vector<uint32_t>* loadedMeshIDs);

void AddInstance(
    Scene& scene,
    uint32_t meshID,
    uint32_t* newInstanceID);

As an isolated example of how to load a mesh (using tinyobjloader), I’ll leave you with some example code. I hope this helps you understand how to use the Vertex Array Object API, which is tragically poorly designed.

// Load the mesh and its materials
std::vector<tinyobj::shape_t> shapes;
std::vector<tinyobj::material_t> materials;
std::string err;
if (!tinyobj::LoadObj(shapes, materials, err,
    "models/cube/cube.obj", "models/cube/"))
{
    fprintf(stderr, "Failed to load cube.obj: %s\n", err.c_str());
    return 1;
}

GLuint positionVBO = 0;
GLuint texcoordVBO = 0;
GLuint normalVBO = 0;
GLuint indicesEBO = 0;

// Upload per-vertex positions
if (!shapes[0].mesh.positions.empty())
{
    glGenBuffers(1, &positionVBO);
    glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
    glBufferData(GL_ARRAY_BUFFER,
        shapes[0].mesh.positions.size() * sizeof(float),
        shapes[0].mesh.positions.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

// Upload per-vertex texture coordinates
if (!shapes[0].mesh.texcoords.empty())
{
    glGenBuffers(1, &texcoordVBO);
    glBindBuffer(GL_ARRAY_BUFFER, texcoordVBO);
    glBufferData(GL_ARRAY_BUFFER,
        shapes[0].mesh.texcoords.size() * sizeof(float),
        shapes[0].mesh.texcoords.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

// Upload per-vertex normals
if (!shapes[0].mesh.normals.empty())
{
    glGenBuffers(1, &normalVBO);
    glBindBuffer(GL_ARRAY_BUFFER, normalVBO);
    glBufferData(GL_ARRAY_BUFFER,
        shapes[0].mesh.normals.size() * sizeof(float),
        shapes[0].mesh.normals.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

// Upload the indices that form triangles
if (!shapes[0].mesh.indices.empty())
{
    glGenBuffers(1, &indicesEBO);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indicesEBO);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER,
        shapes[0].mesh.indices.size() * sizeof(unsigned int),
        shapes[0].mesh.indices.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
}

// Hook up vertex/index buffers to a "vertex array object" (VAO)
// VAOs are the closest thing OpenGL has to a "mesh" object.
// VAOs feed data from buffers to the inputs of a vertex shader.
GLuint meshVAO;
glGenVertexArrays(1, &meshVAO);
glBindVertexArray(meshVAO);

// Attach position buffer as attribute 0
if (positionVBO != 0)
{
    // Note: glVertexAttribPointer sets the current
    // GL_ARRAY_BUFFER_BINDING as the source of data
    // for this attribute.
    // That's why we bind a GL_ARRAY_BUFFER before
    // calling glVertexAttribPointer then
    // unbind right after (to clean things up).
    glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE,
        sizeof(float) * 3, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Enable the attribute (they are disabled by default
    // -- this is very easy to forget!!)
    glEnableVertexAttribArray(0);
}


// Attach texcoord buffer as attribute 1
if (texcoordVBO != 0)
{
    // Note: glVertexAttribPointer sets the current
    // GL_ARRAY_BUFFER_BINDING as the source of data
    // for this attribute.
    // That's why we bind a GL_ARRAY_BUFFER before
    // calling glVertexAttribPointer then
    // unbind right after (to clean things up).
    glBindBuffer(GL_ARRAY_BUFFER, texcoordVBO);
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE,
        sizeof(float) * 2, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Enable the attribute (they are disabled by default
    // -- this is very easy to forget!!)
    glEnableVertexAttribArray(1);
}


// Attach normal buffer as attribute 2
if (normalVBO != 0)
{
    // Note: glVertexAttribPointer sets the current
    // GL_ARRAY_BUFFER_BINDING as the source of data
    // for this attribute.
    // That's why we bind a GL_ARRAY_BUFFER before
    // calling glVertexAttribPointer then
    // unbind right after (to clean things up).
    glBindBuffer(GL_ARRAY_BUFFER, normalVBO);
    glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE,
        sizeof(float) * 3, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Enable the attribute (they are disabled by default
    // -- this is very easy to forget!!)
    glEnableVertexAttribArray(2);
}


if (indicesEBO != 0)
{
    // Note: Calling glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
    // when a VAO is bound attaches the index buffer to the VAO.
    // From an API design perspective, this is subtle.
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indicesEBO);
}

// Done setting up the VAO; unbind it so later binds don't modify it.
glBindVertexArray(0);


// Can now bind the vertex array object to
// the graphics pipeline, to render with it.
// For example:
glBindVertexArray(meshVAO);

// (Draw stuff using vertex array object)

// And when done, unbind it from the graphics pipeline:
glBindVertexArray(0);

Exhibit B: “fictional-doodle”

This was a course project I did with my friend Leon Senft. We wanted to load a 3D mesh and be able to animate it in two modes: skeletal animation, and ragdoll physics. This project is way bigger than the DoF project, and it was also a much earlier project before I had the proper separation of MVVM in mind. Still, from a coding perspective, the project went very smoothly. Leon and I designed the schema together, and once we figured out how we were going to interface between the animation and the ragdoll physics systems, we were able to easily work on the project in parallel.

To decide how the two animation systems should communicate, we sketched a feedback loop flowchart that explained the order in which the two systems read and wrote their data at every frame, which mapped in a straightforward way to our code since the flowchart’s reads and writes corresponded to reads and writes of our scene database’s data. Below is the flowchart copied verbatim from the OneNote document we scratched it up in.


As you can see in the code at the bottom of this section, this schema has a lot more tables. You can also see some non-clustered indexes used to lookup textures by filename, using unordered_map. We used std::vector for most of the tables, mainly because I hadn’t developed the packed_freelist class yet, and because random insertion/deletion wasn’t a requirement we had anyways.

Again, this project worked with OpenGL in terms of plain OpenGL GLuints, and we didn’t go through the trouble of arbitrarily abstracting all that stuff. This project also used shader live-reloading, but it was through a method less polished than my newer ShaderSet design. Leon was working on a Mac, so we had to deal with OpenGL 4.1, which meant we had to use glGetUniformLocation because GL 4.1 doesn’t allow shader-specified uniform locations. This made live-reloading and shader input location conventions a lot more tedious than with newer GL versions.

Speaking of Mac support, it’s really painful that Mac’s GL doesn’t support GL_ARB_debug_output (standard in GL 4.3), which lets you get an automatic callback whenever a GL error happens. Instead, you have to call glGetError() after every API call, which is tedious and error-prone. To fix this, we wrote our own OpenGL header that wraps each OpenGL call we used inside a class that defines operator() and automatically calls glGetError() after every call. You can see our implementation here. It’s a bit disgusting, but it worked out well for us. I think it could be improved by wrapping the function pointers at the place where SDL_GL_GetProcAddress is called, which would make the interface more transparent.
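A sketch of the wrapping idea (not our actual header): a small class stores a GL function pointer and its name, and operator() forwards the call, then polls for an error. The “GL” functions below are stand-ins so the sketch runs without a GL context:

```cpp
#include <cassert>
#include <cstdio>

// Stand-in typedef and error state so this compiles without GL headers.
typedef unsigned int GLenum;

static GLenum g_fakeError = 0;
static GLenum fakeGetError() // stand-in for glGetError()
{
    GLenum e = g_fakeError;
    g_fakeError = 0;
    return e;
}

// Wraps a void-returning GL entry point; operator() forwards the call
// and then checks for an error, printing the function's name if one
// occurred.
template <typename... Args>
class CheckedGLCall
{
public:
    CheckedGLCall(const char* name, void (*fn)(Args...))
        : name_(name), fn_(fn) {}

    void operator()(Args... args) const
    {
        fn_(args...);
        GLenum err = fakeGetError(); // real code: glGetError()
        if (err != 0)
            fprintf(stderr, "%s: GL error 0x%x\n", name_, err);
    }

private:
    const char* name_;
    void (*fn_)(Args...);
};

// Stand-in for a wrapped GL entry point that raises an error when called.
static int g_clearCalls = 0;
static void fakeClear(GLenum mask)
{
    (void)mask;
    g_clearCalls++;
    g_fakeError = 0x0502; // GL_INVALID_OPERATION
}

static const CheckedGLCall<GLenum> glClearChecked("glClear", &fakeClear);
```

Calling `glClearChecked(...)` then behaves like the wrapped function, except any pending error is consumed and reported immediately, with the offending call’s name attached.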

You can find the source file below (and the rest of the project it comes from) here:

#pragma once

#include "opengl.h"
#include "shaderreloader.h"
#include "dynamics.h"
#include "profiler.h"

#include <glm/glm.hpp>
#include <glm/gtc/type_precision.hpp>
#include <glm/gtx/dual_quaternion.hpp>

#include <string>
#include <vector>
#include <unordered_map>

struct SDL_Window;

struct PositionVertex
{
    glm::vec3 Position;
};

struct TexCoordVertex
{
    glm::vec2 TexCoord;
};

struct DifferentialVertex
{
    glm::vec3 Normal;
    glm::vec3 Tangent;
    glm::vec3 Bitangent;
};

struct BoneWeightVertex
{
    glm::u8vec4 BoneIDs;
    glm::vec4 Weights;
};

// Each bone is either controlled by skinned animation or the physics simulation (ragdoll)
enum BoneControlMode
{
    BONECONTROL_ANIMATION,
    BONECONTROL_DYNAMICS
};

// Different types of scene nodes need to be rendered slightly differently.
enum SceneNodeType
{
    SCENENODETYPE_TRANSFORM,
    SCENENODETYPE_STATICMESH,
    SCENENODETYPE_SKINNEDMESH
};

struct SQT
{
    // No S, lol. md5 doesn't use scale.
    glm::vec3 T;
    glm::quat Q;
};

// Used for array indices, don't change!
enum SkinningMethod
{
    SKINNING_DLB, // Dual quaternion linear blending
    SKINNING_LBS  // Linear blend skinning
};

// Bitsets to say which components of the animation change every frame
// For example if an object does not rotate then the Q (Quaternion) channels are turned off, which saves space.
// When a channel is not set in an animation sequence, then the baseFrame setting is used instead.
enum AnimChannel
{
    ANIMCHANNEL_TX_BIT = 1, // X component of translation
    ANIMCHANNEL_TY_BIT = 2, // Y component of translation
    ANIMCHANNEL_TZ_BIT = 4, // Z component of translation
    ANIMCHANNEL_QX_BIT = 8, // X component of quaternion
    ANIMCHANNEL_QY_BIT = 16, // Y component of quaternion
    ANIMCHANNEL_QZ_BIT = 32 // Z component of quaternion
    // There is no quaternion W, since the quaternions are normalized so it can be deduced from the x/y/z.
};

// StaticMesh Table
// All unique static meshes.
struct StaticMesh
{
    GLuint MeshVAO; // vertex array for drawing the mesh
    GLuint PositionVBO; // position only buffer
    GLuint TexCoordVBO; // texture coordinates buffer
    GLuint DifferentialVBO; // differential geometry buffer (t,n,b)
    GLuint MeshEBO; // Index buffer for the mesh
    int NumIndices; // Number of indices in the static mesh
    int NumVertices; // Number of vertices in the static mesh
    int MaterialID; // The material this mesh was designed for
};

// Skeleton Table
// All unique static skeleton definitions.
struct Skeleton
{
    GLuint BoneEBO; // Indices of bones used for rendering the skeleton
    glm::mat4 Transform; // Global skeleton transformation to correct bind pose orientation
    std::vector<std::string> BoneNames; // Name of each bone
    std::unordered_map<std::string, int> BoneNameToID; // Bone ID lookup from name
    std::vector<glm::mat4> BoneInverseBindPoseTransforms; // Transforms a vertex from model space to bone space
    std::vector<int> BoneParents; // Bone parent index, or -1 if root
    std::vector<float> BoneLengths; // Length of each bone
    int NumBones; // Number of bones in the skeleton
    int NumBoneIndices; // Number of indices for rendering the skeleton as a line mesh
};

// BindPoseMesh Table
// All unique static bind pose meshes.
// Each bind pose mesh is compatible with one skeleton.
struct BindPoseMesh
{
    GLuint SkinningVAO; // Vertex arrays for input to skinning transform feedback
    GLuint PositionVBO; // vertices of meshes in bind pose
    GLuint TexCoordVBO; // texture coordinates of mesh
    GLuint DifferentialVBO; // differential geometry of bind pose mesh (n,t,b)
    GLuint BoneVBO; // bone IDs and bone weights of mesh
    GLuint EBO; // indices of mesh in bind pose
    int NumIndices; // Number of indices in the bind pose
    int NumVertices; // Number of vertices in the bind pose
    int SkeletonID; // Skeleton used to skin this mesh
    int MaterialID; // The material this mesh was designed for
};

// AnimSequence Table
// All unique static animation sequences.
// Each animation sequence is compatible with one skeleton.
struct AnimSequence
{
    std::string Name; // The human readable name of each animation sequence
    std::vector<SQT> BoneBaseFrame; // The base frame for each bone, which defines the initial transform.
    std::vector<uint8_t> BoneChannelBits; // Which animation channels are present in frame data for each bone
    std::vector<int> BoneFrameDataOffsets; // The offset in floats in the frame data for this bone
    std::vector<float> BoneFrameData; // All frame data for each bone allocated according to the channel bits.
    int NumFrames; // The number of key frames in this animation sequence
    int NumFrameComponents; // The number of floats per frame.
    int SkeletonID; // The skeleton that this animation sequence animates
    int FramesPerSecond; // Frames per second for each animation sequence
};

// AnimatedSkeleton Table
// Each animated skeleton instance is associated to an animation sequence, which is associated to one skeleton.
struct AnimatedSkeleton
{
    GLuint BoneTransformTBO; // The matrices used to transform the bones
    GLuint BoneTransformTO; // Texture descriptor for the palette
    GLuint SkeletonVAO; // Vertex array for rendering animated skeletons
    GLuint SkeletonVBO; // Vertex buffer object for the skeleton vertices
    int CurrAnimSequenceID; // The currently playing animation sequence for each skinned mesh
    int CurrTimeMillisecond; // The current time in the current animation sequence in milliseconds
    float TimeMultiplier; // Controls the speed of animation
    bool InterpolateFrames; // If true, interpolate animation frames
    std::vector<glm::dualquat> BoneTransformDualQuats; // Skinning palette for DLB
    std::vector<glm::mat3x4> BoneTransformMatrices; // Skinning palette for LBS
    std::vector<BoneControlMode> BoneControls; // How each bone is animated

    // Joint physical properties
    std::vector<glm::vec3> JointPositions;
    std::vector<glm::vec3> JointVelocities;
};

// SkinnedMesh Table
// All instances of skinned meshes in the scene.
// Each skinned mesh instance is associated to one bind pose, which is associated to one skeleton.
// The skeleton of the bind pose and the skeleton of the animated skeleton must be the same one.
struct SkinnedMesh
{
    GLuint SkinningTFO; // Transform feedback for skinning
    GLuint SkinnedVAO; // Vertex array for rendering skinned meshes.
    GLuint PositionTFBO; // Positions created from transform feedback
    GLuint DifferentialTFBO; // Differential geometry from transform feedback
    int BindPoseMeshID; // The ID of the bind pose of this skinned mesh
    int AnimatedSkeletonID; // The animated skeleton used to transform this mesh
};

// Ragdoll Table
// All instances of ragdoll simulations in the scene.
// Each ragdoll simulation is associated to one animated skeleton.
struct Ragdoll
{
    int AnimatedSkeletonID; // The animated skeleton that is being animated physically
    std::vector<Constraint> BoneConstraints; // all constraints in the simulation used to stop the skeleton from separating
    std::vector<glm::ivec2> BoneConstraintParticleIDs; // particle IDs used in the bone distance constraints
    std::vector<glm::ivec3> JointConstraintParticleIDs; // particle IDs used in the joint angular constraints
    std::vector<Hull> JointHulls; // the collision hulls associated to every joint
};

// DiffuseTexture Table
struct DiffuseTexture
{
    bool HasTransparency; // If the alpha of this texture has some values < 1.0
    GLuint TO; // Texture object
};

// SpecularTexture Table
struct SpecularTexture
{
    GLuint TO; // Texture object
};

// NormalTexture Table
struct NormalTexture
{
    GLuint TO; // Texture object
};

// Material Table
// Each material is associated to DiffuseTextures (or -1 if not present)
// Each material is associated to SpecularTextures (or -1 if not present)
// Each material is associated to NormalTextures (or -1 if not present)
struct Material
{
    std::vector<int> DiffuseTextureIDs; // Diffuse textures (if present)
    std::vector<int> SpecularTextureIDs; // Specular textures (if present)
    std::vector<int> NormalTextureIDs; // Normal textures (if present)
};

struct TransformSceneNode
    // Empty node with no purpose other than making nodes relative to it

struct StaticMeshSceneNode
    int StaticMeshID; // The static mesh to render

struct SkinnedMeshSceneNode
    int SkinnedMeshID; // The skinned mesh to render

// SceneNode Table
struct SceneNode
    glm::mat4 LocalTransform; // Transform relative to parent (or relative to world if no parent exists)
    glm::mat4 WorldTransform; // Transform relative to world (updated from LocalTransform)

    SceneNodeType Type; // What type of node this is.
    int TransformParentNodeID; // The node this node is placed relative to, or -1 if none

    union {
        TransformSceneNode AsTransform;
        StaticMeshSceneNode AsStaticMesh;
        SkinnedMeshSceneNode AsSkinnedMesh;
    };

struct Scene
    std::vector<StaticMesh> StaticMeshes;

    std::vector<Skeleton> Skeletons;

    std::vector<BindPoseMesh> BindPoseMeshes;

    std::vector<AnimSequence> AnimSequences;

    std::vector<AnimatedSkeleton> AnimatedSkeletons;

    std::vector<SkinnedMesh> SkinnedMeshes;

    std::vector<Ragdoll> Ragdolls;

    std::vector<DiffuseTexture> DiffuseTextures;
    std::unordered_map<std::string, int> DiffuseTextureNameToID;

    std::vector<SpecularTexture> SpecularTextures;
    std::unordered_map<std::string, int> SpecularTextureNameToID;

    std::vector<NormalTexture> NormalTextures;
    std::unordered_map<std::string, int> NormalTextureNameToID;

    std::vector<Material> Materials;
    std::vector<SceneNode> SceneNodes;

    // Skinning shader programs that output skinned vertices using transform feedback.
    std::vector<const char*> SkinningOutputs;
    ReloadableShader SkinningDLB{ "skinning_dlb.vert" };
    ReloadableShader SkinningLBS{ "skinning_lbs.vert" };
    ReloadableProgram SkinningSPs[2];
    GLint SkinningSP_BoneTransformsLoc;

    // Scene shader. Used to render objects in the scene which have their geometry defined in world space.
    ReloadableShader SceneVS{ "scene.vert" };
    ReloadableShader SceneFS{ "scene.frag" };
    ReloadableProgram SceneSP{ &SceneVS, &SceneFS };
    GLint SceneSP_ModelWorldLoc;
    GLint SceneSP_WorldModelLoc;
    GLint SceneSP_ModelViewLoc;
    GLint SceneSP_ModelViewProjectionLoc;
    GLint SceneSP_WorldViewLoc;
    GLint SceneSP_CameraPositionLoc;
    GLint SceneSP_LightPositionLoc;
    GLint SceneSP_WorldLightProjectionLoc;
    GLint SceneSP_DiffuseTextureLoc;
    GLint SceneSP_SpecularTextureLoc;
    GLint SceneSP_NormalTextureLoc;
    GLint SceneSP_ShadowMapTextureLoc;
    GLint SceneSP_IlluminationModelLoc;
    GLint SceneSP_HasNormalMapLoc;
    GLint SceneSP_BackgroundColorLoc;

    // Skeleton shader program used to render bones.
    ReloadableShader SkeletonVS{ "skeleton.vert" };
    ReloadableShader SkeletonFS{ "skeleton.frag" };
    ReloadableProgram SkeletonSP{ &SkeletonVS, &SkeletonFS };
    GLint SkeletonSP_ColorLoc;
    GLint SkeletonSP_ModelViewProjectionLoc;

    // Shadow shader program to render vertices into the shadow map
    ReloadableShader ShadowVS{ "shadow.vert" };
    ReloadableShader ShadowFS{ "shadow.frag" };
    ReloadableProgram ShadowSP{ &ShadowVS, &ShadowFS };
    GLint ShadowSP_ModelLightProjectionLoc;

    // true if all shaders in the scene are compiling/linking successfully.
    // Scene updates stop if any shader is broken, since updating would likely crash otherwise.
    bool AllShadersOK;

    // Camera placement, updated each frame.
    glm::vec3 CameraPosition;
    glm::vec4 CameraQuaternion;
    glm::mat3 CameraRotation; // updated from quaternion every frame

    glm::vec3 BackgroundColor;

    // The camera only reads user input when it is enabled.
    // Needed to implement menu navigation without the camera moving due to mouse/keyboard action.
    bool EnableCamera;

    // The current skinning method used to skin all meshes in the scene
    SkinningMethod MeshSkinningMethod;

    // For debugging skeletal animations
    bool ShowBindPoses;
    bool ShowSkeletons;

    // Placing the hellknight (for testing)
    int HellknightTransformNodeID;
    glm::vec3 HellknightPosition;

    bool IsPlaying;
    bool ShouldStep;

    // Damping coefficient for ragdolls
    // 1.0 = rigid body
    float RagdollBoneStiffness;
    float RagdollJointStiffness;
    float RagdollDampingK;

    float Gravity;

    glm::vec3 LightPosition;

    Profiler Profiling;

    // Exponential weighted moving averages for profiling statistics
    std::unordered_map<std::string,float> ProfilingEMAs;

void InitScene(Scene* scene);

void UpdateScene(Scene* scene, SDL_Window* window, uint32_t deltaMilliseconds);
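One detail from the listing above: the ProfilingEMAs map implies exponentially weighted smoothing of per-label timings. A minimal sketch of that update, where the function name and the default smoothing factor are my own choices for illustration:

```cpp
#include <string>
#include <unordered_map>

// Blend a new timing sample into the running average for its label.
// alpha = 0.1 is arbitrary: higher reacts faster, lower is smoother.
void UpdateEMA(std::unordered_map<std::string, float>& emas,
               const std::string& label, float sampleMs, float alpha = 0.1f)
{
    auto it = emas.find(label);
    if (it == emas.end()) {
        emas[label] = sampleMs; // first sample initializes the average
    } else {
        it->second = alpha * sampleMs + (1.0f - alpha) * it->second;
    }
}
```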


The whole point of the Scene was to store the data of the scene you want to render, without having to specify exactly how it should be rendered. This gives you the freedom of being able to implement different renderers for the same scene data, which is useful when you’re making a bunch of different OpenGL projects. You can easily implement a new rendering technique without even having to touch the scene loading code, which is nice.

My implementation of Renderer usually has three member functions: Init(), Resize(width, height), and Render(Scene*). Init() does the one-time setup of renderer data, Resize() deletes and recreates all textures that need to be sized according to the screen, and Render() traverses the Scene and renders the data inside it.

I usually begin the program by calling Init() followed by an initial Resize(), which creates all the screen-sized textures for the first time. The code inside Resize() deletes and recreates all textures when the window gets resized, and since deleting is a no-op for null textures it’s safe to call it even after the first Init(). After that, Render() is just called at every frame to render the scene. Before calling Render(), I might also call an Update() method that does any necessary simulation work, like if I want to run some game logic. You can also be lazy and write that simulation update code in the Render(), whatever makes sense.
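The lifecycle above can be sketched as a plain class. The names and the fake handle type here are hypothetical, and the GL calls are stubbed out so the shape of Init/Resize/Render is visible; note how destroying a zero handle is a no-op, which is what makes the initial Resize() safe:

```cpp
using Handle = unsigned int; // stand-in for GLuint

// Stubs standing in for glDeleteTextures/glGenTextures.
// Like GL, destroying handle 0 is harmless.
static Handle gNextHandle = 1;
void DestroyTexture(Handle& h) { h = 0; }
Handle CreateTexture(int w, int h) { (void)w; (void)h; return gNextHandle++; }

struct Renderer {
    Handle mColorTO = 0; // screen-sized texture, recreated on resize
    int mWidth = 0, mHeight = 0;

    void Init() { /* one-time setup of renderer data */ }

    void Resize(int width, int height) {
        // Safe even on the very first call: mColorTO is still 0.
        DestroyTexture(mColorTO);
        mColorTO = CreateTexture(width, height);
        mWidth = width; mHeight = height;
    }

    void Render(/* const Scene* scene */) {
        // traverse the scene and draw into the screen-sized targets
    }
};
```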

Your renderer probably has to maintain some data that is used to implement whatever specific rendering algorithm you have in mind. This data is what constitutes the viewmodel. I usually don’t separate the viewmodel into its own class, but I guess you could if you somehow find that useful. The “view” itself is basically the Render() function, which uses the model and viewmodel to produce a frame of video.

Below is an example of part of the Resize() function inside the “DoF” project’s renderer. As you can see, I’m deleting and recreating a color and depth texture that need to be sized according to the window, and I also recreate the FBO used to render with them. I’m still sticking to my guns about writing plain standard OpenGL code.

glDeleteTextures(1, &mBackbufferColorTOSS);
glGenTextures(1, &mBackbufferColorTOSS);
glBindTexture(GL_TEXTURE_2D, mBackbufferColorTOSS);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_SRGB8_ALPHA8, mBackbufferWidth, mBackbufferHeight);
glBindTexture(GL_TEXTURE_2D, 0);

glDeleteTextures(1, &mBackbufferDepthTOSS);
glGenTextures(1, &mBackbufferDepthTOSS);
glBindTexture(GL_TEXTURE_2D, mBackbufferDepthTOSS);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT32F, mBackbufferWidth, mBackbufferHeight);
glBindTexture(GL_TEXTURE_2D, 0);

glDeleteFramebuffers(1, &mBackbufferFBOSS);
glGenFramebuffers(1, &mBackbufferFBOSS);
glBindFramebuffer(GL_FRAMEBUFFER, mBackbufferFBOSS);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, mBackbufferColorTOSS, 0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, mBackbufferDepthTOSS, 0);
GLenum fboStatus = glCheckFramebufferStatus(GL_FRAMEBUFFER);
if (fboStatus != GL_FRAMEBUFFER_COMPLETE) {
    fprintf(stderr, "glCheckFramebufferStatus: %x\n", fboStatus);
}
glBindFramebuffer(GL_FRAMEBUFFER, 0);

The code I write inside the Scene loading and inside Init() and Resize() tries to be careful about binding. I always set my bindings back to zero after I’m done working with an object, since I don’t want to leak any state. This is especially important for stuff like Vertex Array Objects, which interact in highly non-obvious ways with the buffer bindings (see the mesh loading example code earlier in this article.)

When writing code in the Render() function, keeping your state clean becomes more difficult, since you probably are going to set a bunch of OpenGL pipeline state to get the effects you want. One approach to this problem is to glGet the old state, set your state, do your stuff, then set the old state back. This is maybe a best-effort approach for OpenGL middleware, but I really hate having to type all the glGet stuff. For this reason, I generally write code that assumes the pipeline is in its default state, and I make sure to re-set the pipeline to its default state at the end of my work. This leaves me with no surprises. Notice that this is yet another place where I try to express everything in terms of plain standard OpenGL code: it makes it easy to see which pipeline state is set where, so you can simply read a block of code and check that all its state is set and reset properly.

As an example of Render(), consider the following bit of code that renders all the models in the scene:

glBindFramebuffer(GL_FRAMEBUFFER, mBackbufferFBOMS);
glViewport(0, 0, mBackbufferWidth, mBackbufferHeight);

glClearColor(100.0f / 255.0f, 149.0f / 255.0f, 237.0f / 255.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

Camera& mainCamera = mScene->Cameras[mScene->MainCameraID];

glm::vec3 eye = mainCamera.Eye;
glm::vec3 up = mainCamera.Up;

glm::mat4 V = glm::lookAt(eye, mainCamera.Target, up);
glm::mat4 P;
{
    float f = 1.0f / tanf(mainCamera.FovY / 2.0f);
    P = glm::mat4(
        f / mainCamera.Aspect, 0.0f, 0.0f, 0.0f,
        0.0f, f, 0.0f, 0.0f,
        0.0f, 0.0f, 0.0f, -1.0f,
        0.0f, 0.0f, mainCamera.ZNear, 0.0f);
}

glm::mat4 VP = P * V;


glUniform3fv(SCENE_CAMERAPOS_UNIFORM_LOCATION, 1, value_ptr(eye));

for (uint32_t instanceID : mScene->Instances)
    const Instance* instance = &mScene->Instances[instanceID];
    const Mesh* mesh = &mScene->Meshes[instance->MeshID];
    const Transform* transform = &mScene->Transforms[instance->TransformID];

    glm::mat4 MW;
    MW = translate(-transform->RotationOrigin) * MW;
    MW = mat4_cast(transform->Rotation) * MW;
    MW = translate(transform->RotationOrigin) * MW;
    MW = scale(transform->Scale) * MW;
    MW = translate(transform->Translation) * MW;

    glm::mat3 N_MW;
    N_MW = mat3_cast(transform->Rotation) * N_MW;
    N_MW = glm::mat3(scale(1.0f / transform->Scale)) * N_MW;

    glm::mat4 MVP = VP * MW;

    glUniformMatrix4fv(SCENE_MW_UNIFORM_LOCATION, 1, GL_FALSE, value_ptr(MW));
    glUniformMatrix3fv(SCENE_N_MW_UNIFORM_LOCATION, 1, GL_FALSE, value_ptr(N_MW));
    glUniformMatrix4fv(SCENE_MVP_UNIFORM_LOCATION, 1, GL_FALSE, value_ptr(MVP));

    for (size_t meshDrawIdx = 0; meshDrawIdx < mesh->DrawCommands.size(); meshDrawIdx++)
        const GLDrawElementsIndirectCommand* drawCmd = &mesh->DrawCommands[meshDrawIdx];
        const Material* material = &mScene->Materials[mesh->MaterialIDs[meshDrawIdx]];

        if (material->DiffuseMapID == -1) {
            glBindTexture(GL_TEXTURE_2D, 0);
        } else {
            const DiffuseMap* diffuseMap = &mScene->DiffuseMaps[material->DiffuseMapID];
            glBindTexture(GL_TEXTURE_2D, diffuseMap->DiffuseMapTO);
        }

        glUniform3fv(SCENE_AMBIENT_UNIFORM_LOCATION, 1, material->Ambient);
        glUniform3fv(SCENE_DIFFUSE_UNIFORM_LOCATION, 1, material->Diffuse);
        glUniform3fv(SCENE_SPECULAR_UNIFORM_LOCATION, 1, material->Specular);
        glUniform1f(SCENE_SHININESS_UNIFORM_LOCATION, material->Shininess);

        glDrawElementsInstancedBaseVertexBaseInstance(
            GL_TRIANGLES, drawCmd->count,
            GL_UNSIGNED_INT, (GLvoid*)(sizeof(uint32_t) * drawCmd->firstIndex),
            drawCmd->instanceCount, drawCmd->baseVertex, drawCmd->baseInstance);

glBindTextures(0, kMaxTextureCount, NULL);

glBindFramebuffer(GL_FRAMEBUFFER, 0);

As you can see, I set the state at the start of this rendering pass, then reset it back to its default at the end. Since it’s just one block of code, I can easily double-check the code to make sure that there’s an associated reset for every set. There are also some shortcuts: for example, you can use “glBindTextures(0, 32, NULL);” to unbind all textures at once.

There are still some parts of OpenGL state which I don’t bother to reset. First, I don’t reset the viewport, since the viewport’s default value is mostly arbitrary anyways (the size of the window). Instead, I adopt the convention that one should always set the viewport immediately after setting a framebuffer. I’ve seen too many bugs from people forgetting to set the viewport before rendering, so I’d rather make it a coding convention not to assume what the viewport is. There is no default viewport in D3D12 anyways, so this convention makes sense moving forward. Another example is the clear color and clear depth, which I always set to the desired values before calling glClear (again, you have to do this in newer APIs too, including GL 4.5’s DSA).

Another example of state I don’t bother to reset is the uniforms I set on my programs. OpenGL does remember what uniforms have been set to a program, but I choose to have the convention that one should not rely on this caching. Instead, you should by convention set every uniform a program has between calling glUseProgram() and glDraw*(). Anyways, D3D12 also forces you to re-upload all your root parameters (uniforms) every time you switch root signature (the program’s interface for bindings), so this convention also makes sense moving forward.

There also exist more automatic solutions to preventing binding leakage. If you’re able to define your rendering code as a series of nested loops, then you can be sure that you’re setting all necessary state before each further nested loop, and you can reset the state all together in one place. Another way to design these nested loops is sort-based draw call bucketing, which automatically iterates over all the draw calls in your program in an order that respects which state changes are expensive and which ones aren’t. Since that design also loops over draw calls in order of state changes, you again get guarantees about the lifetime of your state changes, so you can reset any state at the place that produces the least waste. This was approximately the design we used in fictional-doodle’s renderer.
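A bare-bones version of that sort-based bucketing idea, assuming a made-up 64-bit key where the program ID occupies the most significant bits so it changes least often:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical draw record: the key encodes expensive state in high bits.
struct Draw {
    uint32_t program;  // most expensive to change
    uint32_t texture;  // cheaper to change
    uint64_t SortKey() const { return (uint64_t(program) << 32) | texture; }
};

// Sort draws so that submitting them in order touches expensive state least often.
void SortDraws(std::vector<Draw>& draws)
{
    std::sort(draws.begin(), draws.end(),
              [](const Draw& a, const Draw& b) { return a.SortKey() < b.SortKey(); });
}

// Count how often the program would need rebinding during submission.
int CountProgramBinds(const std::vector<Draw>& draws)
{
    int binds = 0;
    uint32_t current = ~0u;
    for (const Draw& d : draws) {
        if (d.program != current) { current = d.program; binds++; }
    }
    return binds;
}
```

Because all draws sharing a program now sit next to each other, the submission loop knows exactly where a program’s “lifetime” ends and can reset its state there.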


Conclusions and Future Work

The details of all this stuff are not super important, and the amount of engineering effort you should invest depends on the scale of your projects. Since I’m usually just implementing graphics techniques in isolation, I don’t have to worry too much about highly polished interfaces. I’ve enjoyed the benefits of being able to reuse Scene code, and I’ve been able to quickly write new Renderers by copying and pasting another one as base code. My projects that separate the Scene and Renderer have so far been enjoyable, straightforward, and productive to work on, so I consider this design successful.

One part where this starts to crack is if you’re implementing a whole bunch of rendering techniques in one project, like when you’re trying to compare them against each other. You can probably get away with stuffing everything into the Scene and Renderer, but there’s a certain threshold where you can no longer keep the whole project in your mind at the same time, and it becomes important to limit the set of data that a given function can access to better understand the code. You might be able to get away with grouping the model or viewmodel data into different structs, whatever helps keep you sane.

Another problem which I haven’t addressed much is multi-threading. OpenGL projects generally don’t worry much about multi-threading, since the OpenGL API abstracts almost all of that away from you. This abstraction is partly what makes OpenGL a good API for prototyping rendering techniques, but in the future I’d like to apply this design to newer APIs like Vulkan and Direct3D 12. I suspect that I’ll have to identify which data varies per-frame (in order to have multiple frames in flight), while using some encapsulation to prevent one frame from reading data that belongs to another frame, and to avoid global state in general. I also suspect that the design of Render() will have to change to allow multi-threaded evaluation (to write commands faster using CPU parallelism); I’m currently looking at TBB flow graph for a solution.

Appendix A: Transformation Hierarchies

It’s commonly desired to have a hierarchy of transforms to render a scene, sometimes called a “scene graph”. In practice, you shouldn’t think too hard about the whole scene graph thing, since that mindset has a lot of problems. If you do want to implement a tree of transforms, I would recommend keeping the links between transforms separate from the transforms themselves. For example, you might have a table for all the transforms, a table for mesh instances, and a table for meshes. Each mesh instance then keeps a foreign key to a transform and mesh that represent it. To support relative transformations, I suggest creating a new table that stores the edges of the transformation tree, meaning each object in the table stores the foreign keys of two transforms: The parent and the child.

The goal of this design is to compute the absolute transform of each object at each frame, based on the relative transforms (in the transforms table) and the edges between transforms (in the transform edge table). To do this, make a copy of the local transforms array, then iterate over the transform edges in a partial order determined by the depth in the tree of each edge’s child. By propagating transforms along the edges in this order, you’ll always update a node’s parent before the node itself, meaning the transforms propagate properly from the top of the tree to the bottom. You can then use these absolute transforms as model matrices to render your objects.
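Here’s a sketch of that propagation with 1-D translations standing in for matrices (so composition is just addition); the table layout and names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Edge of the transform tree: both ends are indices into the transform table.
struct TransformEdge {
    int Parent;
    int Child;
    int ChildDepth; // depth of Child in the tree, used to order propagation
};

// transforms[i] starts out relative to node i's parent (a 1-D translation
// here for brevity). Returns absolute transforms, propagated top-down.
std::vector<float> PropagateTransforms(std::vector<float> transforms, // copy of locals
                                       std::vector<TransformEdge> edges)
{
    // Partial order: shallower children first, so a parent is always
    // absolute by the time its children read it.
    std::sort(edges.begin(), edges.end(),
              [](const TransformEdge& a, const TransformEdge& b) {
                  return a.ChildDepth < b.ChildDepth;
              });
    for (const TransformEdge& e : edges) {
        transforms[e.Child] = transforms[e.Parent] + transforms[e.Child];
    }
    return transforms;
}
```

Note that only nodes that appear as a Child in the edge table cost anything; nodes parented directly to the world need no edge at all, which is where the "cost proportional to edges" property comes from.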

By keeping the edges between the transforms in their own table, the cost of the propagation is proportional to the number of edges, rather than the number of nodes. This is nice if you have fewer edges than nodes, which is likely if your tree is mostly flat. What I mean is that scene tree designs can easily end up with one root node with a bazillion children, which is just useless, and this design avoids paying for it. It also keeps the data structure for your scene nice and flat, without requiring a recursive scene node data structure (as many scene graph designs end up with).

Then again, I didn’t apply this in the examples earlier in this article, and everything still turned out ok. Pick the design that makes sense for your workload.

Appendix B: User Interface

You probably want some amount of GUI to work with the renderer. Hands down, the easiest solution is to use Dear ImGui. You can easily integrate it into your OpenGL project by copy-pasting the code from one of their example programs. ImGui is used in the dof program mentioned earlier in this article, so you can consult how I used it there too if you’re interested.

Working with cameras is also important for 3D programming of course. Programming a camera is kinda tedious, so I have some ready-made libraries for that too. You can use flythrough_camera to have a camera that lets you fly around the 3D world, and you can use arcball_camera to have a camera that lets you rotate around the scene. Both of these libraries are a single C file you can just add to your project, and you use them by calling an immediate-mode style update function once per frame. These libraries tend to integrate naturally.

Appendix C: Loading OpenGL Functions

One of my pet peeves with typical OpenGL programs is that people use horribly over-engineered libraries like GLEW in order to dynamically link OpenGL functions. You can do something way simpler if you only need to use Core GL and the few extra nonstandard functions that are actually useful. What I did was write a script that reads the contents of Khronos’ official glcorearb.h header and produces my own opengl.h and opengl.cpp files. The header declares extern function pointers to all OpenGL entry points, and opengl.cpp uses SDL’s SDL_GL_GetProcAddress function to load all the functions.

To use it, you just have to add those two files to your project, call OpenGL_Init() once after creating your OpenGL context, and you’re good to go. No need to link an external library (like GLEW), and no need to deal with all of GLEW’s cruft. You can find the files here: opengl.h and opengl.cpp. You can easily modify them by hand if you want to adapt them to GLFW. Since these files are so simple, they’re much more “hackable”, allowing you to do things like wrapping calls in glGetError (as was done in the “fictional-doodle” project, as I explained earlier).

As a bonus, this header also defines GLDrawElementsIndirectCommand and GLDrawArraysIndirectCommand, which are not in the standard GL headers but exist only in the GL documentation for indirect rendering. You can use these structs to pass around draw arguments (perhaps storing them once up front in your Scene) or use them in glDraw*Indirect calls.
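For reference, these are the layouts as given in the OpenGL specification’s description of indirect draws (the spec names the instance count primCount; my header’s exact field spellings may differ):

```cpp
#include <cstdint>

typedef uint32_t GLuint;
typedef int32_t GLint;

// Matches the layout glDrawElementsIndirect reads from the
// GL_DRAW_INDIRECT_BUFFER (GL 4.0+ / ARB_draw_indirect).
struct GLDrawElementsIndirectCommand {
    GLuint count;         // number of indices
    GLuint instanceCount; // number of instances ("primCount" in the spec)
    GLuint firstIndex;    // offset into the index buffer, in indices
    GLint  baseVertex;    // value added to each fetched index
    GLuint baseInstance;  // first instance ID
};

// Matches the layout glDrawArraysIndirect reads.
struct GLDrawArraysIndirectCommand {
    GLuint count;         // number of vertices
    GLuint instanceCount; // number of instances
    GLuint first;         // first vertex
    GLuint baseInstance;  // first instance ID
};
```

Both structs are tightly packed, which is what lets you memcpy them straight into an indirect draw buffer.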

TBB Flow Graph Nodes Summary

I wanted to design a multi-threaded DX12 rendering pipeline, and saw that Intel’s TBB Flow Graph might make things a lot easier for me. I felt that the documentation for the different flow graph nodes was poorly organized, which made it difficult to understand what all the different flow graph nodes do at a high level. For this reason, I wrote this blog post as a summary that puts it all in one place.

This blog post isn’t a full tutorial on TBB Flow Graph. You should read the Intel-provided documentation first. This blog post is mostly a summary in my own words of the official documentation.

If there are mistakes in my interpretation of the official documentation, or if there are important details missing, please tell me about it so I can correct it. I’ve left a few “unanswered question” bullet points in this article for information that seems lacking from the official documentation. If you have a good answer to these questions, please let me know.

Without further ado, let’s begin the summary.

Functional Nodes

The functional nodes are source_node, function_node, continue_node, and multifunction_node. These nodes have the purpose of doing computations on their inputs (if they have any) and passing outputs to their successors.



The source_node has no predecessors. It repeatedly evaluates its body and outputs the results. The source_node is always serial, so its body will never be called concurrently. If no successor accepts a message, the message is put in a buffer, and it will be re-sent if a successor is added or a successor calls try_get().

The source_node keeps outputting items until its body returns false.

  • Unanswered question: can it be reactivated once the body returns false?



The function_node receives an input, applies the body’s function to it, and broadcasts its output to its successors.

The function_node’s maximum concurrency can be specified in its constructor. Setting the concurrency to “tbb::flow::serial” means the body will never be executed concurrently, while setting it to “tbb::flow::unlimited” puts no limit on the body’s concurrency. The concurrency can also be set to a specific number.

If the function_node’s concurrency is already maximized and it receives yet another input, the behavior depends on its graph_buffer_policy template argument. If the buffer policy is “queueing”, then extra inputs will be put on an internal unbounded input buffer until they can be processed. Alternatively, if the policy is “rejecting”, then the function_node will reject the extra inputs, and will instead pull from its predecessors when it becomes ready to handle more inputs. The internal buffer’s allocator can be customized.



The continue_node waits for its predecessors to finish before starting its successors. It counts how many times try_put() was called on it, and when the counter reaches a target value its body is executed and its result is broadcast to its successors. By default, the target value is the number of predecessors.

Despite having N predecessors, the continue_node receives no input from them. To emphasize: the continue_node is not a join_node, and no data flows in from its predecessors. However, the continue_node can still broadcast its own output to its successors.

  • Note: The countdown is illustrated by the 3 horizontal bars on the left side of the diagram above.
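The counting behavior can be modeled in a few lines of plain C++ (this is an illustration of the semantics, not TBB code; the class and method names are made up):

```cpp
#include <functional>
#include <utility>

// Mimics continue_node's trigger logic: fire the body each time the
// number of try_put() calls reaches the target count.
class ContinueCounter {
public:
    ContinueCounter(int target, std::function<void()> body)
        : mTarget(target), mBody(std::move(body)) {}

    void TryPut() {
        if (++mCount == mTarget) {
            mCount = 0;   // reset so the node can fire again next round
            mBody();      // "broadcast" to successors
        }
    }

private:
    int mTarget;
    int mCount = 0;
    std::function<void()> mBody;
};
```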


continue_msg is an empty struct you can use as a placeholder type, such as when you just want to kick off a successor. Its use is not limited to continue_node, though the name might lead you to think so, which is why I note it here.



The multifunction_node is similar to the function_node, with the main difference that it has multiple output ports. The multifunction_node’s body can try_put() to many of its output ports in one invocation. The outputs are represented as a tuple of output ports, which allows you to hook up output ports to successors, and allows you to decide which ports to send outputs to from within the body.

Note: “multifunction_node” used to be called “multioutput_function_node”

Buffering Nodes

Buffering nodes allow you to pass data from one node to another using a specific buffering scheme.



The buffer_node contains internal storage that stores the messages it receives. The allocator for the buffer can be customized.

Buffered messages are forwarded to the successors in an arbitrary order. Each message is sent to only a single successor, and this successor is chosen by trying to send the message to each successor in the order they were registered, until a successor accepts the message.

Successors can reserve a message from the buffer, which gives the successor ownership of that message without removing it from the buffer. The successor can then either release ownership (leaving the message in the buffer for another successor to grab), or consume the message (removing it from the buffer.) While a message is reserved, other messages can still be added or removed from the buffer.



The priority_queue_node is similar to the buffer_node. The difference is that the messages are forwarded in order of largest priority first, as defined by a Compare template argument (defaulting to std::less).

Unlike the buffer_node, if a successor reserves a message from the priority queue, no other reservations can be made (since this would violate the priority queue’s order.) However, messages can still be pushed to the priority queue while it is reserved.

  • Unanswered question: What happens if you reserve a priority queue, then push a message that makes its way to the front of the queue?



The queue_node is similar to the priority_queue_node, except that messages are forwarded in first-in first-out (FIFO) order, i.e. the order of their insertion.



The sequencer_node is similar to the priority_queue_node, except that messages are sorted in sequence order. Sequence numbers are of type size_t, and the sequence number of a given message is determined by a user-specified function.
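The reordering can be illustrated without TBB: buffer messages keyed by their sequence number and only release the next expected one (the class and method names here are made up):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Releases messages in sequence order, regardless of arrival order.
template <typename T>
class SequencerBuffer {
public:
    // seq is the user-assigned sequence number of this message.
    void Put(std::size_t seq, T value) { mPending[seq] = value; }

    // Pop every message that is ready, in sequence order.
    std::vector<T> DrainReady() {
        std::vector<T> out;
        while (!mPending.empty() && mPending.begin()->first == mNext) {
            out.push_back(mPending.begin()->second);
            mPending.erase(mPending.begin());
            ++mNext;
        }
        return out;
    }

private:
    std::size_t mNext = 0;             // next sequence number to release
    std::map<std::size_t, T> mPending; // out-of-order arrivals wait here
};
```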



The overwrite_node contains a buffer big enough for only a single message. When the value is set, the new value is broadcast to all successors, but (unlike other buffer nodes) the value doesn’t get removed from the buffer when it gets sent to successors. Successors can still call try_get() and receive the same value. The node’s buffer is considered invalid until it gets set for the first time.



The write_once_node is similar to the overwrite_node. However, the single value in the buffer cannot be overwritten. If you try to put while a value is already in the buffer, the put is simply ignored. If you want to change the value in the buffer, you have to explicitly clear() it first, then you can put a new value.

Split/Join Nodes

Split/join nodes have the sole purpose of combining or separating streams. They either convert separate messages into tuples of messages, or the reverse.



The join_node receives multiple inputs, packs them into a tuple, then forwards that tuple to its successors. There are a few different policies for the join’s buffering, as shown above and explained below. The buffering policy is set with a template argument.

Queueing Join


With a “queueing” join policy, the join node is given an unbounded FIFO queue for each input port. When each input port’s FIFO has at least one item in it, the inputs are removed from the queue, packed into a tuple, and broadcasted to its successors.
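For two inputs, the queueing behavior amounts to this (a plain C++ model of the semantics, not TBB code):

```cpp
#include <deque>
#include <optional>
#include <utility>

// Model of a two-input queueing join: each port has an unbounded FIFO,
// and a tuple is emitted as soon as both FIFOs are non-empty.
template <typename A, typename B>
class QueueingJoin {
public:
    void PutA(A a) { mA.push_back(std::move(a)); }
    void PutB(B b) { mB.push_back(std::move(b)); }

    // Returns the next joined pair, if both inputs have one queued.
    std::optional<std::pair<A, B>> TryGet() {
        if (mA.empty() || mB.empty()) return std::nullopt;
        std::pair<A, B> out{mA.front(), mB.front()};
        mA.pop_front();
        mB.pop_front();
        return out;
    }

private:
    std::deque<A> mA;
    std::deque<B> mB;
};
```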

Reserving Join


With a “reserving” join policy, the need for a FIFO is removed by using a “pull” approach rather than a “push” approach (as done in the queueing join.) When a predecessor puts to an input, the reserving join node marks that input as “potentially available”. When all inputs have been marked potentially available, the reserving join will try to reserve each input to see if they are still actually available. If not all inputs could be successfully reserved, all inputs are released.

If all inputs were successfully reserved, then they are packed into a tuple and broadcasted to successors. If at least one successor accepts the tuple, the input reservations are consumed. Otherwise, the inputs get released.

Since reservation is based on a pull approach, something weird happens: By design, the first push to a reserving join input port will always fail. Instead, this causes the port to switch to pull mode, which will then try_reserve the input from the predecessor. If all inputs can be reserved, they are try_consumed. For these reasons, you typically want the inputs to a reserving join to be buffer nodes (which support pulling).

For more details on reservation, see the official documentation.

Key Matching (and Tag Matching) Join


With a “key_matching” join policy, a user-provided hash function is applied to the messages to get their associated key, and the key/value pair is stored in a hash table. When a given key exists in a key/value pair in all input ports simultaneously, the key/value pairs are removed from the hash table, and their values are packed into a tuple which is broadcasted to the successors. If no successors accept the tuple, it is saved and will be forwarded on a subsequent try_get().

  • Unanswered question: What buffering scheme is used for the stored tuples? A buffer_node, or a queue_node, or what?

The “tag_matching” policy is a specialization of key_matching. The definition is simple:

typedef key_matching<tag_value> tag_matching;

The “tag_value” type is a simple typedef for uint64_t, so this just means you have to use a hash function that returns a uint64_t. That’s all.



The “split_node” gets a tuple as input, and hooks up each element of the tuple to a different output port. You can then connect each output port to a different successor.



The “indexer_node” receives inputs from multiple input ports and broadcasts them to all successors. Unlike join nodes, the indexer node does not wait for all input ports to have a value. Instead, a message is broadcasted whenever any input port receives an input. The successors receive a tagged union (aka variant) that says which of the indexer’s input ports the value came from, and lets you retrieve the value of the message.
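The tagged-union output can be pictured with std::variant, where the variant index plays the role of the source port (the function names here are made up, and TBB's actual tagged_msg also handles ports with identical types, which std::variant would not):

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Messages from either port land in one output stream; index() = source port.
using IndexerOutput = std::variant<int, std::string>;

std::vector<IndexerOutput> gOutputs;

void PutPort0(int v) { gOutputs.push_back(IndexerOutput{v}); }
void PutPort1(std::string v) { gOutputs.push_back(IndexerOutput{std::move(v)}); }
```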

Note: “indexer_node” was previously called “or_node”.
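The tagged-union behavior can be sketched with std::variant. This is an illustrative model, not TBB's API; the `Indexer` class and its port methods are invented for the example:

```cpp
#include <cstddef>
#include <functional>
#include <variant>
#include <vector>

// Sketch of indexer_node behavior (illustrative only): each input
// port forwards immediately, without waiting for the other ports, and
// successors receive a tagged union identifying the originating port.
using IndexedMsg = std::variant<int, float>;  // port 0 = int, port 1 = float

class Indexer {
    std::vector<std::function<void(const IndexedMsg&)>> successors_;
public:
    void add_successor(std::function<void(const IndexedMsg&)> s) {
        successors_.push_back(std::move(s));
    }
    // Broadcast as soon as any input arrives; the variant's index
    // records which port the value came from.
    void put_port0(int v)   { broadcast(IndexedMsg{std::in_place_index<0>, v}); }
    void put_port1(float v) { broadcast(IndexedMsg{std::in_place_index<1>, v}); }
private:
    void broadcast(const IndexedMsg& m) {
        for (auto& s : successors_) s(m);
    }
};
```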

Other Nodes

The TBB designers put some miscellaneous node types into this category.



The “broadcast_node” simply broadcasts its inputs to all its successors. There is no buffering in this node, so messages are forwarded immediately to all successors.



The “limiter_node” is similar to a broadcast_node, but it can only output a limited number of messages. Outputted messages only count toward this threshold if the message is accepted by at least one successor. Once the maximum number of messages has been outputted, the limiter no longer accepts inputs.

The counter can be decremented by calling try_put() on the limiter_node’s “decrement” member object. The decrementer is an object which receives continue_msg as an input. When triggered, it makes its parent limiter reduce its output count by 1, then attempts to get a message from a predecessor and forward it to the limiter’s successors. If the forwarding succeeds, the output count is increased by 1 again, returning it to its original value; if the forwarding fails, the output count stays reduced by 1.
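The counting rules can be sketched as follows. This is an illustrative model of the threshold logic, not TBB's implementation; the `Limiter` class is invented for the example, with a single successor for simplicity:

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Sketch of limiter_node counting behavior (illustrative only):
// messages count against the threshold only when a successor accepts
// them, and decrement() frees up one slot.
class Limiter {
    std::size_t count_ = 0;
    std::size_t threshold_;
    std::function<bool(int)> successor_;  // returns true if it accepts
public:
    Limiter(std::size_t threshold, std::function<bool(int)> s)
        : threshold_(threshold), successor_(std::move(s)) {}

    bool try_put(int v) {
        if (count_ >= threshold_) return false;  // at the limit: reject input
        if (!successor_(v)) return false;        // rejected messages don't count
        ++count_;
        return true;
    }
    // Modeled after try_put() on the "decrement" member: frees one slot.
    void decrement() { if (count_ > 0) --count_; }
};
```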

An example use of this node is building a pipeline. See:



The “async_node” is designed to communicate with systems outside TBB. The body of the async node receives a “gateway_type” as input, which you can pass to an external system. The gateway interface lets your external system try_put() to the async node’s successors. Also, the gateway interface lets you reserve and release. Reserving/releasing tells the flow graph that async work has started and ended (respectively). This lets the flow graph know when a wait_for_all() on this node should finish.

The maximum concurrency and the buffering policy of this node can be configured.
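The gateway protocol can be sketched as follows. This is an illustrative model of the reserve/try_put/release flow, not TBB's gateway_type API; the `Gateway` class and `submit_async` function are invented for the example, and the "external system" runs inline rather than on a device queue or separate thread:

```cpp
#include <vector>

// Sketch of the async_node gateway idea (illustrative only):
// reserve() tells the graph that async work is in flight, the
// external system hands results back with try_put(), and release()
// marks completion so a wait_for_all() could finish.
class Gateway {
    std::vector<int>& results_;  // stands in for the node's successors
    int inflight_ = 0;
public:
    explicit Gateway(std::vector<int>& r) : results_(r) {}
    void reserve()      { ++inflight_; }            // async work started
    void release()      { --inflight_; }            // async work finished
    void try_put(int v) { results_.push_back(v); }  // feed successors
    bool idle() const   { return inflight_ == 0; }
};

// The "external system": in real use this would run on another thread
// or a device queue; here it runs inline to keep the sketch simple.
void submit_async(Gateway& gw, int input) {
    gw.reserve();
    gw.try_put(input * 2);  // hand the result back to the graph
    gw.release();
}
```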



The “composite_node” allows you to wrap multiple nodes into a single node with defined inputs and outputs. This is done by calling set_external_ports(), which lets you specify the types of the inputs and outputs of your composite node.


The kernel arguments and range can be read from input ports

The “streaming_node” is pretty special. It’s designed for streaming the submission of kernels and data to devices through queues, used to integrate heterogeneous systems with streaming hardware (like GPUs) into TBB flow graphs. Using it requires some configuration in order to properly pass data and kernels to the device. Intel provides such a prepackaged configuration in “opencl_node”, see this overview.