Principles of Compute (Part 1)

Introduction

The free lunch is over, and all software engineers need to understand how to write parallel code if they want to continue seeing performance improvements with future CPUs. Similarly, we have to think parallel if we want to take advantage of massively parallel computation resources like GPUs or FPGAs.

To fully exploit these hardware resources, people started designing so-called compute languages. For example, the CUDA language allows you to write massively parallel programs to run on NVIDIA GPUs, and the OpenCL language allows you to write massively parallel programs that can efficiently target CPUs, GPUs, and FPGAs. Graphics APIs like OpenGL and DirectX also allow you to write massively parallel geometry and pixel processing programs that run on GPUs, and on CPUs too with software renderers like WARP.

In recent years, OpenGL and DirectX have embraced so-called “compute shaders”, which allow general-purpose compute programs to run alongside the traditional geometry and pixel processing programs. Compute shaders brought significant innovation to renderer design, and are now an integral part of advanced rendering engines.

So Why Do We Care?

Let’s take a step back and think. Everything we’re talking about here is strictly for the purpose of optimization. Compute programs don’t let you implement logic that wasn’t possible to implement previously. At best, compute languages let you implement the same logic more efficiently, at the likely cost of more development time. Of course, performance is still very important. If we didn’t care about performance, we wouldn’t be using GPUs at all, but if you’re a gamer or a game developer, you probably know how important high-performance GPUs are for 3D gaming graphics to work at enjoyable frame rates.

The reason I’m emphasizing the importance of optimization is that writing efficient programs is highly dependent on your understanding of the relevant computer hardware. We’ve entered the domain of embedded systems, meaning that we are designing software systems with the intention of running them on a specific kind of hardware. As a software engineer, you are solving a problem (“I want to make such-and-such video game”), and you must solve this problem with a hybrid hardware-software solution. Hardware details are more than just a good conversation starter at software development conferences… Understanding the hardware is your job, and it’s the reason employers will pay top dollar to borrow your brain. It’s the argument of kings: This is embedded systems. We care.

Why Read This Article

The goal of this article is to give you an intuitive understanding of how compute works. Instead of memorizing arbitrary OpenGL/OpenCL/DirectX/CUDA terminology, we’ll put ourselves in the shoes of somebody who wants to write programs that fully exploit hardware resources, and we’ll see how this leads to the design of the compute languages we know and love today.

By understanding the principles behind the APIs, you can mostly spare yourself from having to memorize API details at all. You can make your design on paper, independently of whether you’re using OpenGL or DirectX or anything else, then translate it into the appropriate API-specific syntax when the time comes to implement it.

Furthermore, if you try to understand compute through only (e.g.) OpenGL’s documentation, you might find it very difficult. The specifications of compute languages are vague by nature, since they leave generous flexibility so they can be implemented on widely different hardware. This vagueness makes the specifications hard to understand… unless you already have a lot of background knowledge about the underlying hardware, in which case you probably barely need to read the specification in the first place. To deal with this vagueness, this article tries to teach the background knowledge that the specifications otherwise omit.

Finally, it’s important to realize that contemporary compute APIs like OpenGL and DirectX are evolving at a fast pace. In 5 years, compute won’t look the same. In 10 years, compute capabilities will have evolved significantly in flexibility and performance. In 20 years, what we’re doing now will look primitive and uncivilized. If you’re reading this now, you’re probably going to see this evolution first-hand. Maybe you’ll be a user, and maybe you’ll be an implementer too. Either way, you have the potential to shape the future, so let’s think about how things should be rather than how things are.

Scaling with Hardware

Previously, developers could write non-parallel (scalar) code, and see their programs run significantly faster with every new generation of hardware. Unfortunately, as CPU designers run up against the physical limits of the universe, it has become more cost-efficient to, for example, create a CPU with two cores at the same (or lower) frequency, rather than trying to create a CPU with a single core at double the frequency.

There is an obvious trend here: hardware designers are replicating hardware resources rather than trying to speed up existing ones. As before, we would like to write programs whose performance improves with each new generation of hardware. To do this, we now have to write programs whose performance improves as hardware resources become more and more replicated.

Replicated hardware resources allow us to improve the performance of our programs, but only in a specific kind of way, and seeing this requires a shift in mindset. Consider, for example, using a binary search tree or a hash table to optimize a linear search. Thinking about data structures that improve search time is certainly important, but we also have to consider if and how these data structures would speed up the program if the user ran it with additional hardware resources.

In contrast to single-threaded optimizations, scaling with additional hardware requires distributing the data among the available hardware resources. For example, with replicated comparison circuits, we can accelerate a search by doing more than one comparison in parallel. With replicated arithmetic hardware, we can compute the hash codes for more than one search query simultaneously. With replicated hardware for memory access, we can accelerate access to the data structure we’re searching. What these optimizations have in common is that they distribute the data of the input domain among hardware resources, an approach known as data-parallelism. Identifying ways to run your code in a data-parallel way is the key to writing code that scales with hardware.

The Red Herring: System-on-a-Thread

In the earlier days of game engine programming, developers thought about scaling with hardware by associating software systems with cores. For example, a game programmer might have decided to exploit a dual-core CPU by running the graphics system on one core and the audio system on the other, creating one operating system thread for each system. This so-called “system-on-a-thread” solution has some serious scalability problems. If the audio thread finishes processing one frame of audio before the graphics thread finishes processing one frame of video, then the audio thread has to sit idle while waiting for the graphics thread to catch up. During that time, only one of the two CPU cores is active, which means that computation power is left on the table.

Idle cores aren’t the only problem with system-on-a-thread. What happens if you want to add a physics system to the game engine, and the physics system also wants to run on its own thread? Now we have 3 system threads but only 2 CPU cores, a situation known as over-subscription: there are more threads than cores. This is undesirable, since the CPU cores become resources that the threads contend for.

One core can only run one thread at a time, so the operating system will need to share each core’s running time among the threads assigned to it. When the operating system switches a core from running one thread to running another thread, an expensive context-switch happens. In the worst case, your program might spend more time context-switching between threads than doing useful work.

There is an important lesson to learn from this: Additional parallelism in software can only be useful if it can be efficiently allocated to hardware resources. When it comes down to it, software systems can’t use more circuits than what physically exists on the chip. Only hardware designers have the power to add additional circuits, and as a software designer you have to work with what you’re given.

The problems with system-on-a-thread underline the importance of data-parallelism. If we identify general ways of splitting up the data we’re working on, then we can efficiently distribute that work among the available hardware, regardless of how much hardware is actually there.

Designing Parallel Programs

To design data-parallel programs, we need to understand what can and can’t be made parallel. As implied by the term “data-parallelism”, the goal is to divide the work over the data being processed into pieces that can be executed independently, which allows those pieces to run in parallel. By designing work that runs independently of other work, we guarantee that the execution of one piece of work won’t affect the results of another. This is important, since work that executes in parallel will likely not execute in a consistent order.

The Problem with Threads

Let’s take a moment to appreciate the problems that arise when parallel code doesn’t run in a consistent order.

Consider the following example program:

/* Thread 1 */           |  /* Thread 2 */
a = 1;                   |  a = 0;
printf("a == %d\n", a);  |

This program is made up of two threads. The order in which the two threads execute their code is unpredictable, since it depends on the inner workings of the C compiler, the operating system and the hardware.

In one run of the program, the execution might happen in the following order:

a = 0;
a = 1;
printf("a == %d\n", a); // -> prints "a == 1"

In this case, Thread 2 ran its first line of code first, then Thread 1 ran to completion. As a result, the program finished with “a” set to 1.

In another run of the program, the execution might happen in this different order:

a = 1;
a = 0;
printf("a == %d\n", a); // -> prints "a == 0"

In this case, Thread 1 ran its first line of code, then Thread 2 ran its first line of code, then Thread 1 ran its second line of code. As a result, the program finished with “a” set to 0.

What this short example shows is that multi-threaded programs can produce different results based on the order in which the code executes on different threads. This unpredictability is bad news for programmers, since it makes the code much harder to understand. To quote Edward A. Lee’s “The Problem with Threads”:

a folk definition of insanity is to do the same thing over and over again and expect the results to be different. By this definition, we in fact require the programmers of multi-threaded systems be insane. Were they sane, they could not understand their programs.

When the order of execution of different threads changes the results of the program, we call that a “race condition”: the results depend on whether one thread reaches a certain point of execution before another. If we want to resolve a race condition, we can identify synchronization points in the code, which can be implemented using operating system synchronization objects like mutexes, semaphores, condition variables, critical sections, interlocked (atomic) operations, and so on.
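
For illustration, here is a minimal sketch (not part of the original example) of what such a synchronization point can look like in C++, using a std::mutex to guard the shared variable from the two-thread example above. Because the write and the printf in thread 1 happen inside a single critical section, the printed value is always 1, although the final value of “a” after both threads finish still depends on which thread runs last.

// A minimal sketch of a synchronization point using std::mutex.
// The lock makes each access to the shared variable a well-defined,
// non-torn operation; it does not by itself make the thread ordering
// deterministic.
#include <cstdio>
#include <mutex>
#include <thread>

int a = 0;
std::mutex a_mutex;

void thread_1()
{
    std::lock_guard<std::mutex> lock(a_mutex);
    a = 1;
    std::printf("a == %d\n", a);
}

void thread_2()
{
    std::lock_guard<std::mutex> lock(a_mutex);
    a = 0;
}

int main()
{
    std::thread t1(thread_1);
    std::thread t2(thread_2);
    t1.join();
    t2.join();
    return 0;
}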

The synchronization objects I just described are very interesting to study and extremely useful, but they also all have relatively high execution costs. Instead, we should aim to avoid needing synchronization at all, and this can be achieved if we design algorithms that avoid race conditions entirely. To do this, we first need to know how to formally identify race conditions.

Identifying Race Conditions

As we saw in the previous section, running two functions in parallel may result in race conditions, which can lead to programs with indeterminate results. There are two main culprits behind this race condition:

  • Shared memory
  • Side-effects

Shared memory was a problem in the previous example, since the race condition we observed was partly caused by the fact that the memory for the variable “a” was shared between the two threads. If the code was re-designed to make the variable “a” only exist in one thread (as opposed to being shared between the two threads), then it would be impossible for the variable to change due to the execution of a different thread, since those other threads simply don’t have access to it.
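
As a quick sketch of such a re-design (hypothetical, written in the same style as the earlier snippet), each thread gets its own private variable, so neither thread can affect the other’s result:

/* Thread 1 */             |  /* Thread 2 */
int a1 = 1;                |  int a2 = 0;
printf("a1 == %d\n", a1);  |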

Along with shared memory, the second cause of the race condition in the earlier example was that each thread had side-effects: each thread wrote to the shared variable. To see the contrast, consider the following program:

/* Thread 1 */           |  /* Thread 2 */
b = a;                   |  c = a;
b += 1;                  |  c -= 1;

In this program, both threads have access to the variable “a” through shared memory. However, unlike the earlier program with the race condition, the two threads only read the value of “a” and never write to it. Since the shared memory is read-only, there is no race condition: regardless of the order in which the lines of code of the two threads are executed, the final values of “a”, “b” and “c” are always the same.

Based on these observations, we can make a powerful statement: Threads may share memory without race conditions and without synchronization if the memory they share is read-only.

The fact that memory can be safely shared between threads if it is read-only is very interesting from the perspective of a programming language designer. For example, the Haskell programming language is designed so that every variable is read-only after being written for the first time. Since this prevents functions from having side-effects, memory can be shared freely and Haskell programs lend themselves to a high degree of parallelization. Haskell has other problems that arguably make it hard to use in real-life code, but it’s interesting and eye-opening to study for programmers of all walks of life.

Despite the elegance of languages like Haskell, our day-to-day work in languages like C requires us to both read and write shared memory. If we want to avoid race conditions in our programs, we must ensure that these reads and writes happen in a consistent order. To do this, we must identify and handle the data hazards that arise when shared memory is both read and written. For example, if one thread writes to shared memory and another thread then reads it, that read should see the value that was just written; in other words, the reader must wait for the writer to finish. If the reader doesn’t wait, it might read the previous value of that memory instead of the new one. Even worse, the reader might read data that is a Frankenstein-like mix of the previous value and the new value, also known as a “torn read”.

Since data hazards are concerned with the ordering of reads and writes, we can exhaustively enumerate the possible scenarios:

  1. Read-After-Read (RAR)
  2. Read-After-Write (RAW)
  3. Write-After-Read (WAR)
  4. Write-After-Write (WAW)

The case of Read-After-Read is the one scenario where there is no data hazard. As described previously, there are no race conditions when shared memory is only read and never written.

In the case of Read-After-Write, it’s important for the reader to wait for the writer to finish, otherwise the reader might not see the updated value, or might get a torn read, as was explained earlier.

In the case of Write-After-Read, it’s important for the writer to wait for previous readers to finish, since otherwise the readers might see the new value instead of the old value they were supposed to see. If the writer doesn’t wait for readers to finish, it’s also possible for the writer to cause a torn read by overwriting a value while a reader is only halfway done reading it.

Finally, in the case of Write-After-Write, it’s important that the final value of the data reflects the second write instead of the first write. If this relationship isn’t enforced, successive reads will see the old value (from the first write) instead of the new value (from the second write). Therefore, a writer should wait for previous writers to finish before doing their own write.

Now that we’ve identified all possible data hazards that can cause race conditions, we can more formally identify race conditions in multi-threaded code. Awareness of these data hazards will allow us to design algorithms that avoid them when possible. Furthermore, when we can’t avoid a data hazard, we’ll know that we need to use some kind of synchronization to make sure reads and writes happen in the right order.

Designing Data-Parallel Algorithms

With the knowledge of data hazards, we can now think about designing algorithms that avoid them. This section reviews a variety of methods useful for designing data-parallel algorithms.

Independent Uniform Work

As an initial example, let’s consider the following algorithm:

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++)
    {
        data[i] = data[i] * 2;
    }
}

Running this code in parallel is fairly easy. Each iteration of the loop is independent of the others, so we can simply distribute the “n” iterations equally among the hardware’s available resources, and that should give ideal hardware usage. This works because no memory is shared between different iterations of the loop (each iteration reads and writes only its own element), which means that no synchronization is necessary and the code can be trivially parallelized. We can parallelize the code among the CPU’s cores with an API like OpenMP, as follows:

void times_two(float* data, int n)
{
    // Runs the for loop below in parallel, distributed among CPU cores.
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        data[i] = data[i] * 2;
    }
}

By the way, OpenMP is a C/C++ compiler extension that lets you parallelize code by annotating it with pragmas like the one above. OpenMP is built into most C++ compilers these days, including GCC, Clang, and Visual C++. You can enable it by passing a compiler flag on the command line (-fopenmp for GCC and Clang, /openmp for Visual C++) or by setting a project property in Visual Studio, so feel free to experiment with it.

Non-Independent Uniform Work

Let’s consider a slightly more challenging example: an algorithm that “blurs” an array. As a reminder, a basic box blur (e.g. on an image) is implemented by computing the average of a pixel and its neighbors. In this case, we’ll take the average of a float and its 2 neighbors in the 1D array (the float before and the float after).

If we wanted to blur a single float in the data array, we might do something as follows (ignoring bounds checking):

    data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;

The code above cannot simply be used as the body of a loop over the whole array, because the iterations are not independent of each other. The order in which the iterations are executed changes the output of the algorithm, since the “data[i]” written by one iteration will be read by its neighbors, changing the result of the neighbors’ computations. If we write to data[i] only after the reads from data[] have been done for all iterations, then the answer will be correct, since the writes will not interfere with the reads. We can see this problem as a Write-After-Read data hazard, since we need to wait for all the reads from data[] to complete before we do any writes to it.

The Write-After-Read data hazard happens in this example because we’re trying to read and write the same memory locations within a single pass. To remove the data hazard, we can use an additional memory allocation: by creating a separate array to store the results, there are no more data hazards. The input array is read-only, and each output[i] is written by only one iteration, so the code can once again be easily distributed among available hardware resources in parallel.

Note 1: It is assumed that the “input” and “output” arrays are not overlapping in memory. For example, if both pointers pointed to the same memory, we would have the exact same problem as before.

Note 2: For readability, the bounds-checking for the beginning and end of the array is omitted. If necessary, this could be fixed by handling the first and last iterations as special cases, or by making the input[] array bigger by adding one element before the beginning and one after the end. (A sketch of one such variation appears after the code below.)

void blur(const float* input, float* output, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        output[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0f;
    }
}
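
As a sketch of one way to deal with the borders mentioned in Note 2, the neighbor indices can be clamped to the valid range. This is a hypothetical variation (the name blur_clamped is made up here), not necessarily how you would want to handle the borders in practice:

// Hypothetical variant of blur() that handles the borders by clamping the
// neighbor indices to [0, n - 1]. Assumes n >= 1.
void blur_clamped(const float* input, float* output, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        int left  = (i == 0)     ? 0     : i - 1;
        int right = (i == n - 1) ? n - 1 : i + 1;
        output[i] = (input[left] + input[i] + input[right]) / 3.0f;
    }
}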

Ping-ponging

To take the blur example one step further, let’s consider how we can make a “blurrier” result by blurring the results of the blur. In other words, we want to run the blur algorithm multiple times on the array in a loop. We can easily implement this using the blur function defined previously, by alternating between which buffer is used as input and which buffer is used as output. This could be done as follows:

void blur_4_times(float* input, int n)
{
    // note: assuming borders are handled properly inside blur()
    float* tmp = new float[n];

    float* read_buffer = input;
    float* write_buffer = tmp;

    for (int blur_i = 0; blur_i < 4; blur_i++)
    {
        blur(read_buffer, write_buffer, n);
        std::swap(read_buffer, write_buffer);
    }

    delete[] tmp;
}

This code alternates the roles of the two buffers: after each pass, the buffer that was just written to becomes the buffer to read from. This is known as “ping-ponging”, because the data bounces back and forth between the two buffers like a ping-pong ball bouncing between players on a ping-pong table. 🙂 Note that because the number of passes here is even, the final result ends up back in the input buffer; with an odd number of passes, it would end up in tmp, which gets deleted, so the pass count matters.

Reduction

In the next example, we’ll be looking at how to implement a reduction. A reduction is an operation that turns a set of values into a single value, like for example computing the minimum element in an array. Computing the minimum of an array sequentially without any multi-threading is pretty darn easy. For reference, it might look something like this:

float minimum(const float* data, int n)
{
    // result is undefined if the array passed in is empty
    assert(n >= 1);

    float curr_minimum = data[0];
    
    for (int i = 1; i < n; i++)
    {
        if (data[i] < curr_minimum) {
            curr_minimum = data[i];
        }
    }

    return curr_minimum;
}

In the algorithm above, “curr_minimum” is potentially read and written by every iteration of the loop, meaning there is a data hazard between the iterations. This data hazard prevents us from safely running the loop in parallel. To find the minimum in parallel, we can instead implement the algorithm in multiple passes. The big idea is to split the array into many smaller arrays, then find the minimum of each of these smaller arrays in parallel. Once the minimums of the smaller arrays are computed, we can find the minimum of these minimums. This subdivision of the work can be applied repeatedly in order to implement a highly parallel minimum algorithm, as shown below.

float minimum(const float* data, int n)
{
    assert(n >= 1);
    assert(is_power_of_two(n)); // for simplicity

    float* mins = new float[n];
    memcpy(mins, data, sizeof(float) * n);

    int curr_n = n;

    for (int pass = 0; pass < log2(n); pass++)
    {
        // Each iteration reads mins[i] and mins[i + curr_n / 2] and writes mins[i],
        // so no iteration writes a location that another iteration reads.
        // (Pairing neighbors in place as mins[2 * i] and mins[2 * i + 1] would
        // create a Write-After-Read hazard between iterations of the parallel loop.)
        #pragma omp parallel for
        for (int i = 0; i < curr_n / 2; i++)
        {
            mins[i] = std::min(mins[i], mins[i + curr_n / 2]);
        }

        curr_n = curr_n / 2;
    }

    float result = mins[0];

    delete[] mins;

    return result;
}

In the example above, the amount of computation shrinks by half with each pass, since each pass computes the minimum of a pair of inputs from the previous pass. The number of passes can be lowered if each iteration of the inner loop computes the minimum of more than 2 values, so you can trade off the number of passes against the amount of serial work per pass. This is something you need to tune based on the hardware you’re working with and the size of your input.
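
For instance, here is a hedged sketch of that tradeoff (the chunk size and the name minimum_chunked are made up for illustration, and n is assumed to be a multiple of the chunk size): each parallel iteration reduces an entire chunk serially, so the whole reduction takes only two passes instead of log2(n).

// Sketch of a two-pass reduction: more serial work per iteration, fewer passes.
// Assumes n is a multiple of CHUNK; uses <algorithm> for std::min and
// <cassert> for assert.
#include <algorithm>
#include <cassert>

float minimum_chunked(const float* data, int n)
{
    const int CHUNK = 1024;            // serial work per parallel iteration
    assert(n >= CHUNK && n % CHUNK == 0);

    const int num_chunks = n / CHUNK;
    float* chunk_mins = new float[num_chunks];

    // Pass 1: one parallel iteration per chunk, each doing CHUNK - 1 comparisons.
#pragma omp parallel for
    for (int c = 0; c < num_chunks; c++)
    {
        float m = data[c * CHUNK];
        for (int i = 1; i < CHUNK; i++)
        {
            m = std::min(m, data[c * CHUNK + i]);
        }
        chunk_mins[c] = m;
    }

    // Pass 2: reduce the per-chunk minimums serially (num_chunks is small).
    float result = chunk_mins[0];
    for (int c = 1; c < num_chunks; c++)
    {
        result = std::min(result, chunk_mins[c]);
    }

    delete[] chunk_mins;
    return result;
}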

Conclusion/To Be Continued

In this post, we established the motivation for designing data-parallel algorithms, based on the idea that we need to write software that scales in performance as more hardware (cores, ALUs, etc.) is added to processors. From there, we explored the concept of data hazards, showing how they are an obstacle to the implementation of data-parallel algorithms. Finally, we introduced a few common ways to implement data-parallel algorithms.

In the next post, I’ll talk about how hardware (CPUs, GPUs, etc.) can be designed to enable the rapid execution of data-parallel algorithms. Once we understand how data-parallel programs work from both the conceptual side and the hardware side, we’ll be able to talk about designing programming languages that allow efficient implementation of data-parallel algorithms based on principles that apply to a wide variety of hardware.
