D3D12 Multi-Adapter Survey & Thoughts

Introduction

Direct3D 12 opens up a lot of potential by making it possible to write programs that make use of multiple GPUs. For example, it’s possible to distribute work across linked GPUs (eg: NVIDIA SLI or AMD Crossfire), or even across GPUs from different hardware vendors.

There are many ways to make use of these multi-adapter features, but it’s not obvious yet (at least to me) how best to make use of them. In theory, we should try to make full use of all available hardware on a given computer, but there are difficult problems to solve along the way. For example:

  • How can we schedule GPU tasks to minimize communication overhead between different GPUs?
  • How can we distribute tasks among hardware that varies in performance?
  • How can we use special hardware features? eg: “free” CPU-GPU memory sharing on integrated GPUs.

D3D12 Multi-Adapter Features Overview

To better support multiple GPUs, Direct3D 12 brings two main features:

  1. Cross-adapter memory, which allows one GPU to access the memory of another GPU.
  2. Cross-adapter fences, which allow one GPU to synchronize its execution with another GPU.

Working with multiple GPUs in D3D12 is done explicitly, meaning that sharing memory and synchronizing GPUs must be taken into consideration by the rendering engine, as opposed to being “automagically” done inside GPU drivers. This should lead to more efficient use of multiple GPUs. Furthermore, integrating shared memory and fences into the API allows you to avoid making round-trips to the CPU to interface between GPUs.
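As a rough illustration of the cross-adapter fence feature, here’s a minimal sketch for the MDA case: a fence created on one device is exported and opened on a second device, so one GPU can wait on the other’s work without a CPU round-trip. The deviceA/deviceB/queueA/queueB pointers are assumed to already exist, and error handling is omitted.

ID3D12Fence* fenceA = nullptr;
ID3D12Fence* fenceB = nullptr;
deviceA->CreateFence(0, D3D12_FENCE_FLAG_SHARED | D3D12_FENCE_FLAG_SHARED_CROSS_ADAPTER,
                     IID_PPV_ARGS(&fenceA));

// Export the fence from device A and import it on device B.
HANDLE fenceHandle = nullptr;
deviceA->CreateSharedHandle(fenceA, nullptr, GENERIC_ALL, nullptr, &fenceHandle);
deviceB->OpenSharedHandle(fenceHandle, IID_PPV_ARGS(&fenceB));

// GPU A signals when its work is done; GPU B waits on its own queue before consuming the results.
queueA->Signal(fenceA, 1);
queueB->Wait(fenceB, 1);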

For a nice quick illustrated guide to the features described above, I recommend the following article by Nicolas Langley: Multi-Adapter Support in DirectX 12.

D3D12 supports two classes of multi-adapter setups:

  1. Linked Display Adapters (LDA) refers to linked GPUs (eg: NVIDIA SLI/AMD Crossfire). They are exposed as a single ID3D12Device with multiple “nodes”. D3D12 APIs take a bitmask of nodes wherever you need to say which node to use, or which nodes should share a resource.
  2. Multiple Display Adapters (MDA) refers to multiple different GPUs installed on the same system. For example, you might have both an integrated GPU and a discrete GPU in the same computer, or you might have two discrete GPUs from different vendors. In this scenario, you have a different ID3D12Device for each adapter.

Another neat detail of D3D12’s multi-adapter features is Standard Swizzle, which allows GPU and CPU to share swizzled textures using a convention on the swizzled format.

Central to multi-adapter code is the fact that each GPU node has its own set of command queues. From the perspective of D3D12, each GPU has a rendering engine, a compute engine, and a copy engine, and these engines are fed through command queues. Using multiple command queues can help the GPU schedule independent work, especially in the case of copy or compute queues. It’s also possible to tweak the priority of each command queue, which makes it possible to implement background tasks.
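For concreteness, here’s a minimal sketch (the device pointer and nodeIndex are assumptions) of creating one queue per engine type on a given node. The NodeMask field is what ties a queue to a particular GPU in the LDA case, and Priority is where background work could be demoted.

D3D12_COMMAND_QUEUE_DESC desc = {};
desc.NodeMask = 1u << nodeIndex;                     // which LDA node this queue feeds
desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL; // could be _HIGH for latency-sensitive work

ID3D12CommandQueue* directQueue  = nullptr; // "rendering engine"
ID3D12CommandQueue* computeQueue = nullptr; // "compute engine"
ID3D12CommandQueue* copyQueue    = nullptr; // "copy engine"

desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;  device->CreateCommandQueue(&desc, IID_PPV_ARGS(&directQueue));
desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));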

Use-Cases for Multi-Adapter

One has to wonder who can afford the luxury of owning multiple GPUs in one computer. Considering that multi-adapter wasn’t properly supported before D3D12, it was probably barely worth thinking about, other than scenarios explicitly supported by SLI/Crossfire. In this section, I’ll try to enumerate some scenarios where the user might have multiple GPUs.

“Enthusiast” users with multiple GPUs:

  • Linked SLI/Crossfire adapters.
  • Heterogeneous discrete GPUs.
  • Integrated + discrete GPU.

“Professional” users:

  • Tools for 3D artists with fancy computers.
  • High-powered real-time computer vision equipment.

“Datacenter” users:

  • GPU-accelerated machine-learning.
  • Engineering/physics simulations (fluids, particles, erosion…)

Another potentially interesting idea is to integrate CPU compute work in DirectX by using the WARP (software renderer) adapter. It seems a bit unfortunate to tie everyday CPU work into a graphics API. I guess it might lead to better CPU-GPU interop, or it might open opportunities to experiment with moving work between CPU and GPU and see performance differences. This is similar to using OpenCL to implement compute languages on CPU.

Multi-adapter Designs

There are different ways to integrate multi-adapter into a DirectX program. Let’s consider some options.

Multi-GPU Pipelining

Pipelining with multiple GPUs comes in different flavors. For example, Alternate Frame Rendering (AFR) consists of alternating between GPUs with each frame of rendering, which allows multiple frames to be processed on-the-fly simultaneously. This kind of approach generally requires the scene you’re rendering to be duplicated on all GPUs, and requires outputs of one frame’s GPU to be copied to the inputs of the next frame’s GPU.

AFR can unfortunately limit your design. For example, dependencies between frames can be difficult to implement efficiently. To solve this problem, instead of pipelining at the granularity of frames with AFR, one might pipeline within a frame. For example, half of the frame can be processed on one GPU, then finished on another GPU. In theory, these pipelining approaches should increase throughput, while possibly increasing latency due to the extra overhead of copying data between GPUs (between stages of the pipeline). For this reason, we have to be careful about the overhead of copies.

A great overview of multi-adapter, AFR, and frame pipelining was given in Juha Sjöholm’s GDC 2016 talk: Explicit Multi GPU Programming with DirectX 12
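To make the AFR idea a bit more concrete, here’s a hypothetical sketch for the LDA case, where the node that renders is simply rotated every frame. The per-node queue and command list arrays are assumptions (one per node, each list recorded via CreateCommandList with the matching node mask), and the cross-frame copies and fences are only hinted at in comments.

void SubmitFrameAFR(UINT frameIndex, UINT nodeCount,
                    ID3D12CommandQueue* const* queuePerNode,
                    ID3D12GraphicsCommandList* const* listPerNode)
{
    UINT node = frameIndex % nodeCount; // alternate which GPU renders this frame

    // The list for this node was recorded from CreateCommandList(1 << node, ...),
    // and the scene data it reads was duplicated on every node up front.
    ID3D12CommandList* lists[] = { listPerNode[node] };
    queuePerNode[node]->ExecuteCommandLists(1, lists);

    // Any output needed by the next frame's node (e.g. temporal history) would be
    // copied across and synchronized with a shared fence before that frame runs.
}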

Task-Parallelism

With a good data-parallel division of our work, we can in theory easily split the work into tasks, then distribute them among GPUs. However, there’s fundamentally a big difference in the ideal granularity of parallelism between low-latency (real-time) users and high-throughput (offline) users. For example, work that can be done in parallel within one frame is not always worth running on multiple GPUs, since the overhead of communication might nullify the gains. In general:

  • Real-time programs don’t have much choice outside of parallelism within one frame (or a few frames), since they want to minimize latency, and they can’t predict future user controller inputs anyways.
  • Offline programs might know the entire domain of inputs ahead of time, so they can arbitrarily parallelize without needing to use parallelism within one frame.

If our goal is to render 100 frames of video for a 3D movie, we could split those 100 frames among the available GPUs and process them in parallel. Similarly, if we want to run a machine learning classification algorithm on 1000 images, we can also probably split that arbitrarily between GPUs. We can even deal with varying performance of available GPUs relatively easily: Put the 1000 tasks in a queue, and let GPUs pop them and process them as fast as they allow, perhaps using a work-stealing scheduler if you want to get fancy with load-balancing.
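A minimal sketch of that shared queue idea (plain C++, all names made up): each GPU worker pops the next frame or image index and processes it at its own pace, so faster GPUs naturally end up taking more tasks.

#include <atomic>

struct TaskQueue
{
    std::atomic<int> next{0};
    int count = 0;

    // Returns the next task index, or -1 once all tasks have been handed out.
    int pop()
    {
        int i = next.fetch_add(1, std::memory_order_relaxed);
        return i < count ? i : -1;
    }
};

void GpuWorkerLoop(TaskQueue& queue /*, per-GPU device/queue context */)
{
    for (int task = queue.pop(); task != -1; task = queue.pop())
    {
        // RenderFrame(task) or ClassifyImages(task) on this worker's GPU.
    }
}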

In the case of a real-time application, we’re motivated to use parallelism within each frame to bring content to the user’s face as fast as possible. To avoid the overhead of communication, we might be motivated to split work into coarse chunks. Allow me to elaborate.

Coarse Tasks

To minimize the overhead of communication between GPUs, we should try to run large independent portions of the task graph on the same GPU. Parts of the task graph that run serially are an obvious candidate for running on only one GPU, although you may be able to pipeline those parts.

One way to separate an engine into coarse tasks is to split it up by purpose. For example, you might separate your project into a GUI rendering component, a fluid simulation component, a skinning component, a shadow mapping component, and a scene rendering component. From there, you can roughly allocate each component to a GPU. Splitting code among high-level components seems like an obvious solution, but I’m worried that we’ll get similar problems to the “system-on-a-thread” design for multi-threading.

With such a coarse separation of components, we have to be careful to allocate work among GPUs in a balanced way. If we split work uniformly among GPUs with varying capabilities, then we can easily be bottlenecked by the weakest GPU. Therefore, we might want to again put our tasks in a queue and distribute them among GPUs as they become available. In theory, we can further mitigate this problem with a fork/join approach. For example, if a GPU splits one of its tasks in half, then a more powerful GPU can pick up the second half of the problem while the first half is still being processed by the first GPU. This approach might work best on linked adapters, since they can theoretically share memory more efficiently.

An interesting approach to load-balancing can be found in GPU Pro 7 chapter 5.4: “Semi-static Load Balancing for Low-Latency Ray Tracing on Heterogeneous Multiple GPUs”. It works by roughly splitting the framebuffer among GPUs to ray trace a scene, and alters the distribution of the split dynamically based on results of previous frames.
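As a rough sketch of that kind of scheme (not the chapter’s exact algorithm), suppose two GPUs split the framebuffer at a ratio split, and we measure how long each GPU took last frame; we can then nudge the split toward the ratio that would have balanced them:

// split = fraction of the framebuffer given to GPU 0 (0..1).
float UpdateSplit(float split, float msGpu0, float msGpu1)
{
    // Estimate each GPU's throughput in "fraction of framebuffer per millisecond".
    float throughput0 = split / msGpu0;
    float throughput1 = (1.0f - split) / msGpu1;

    // The balanced split gives each GPU work proportional to its throughput.
    float balanced = throughput0 / (throughput0 + throughput1);

    const float blend = 0.5f; // move only part of the way to damp oscillations
    return split + blend * (balanced - split);
}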

One complication of distributing tasks among GPUs is that we might want to run a task on the same GPU each frame, to avoid having to copy the task’s input state over to run it on a different GPU. I’m not sure if there’s an obvious solution to this problem; maybe it’s just something to integrate into a heuristic cost model for the scheduler.

A Note On Power

One quite difficult problem with multi-adapter has to do with power. If a GPU is not used for a relatively short period of time, it’ll start clocking itself down to save power. In other words, if you have a GPU that runs a task each frame then waits for another GPU to finish, it’s possible for that first GPU to start shutting itself down. This becomes a problem on the next frame, since the GPU will have to spin up once again, which takes a non-trivial amount of time. The end result is that the code runs slower on multi-adapter than it does on a single adapter, despite even the most obvious opportunities for parallelism.

One might suggest forcing the GPU to keep running at full power to solve this problem. It’s not so simple, since power spent keeping idle cores clocked up is power taken away from the cores that need it. This is especially an issue on integrated GPUs, since the GPU would steal juice from the CPU, despite the CPU probably needing that power to run non-GPU code during the rest of the frame. Of course, power-hungry applications are also generally not welcome on battery-operated devices like laptops or phones.

Does this problem have a solution? Hard to say! As a guideline, it might be important to use GPUs only if you plan to utilize them well, and be careful about CPU-GPU tradeoffs on integrated GPUs. We might need help from hardware and OS people to figure this out properly.

NUMA-aware Task Scheduling

An important challenge of multi-adapter code is that memory allocations have an affinity to a given processor, which means that the cost of memory access increases dramatically when the memory does not belong to the processor accessing it. This scenario is known as “Non-uniform memory access”, aka “NUMA”. It’s a common problem in heterogeneous and distributed systems, and is also a well-known problem in server computers that have more CPU cores than a single motherboard socket can support, which results in multi-socket CPU configurations where each socket/CPU has a set of RAM chips closer to it than the others.

There exist some strategies to deal with scheduling tasks in a NUMA-aware manner. I’ll list some from the literature.

Deferred allocation is a way to guarantee that output memory is local to the NUMA node. It simply consists of allocating the output memory only at the time the task is scheduled, which lets the processor the task was scheduled on perform the allocation right then and there in its local memory, thus guaranteeing locality.

Work-pushing is a method to select a worker to which a task should be sent. In other words, it’s the opposite of work-stealing. The target worker is picked based on a choice of heuristic. For example, the heuristic might try to push tasks to the node that owns the task’s inputs, or to the node that owns the task’s outputs, or it might combine ownership of inputs and outputs in its decision.

Work-stealing can also be tweaked for NUMA purposes, by adjusting the work-stealing algorithm to steal work from nearby NUMA nodes first. This might apply naturally to the case of sharing work between linked adapters.
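As a small illustration of a work-pushing heuristic (entirely hypothetical types and names), a scheduler could push each task to the node that already owns the largest share of the bytes the task touches:

#include <cstddef>
#include <vector>

struct Buffer { int ownerNode; size_t sizeInBytes; };

int PickNodeForTask(const std::vector<Buffer>& inputs,
                    const std::vector<Buffer>& outputs, int nodeCount)
{
    std::vector<size_t> localBytes(nodeCount, 0);
    for (const Buffer& b : inputs)  localBytes[b.ownerNode] += b.sizeInBytes;
    for (const Buffer& b : outputs) localBytes[b.ownerNode] += b.sizeInBytes;

    int best = 0;
    for (int n = 1; n < nodeCount; n++)
        if (localBytes[n] > localBytes[best]) best = n;
    return best; // push the task to the node with the most local data
}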

Conclusion

Direct3D 12 enables much more fine-grained control over the use of multiple GPUs, whether through linked adapters or through heterogeneous hardware components. Enthusiast gamers, professional users, and GPU compute datacenters stand to benefit from good use of this tech, which motivates the search for designs that use multi-adapter effectively. On this front, we discussed Alternate-Frame-Rendering (AFR), and discussed the design of more general task-parallel systems. The design of a task-parallel engine depends a lot on your use case, and there are many unsolved and non-obvious areas of this design space. For now, we can draw inspiration from existing research on NUMA systems and think about how it applies to the design of our GPU programs.

Using cont with tbb::task_group

Note: Previous post on this topic: https://nlguillemot.wordpress.com/2017/01/12/tbb-task-dag-with-deferred-successors/

In the last post, I showed a proof of concept to implement a “cont” object that allows creating dependencies between TBB tasks in a dynamic and deferred way. What I mean by “dynamic” is that successors can be added at runtime (instead of requiring the task graph to be specified statically). What I mean by “deferred” is that the successor can be added even after the predecessor was created and spawned, in contrast to interfaces where successors need to be created first and hooked into their predecessors second.

The Goal

The goal of this post was to create an interface for cont that abstracts TBB details from everyday task code. TBB’s task interface is low level and verbose, so I wanted to have something productive and concise on top of it.

Extending tbb::task_group

tbb::task_group is a pretty easy way to spawn a bunch of tasks and let them run. An example use is as follows:

int Fib(int n) {
    if( n<2 ) {
        return n;
    } else {
        int x, y;
        task_group g;
        g.run([&]{x=Fib(n-1);}); // spawn a task
        g.run([&]{y=Fib(n-2);}); // spawn another task
        g.wait();                // wait for both tasks to complete
        return x+y;
    }
}

I wanted to reuse this interface, but also be able to spawn tasks that depend on conts. To do this, I made a derived class from task_group called cont_task_group. It supports the following additional syntax:

cont<int> c1, c2;
cont_task_group g;
g.run([&]{ foo(&c1); });
g.run([&]{ bar(&c2); });
g.with(c1, c2).run([&] { baz(*c1, *c2); });
g.wait();

The with(c...).run(f) syntax spawns a task to run the function f only when all conts in c... are set.

A full example is as follows:

void TaskA(cont<int>* c, int x)
{
    tbb::task_group g;
    g.run([&] {
        // A Subtask 1
        c->emplace(1337);
        c->set_ready();
    });
    g.run_and_wait([&] {
        // A Subtask 2
    });
}

void TaskB(int y)
{
}

void TaskC(int z)
{
    std::stringstream ss;
    ss << "TaskC received " << z << "\n";
    std::cout << ss.rdbuf();
}

int main()
{
    cont<int> c;
    cont_task_group g;
    g.run([&] { TaskA(&c, 3); });
    g.run([&] { TaskB(2); });
    g.with(c).run([&] { TaskC(*c); });
    g.wait();
}

This builds the following task dependency graph:

[Figure: task dependency graph]

Sample implementation here: GitHub


TBB Task DAG with Deferred Successors

Note: Previous post on this topic: https://nlguillemot.wordpress.com/2017/01/11/cilk-syntax-study/

I’m thinking about how to implement DAGs of tasks that can be used in a natural way. The problem is that a DAG is a more general construct than what task systems usually allow, due to valid performance concerns. Therefore, if I want to implement a DAG more generally, I need to come up with custom hacks.

Trees vs DAGs

The major difference between a tree and a DAG of tasks is that a DAG allows one task to have an arbitrary number of successor tasks. Also, with a DAG, a task from one subtree of tasks can send data to a task from a different subtree of tasks, which allows you to start tasks when all their inputs are ready rather than when all the previous tasks have run and finished. (Hmm… That makes me think of out-of-order versus in-order processors.) This expands on the functionality of a tree of tasks, since trees only allow outputs to be passed to their immediate parent, whereas DAGs can pass data to grandparents or great-grandparents, or to tasks in subsequent trees.

By default, tasks in task systems like TBB and Cilk are designed to have only one successor: a parent task, or a continuation task. Having a parent task makes it possible to have nested tasks, which is useful for naturally spawning tasks from tasks, similarly to how functions can call other functions. Continuation tasks make it potentially more efficient to spawn a follow-up task to handle the results of a task, and they can do so without affecting the reference count of the parent task.

DAG Implementation

To implement a general DAG, you need to keep track (in one way or another) of the connections between the outputs of some tasks and the inputs of other tasks. There are two aspects to this connection:

  1. How is memory allocated for the data passed from predecessor task to successor task?
  2. How is the successor task spawned when its inputs are all satisfied?

According to Intel’s documentation (See: General Acyclic Graphs of Tasks), it’s suggested that the memory for the passed data is stored within the successor task itself. Each task object contains the memory for its inputs, as well as a counter that keeps track of how many inputs need to be received before the task can be spawned. In the TBB example, each task also keeps a list of successors, which allows predecessors to write their outputs to their successor’s inputs, and allows the predecessor to decrement their successor’s count of missing arguments (and can finally spawn the successor task if the predecessor finds that it just gave the successor its final missing input.)
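Here’s a stripped-down sketch of that pattern (not Intel’s exact sample code): the successor owns storage for its inputs and an atomic count of how many are still missing, and whichever predecessor delivers the last input spawns it.

#include <atomic>

struct SumTask
{
    int x = 0, y = 0;            // input storage, written by the predecessors
    std::atomic<int> missing{2}; // how many inputs are still outstanding

    void supply_x(int value) { x = value; input_arrived(); }
    void supply_y(int value) { y = value; input_arrived(); }

    void input_arrived()
    {
        // The predecessor that supplies the final input spawns the task.
        if (missing.fetch_sub(1) == 1)
            spawn();
    }

    void spawn(); // hand the task over to the scheduler (e.g. tbb::task::spawn)
};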

The Problem with DAGs

In the TBB DAG example, all tasks are spawned up-front, which is easy to do in their example since the task graph is structured in a formal way. In my case, I want to use a DAG to implement something that looks like a sequence of function calls, except using tasks instead of functions, to allow different parts of the code to be executed in parallel. I want to use a DAG to make it possible to establish dependencies between these sequential tasks, to allow the programmer to create tasks that start when their inputs are available. In short, instead of creating the tasks up front, I want to create the tasks in roughly sequential order, for the purpose of readability.

The problem with what I want to do is that I can’t directly store the outputs of predecessors inside the successor. Since the predecessors need to be passed a pointer to where their outputs should be stored, the successor (which stores the inputs) needs to be allocated before the predecessors. This means you roughly need to allocate your tasks backwards (successor before predecessor), but spawn the tasks forward (predecessors before successors). I don’t like this pattern, since I’d rather have everything in order (from predecessor to successor). It might not be absolutely as efficient, but I’m hoping that the productivity and readability improvement is worth it.

The Interface

Instead of spawning successor tasks up front, I’d like to allocate the storage for the data up front, similarly to Cilk. Cilk has a “cont” qualifier for variables, which can be used to communicate data from one task to another. For example, the fibonacci example from the Cilk paper contains the following code:

cont int x, y;
spawn_next sum(k, ?x, ?y);
spawn fib (x, n-1);
spawn fib (y, n-2);

This code computes fib(n-1) and fib(n-2), then passes the results of those two computations to a sum continuation task, which implements the addition within fib(n) = fib(n-1) + fib(n-2). The data is passed through the x and y variables, which are marked cont. I don’t know why the Cilk authors put the call to spawn sum before the calls to fib in this example, but perhaps this alternate arrangement of the code is possible:

cont int x, y;
spawn fib (x, n-1);
spawn fib (y, n-2);
spawn_next sum(k, ?x, ?y);

With this alternative arrangement of the code, the order of the tasks being spawned mirrors the equivalent code in plain sequential C:

int x, y;
fib(&x, n-1);
fib(&y, n-2);
sum(&k, x, y);

Implementing cont

If the successor task is allocated before the predecessors, the predecessor that supplies that final missing input to the successor can also spawn the successor. However, if the successor is allocated after the predecessors, it’s possible that all predecessors finish their work before the successor is allocated. If we allocate the successor and find that all its inputs are already available, we can simply immediately spawn it. However, if some inputs of the successor are still not available, then we need to find a way to spawn the successor when those inputs become available.

To spawn successors when a cont becomes available, each cont can keep a list of successors that are waiting for it. When the cont is finally set, it can pass that input to its successors, and potentially also spawn successors if this cont was their final missing input. The difficulty with implementing this system lies in resolving the race condition between the predecessor, the cont, and successors.

Here’s an example of a race condition between predecessor/cont/successor. Suppose a successor task is spawned with some cont inputs. The successor might see that a cont input has not yet been set, so the successor adds itself to the list of successors in the cont. However, it might be possible that the cont suddenly becomes set in a different thread while the successor is adding itself to the cont’s successor list. The thread that is setting the cont might run before it sees the new successor added, so it might not notify the successor that the input is now complete (by decrementing its counter and possibly spawning it.)

[Figure: cont successor linked list]

It might be possible to solve the problem described above if it’s possible for the successor to atomically check that the cont is not satisfied and if so add itself to the cont’s list of successors. This might be possible if the cont uses a linked list of successors, since the successor could do a compare-and-swap that sets a new head for the list only if the input is unsatisfied.

If that compare-and-swap fails because the cont became set just before the CAS, the successor can just treat the input as satisfied and move on to checking the next input. If the CAS fails because another successor registered themselves concurrently, then the registration needs to be tried again in a loop. On the other side of that race condition, if the cont’s compare-and-swap to set its completion fails because a successor added themselves to the list concurrently, then the cont just tries again in a loop, which allows the cont to notify the successor that interrupted it of completion.

If the cont becomes set while the successor is hooking up other inputs, the cont will pass the input and decrement the counter, which itself is not a problem. However, the last cont might finish and spawn the successor before the function hooking the successor finishes, which shouldn’t be a problem as long as nothing else needs to happen after the successor finishes hooking itself up to its inputs. If something does need to happen immediately after the hookups, the successor can initialize its counter with an extra 1, which it only decrements at the end of the hookup (and potentially then spawns the task.)

A detail of this implementation is the allocation of the nodes of the linked list. The successor task needs at least one linked list node allocated along with it for each cont it waits on. Since this is a fixed number, the allocation can be done more efficiently.

There’s a difficulty with the compare-and-swap, which is that it has to cover two different conditions. First, the successor should only be added to the list if the cont is not yet set. Second, appending to a linked list with a CAS can fail if two successors try to add themselves to the list simultaneously. To solve this problem, I propose that the cont’s “is set” boolean be stored in the least significant bit of the head of the linked list. This allows operations on the cont to both swap the head of the list and compare-and-set completeness simultaneously. If pointers are aligned then the least significant bit is always 0, so no information is lost by reusing that bit. We just need to make sure to mask out that bit before dereferencing any pointers in the linked list.
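Here’s a rough sketch of that tagged-head idea in C++ atomics (simplified from the real implementation linked below; successor_node and the notification step are placeholders):

#include <atomic>
#include <cstdint>

struct successor_node { successor_node* next; /* task to notify, etc. */ };

struct cont_head
{
    std::atomic<uintptr_t> head{0}; // bit 0 = "cont is set", upper bits = list head pointer

    // Returns false if the cont was already set; the caller then treats this input as ready.
    bool try_add_successor(successor_node* node)
    {
        uintptr_t oldHead = head.load(std::memory_order_acquire);
        for (;;)
        {
            if (oldHead & 1)
                return false; // cont already set, no need to register
            node->next = reinterpret_cast<successor_node*>(oldHead);
            uintptr_t newHead = reinterpret_cast<uintptr_t>(node);
            if (head.compare_exchange_weak(oldHead, newHead,
                                           std::memory_order_release,
                                           std::memory_order_acquire))
                return true; // registered as a waiter
            // CAS failed: either the cont got set or another successor raced us. Retry.
        }
    }

    // Called by the predecessor: atomically sets the flag and returns the list of waiters to notify.
    successor_node* mark_set()
    {
        uintptr_t oldHead = head.load(std::memory_order_acquire);
        while (!head.compare_exchange_weak(oldHead, oldHead | 1,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire))
        {
            // A successor added itself concurrently; retry with the refreshed head.
        }
        return reinterpret_cast<successor_node*>(oldHead & ~uintptr_t(1));
    }
};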

Implementation

I tried scratching up an implementation here: https://github.com/nlguillemot/tbb_future_cont (see tbbtest/main.cpp)

The main additions are as follows:

  • cont_base: Base class for “cont” objects, manages an atomic linked list of successor tasks.
  • cont: Derived from cont_base, stores data in “std::optional-style” to pass data from predecessor to successors.
  • spawn_when_ready: Spawns a task when a list of conts are all set. Interface still a bit low-level.

I’ve only done a few tests, so I don’t know if it’s 100% correct. I only really have one test case, which I’ve debugged in a basic way by adding random calls to sleep() to test different orders of execution. I wouldn’t consider using this in production without a thorough analysis. It’s honestly not that much code, but I’m not an expert on the intricacies of TBB.

Also, I was lazy and used sequential consistency for my atomic operations, which is probably overkill. Any lock-free experts in the house? 🙂 (Update: I’ve replaced the seq_cst with acquires and releases. Fingers crossed.)

I’ve also not sugar-coated the syntax much, so there’s still lots of low-level task management. I’d like to come up with syntax to spawn tasks in a much simpler way than manually setting up tbb::task derivatives, setting reference counts, allocating children, etc. This is a topic for another post.

With this implementation, I was able to build the following task dependency graph, with each edge annotated with its type of dependency.

[Figure: task dependency graph with annotated edge types]

Cilk Syntax Study

Note: Previous post on this topic: https://nlguillemot.wordpress.com/2017/01/09/a-task-is-like-a-function/

I’m thinking more about how one can use TBB to write task code that looks similar to existing C code. Of course, people have tried to do this before, and made languages that integrate task parallelism naturally. This article takes a look at these existing solutions, looking for inspiration.

Cilk

Probably the most well-known task-based programming language is Cilk.

Here’s an example Cilk procedure (from the paper above):

thread fib (cont int k, int n)
{
  if (n < 2)
  {
    send_argument(k,n);
  }
  else
  {
    cont int x, y;
    spawn_next sum(k, ?x, ?y);
    spawn fib (x, n-1);
    spawn fib (y, n-2);
  }
}

thread sum (cont int k, int x, int y)
{
  send_argument (k, x+y);
}

There’s a certain amount of syntactical sugar here:

  • functions that act as tasks have a “thread” qualifier
  • a “spawn” keyword differentiates spawning child tasks from calling functions
  • a “spawn_next” keyword spawns a continuation task (to spawn more tasks until results arrive)
  • “cont”-qualified variables allow passing data from predecessor to successor task.
  • a built-in “send_argument” sets “cont” variables, and spawns tasks with fully satisfied arguments.
  • a built-in “?” operator allows declaring the dependency of a successor on a predecessor.

This is some pretty cute syntax. My main worry is that there might be overhead in the automation of passing around continuation variables. In contrast, TBB also allows creating continuation tasks, but it requires you to pass continuation arguments by reference manually. For example, TBB users can create a continuation task with inputs as member variables, and the address of these member variables are used as destination addresses for the childrens’ computation. See Continuation Passing. Still, the TBB continuation syntax is pretty tedious to type (and probably error-prone), and I wonder if we can do some C++ magic to simplify it.

The “spawn” and “spawn_next” syntax makes spawning tasks look a lot like calling functions, which is consistent with the goals I described in the previous post. The “cont” variables might be possible to implement by wrapping them in a C++ type, which could implement operator= (or a similar function) for the purpose of implementing an equivalent to “send_argument”. Cilk allows cont variables to be passed as arguments that are declared non-cont (such as sum’s x/y above), and automatically unwraps them when the time comes to actually call the function. In a C++ implementation, this automatic unwrapping might be possible to implement with a variadic function template that unwraps its arguments before passing them to the desired function. If that’s too difficult, we can fall back to defining the continuation function with explicit “std::future”-like arguments, requiring using a special function to unwrap them at the usage site.
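As a quick sketch of that automatic unwrapping idea (hypothetical code; it assumes a cont<T> wrapper type that exposes operator* to retrieve the delivered value), an overload set plus a variadic forwarder could let callers mix plain values and conts:

#include <utility>

// Plain values pass through untouched...
template <typename T>
T& unwrap(T& plain) { return plain; }

// ...while cont-wrapped values are dereferenced to get the delivered data.
template <typename T>
T& unwrap(cont<T>& c) { return *c; }

// Call a continuation function with a mix of plain and cont arguments.
template <typename F, typename... Args>
void invoke_unwrapped(F&& f, Args&... args)
{
    std::forward<F>(f)(unwrap(args)...);
}

// e.g. invoke_unwrapped(sum, k, x, y); where x and y are cont<int>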

I think one of the best things about Cilk is implicitly generating dependencies between tasks by passing arguments. This is much less work and is more maintainable than explicitly declaring dependencies. It does not deal with running two tasks in a particular order because of side-effects, like if you want printf() calls in two tasks to always happen in the same order. This might be possible to mitigate in your design by factoring out side-effects. Alternatively, we could create a useless empty struct and use that to indicate dependencies while reusing the syntax used to pass meaningful data. This is very similar to the tbb::flow::continue_msg object used in TBB flow graph.

By the way, Cilk’s dependencies are implemented by keeping a counter of missing arguments for each task. When the counter reaches 0, the task can be executed. This is very similar to how TBB tasks implement child-parent dependencies. The awkwardness is that TBB normally only supports a single parent task, so a task with multiple parents needs to be handled specially. See General Acyclic Graphs of Tasks.

Cilk Plus

Cilk Plus is an extension of C/C++ available in some compilers. It enables features similar to Cilk in a way that interops with C/C++. However, instead of any continuation passing, it defines a keyword “cilk_sync”, which waits for all child tasks to finish executing before proceeding. This is probably perfect for fork-join parallelism (a tree of tasks), but I’m not sure if it’s possible to implement a general directed acyclic graph with these features.

ISPC

The ISPC language is mainly useful for writing high-performance SIMD code, but it also defines some built-in syntax for task parallelism. Namely, it supports built-in “task”, “launch”, and “sync” keywords. Again, this seems limited only to fork-join parallelism.

Others?

I’ve seen a few other languages with task-parallelism, but they usually seem to stop at fork/join parallelism, without talking about how continuations or DAGs might be implemented. If you know about a programming language interface that improves on the syntax of Cilk for creating continuations, please tell me about it.

Conclusions

I like Cilk’s ideas for passing data from child task to parent task. Implementing an interface similar to it in C++ using TBB might allow a pretty natural way of implementing task parallelism both for fork/join task trees or more general task DAGs. My main concern is making an interface that makes it easy to do common tasks.

I think that continuation passing might be an elegant way to implement sequential-looking code that actually executes in a DAG-like fashion, which would make it easy to reuse the average programmer’s intuition of single-threaded programming. I want the DAG dependencies to be built naturally and implicitly, similar to how Cilk implements “cont” variables. I want to make it easy to create placeholder “cont” variables that are used only to build dependencies between tasks with side-effects that need to operate in a specific order, similarly to tbb::flow::continue_msg. I also want a way to have a node with multiple parents (to implement general DAGs), and I’d like to minimize the overhead of doing that.

One of my main concerns is how to encapsulate the reference count of tasks. TBB sample programs (in the TBB documentation) all work with reference counts in a pretty low-level way, which may be suitable for when you want to carefully accelerate a specific algorithm, but seems error-prone for code that evolves over time. I hope that this logic can be encapsulated in objects similar to Cilk’s “closure” objects. I think these closure objects could be implemented by creating a derived class from tbb::task, and some C++ syntactical sugar (and maybe macros) could be used to simplify common operations of the task system. From there, I’m worried about the potential overhead of these closures. How can they be allocated efficiently? Will they have a lot of atomic counter overhead? Will their syntax be weird? I’ll have to do some experimentation.

Principles of Compute (Part 2)

Introduction

In the last post, we talked about the motivation to write data-parallel code in order to scale the performance of our programs with hardware resources. We also saw some basic designs for parallel algorithms, mostly in the abstract.

In this post, we’ll go into more detail about what aspects of hardware design exist to increase the performance of data-parallel code. There is a lot of overlap between both CPU and GPU design in this area, so this is quite generally applicable knowledge when writing parallel code. The goal is to identify aspects of hardware design that we can rely on without knowing too much about the underlying architecture, since this allows us to write code that stands the test of time. Naturally, there are still differences between CPU and GPU hardware, so these differences will be highlighted too.

Instruction Parallelism and SIMD

For the purpose of discussion, let’s consider the following parallel loop:

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0f;
    }
}

Since this is a trivially parallel loop, we can straightforwardly apply parallelism techniques to accelerate it. However, before we actually modify the code ourselves, let’s consider what the processor could do to run this code in parallel for us automatically.

Automatic Instruction Parallelism

In theory, the underlying processor could automatically determine that this code can run in parallel by observing the reads and writes being made. At each iteration, it would see a read from data[i] into a temporary register, some math on that temporary register, then a write back to data[i]. In theory, the processor could internally build dependency graphs that represent the dependencies between all reads and writes, defer the evaluation of this graph until a lot of work has been accumulated, then evaluate the work in the dependency graph by executing different code paths in parallel.

The above can sound a little bit like science fiction, but it does happen to a limited extent in processors today. Processors like CPUs and GPUs can automatically execute instructions in parallel if there does not exist a data hazard between their inputs and outputs. For example, if one instruction reads from memory and a subsequent instruction writes that memory, the processor will wait for the read to finish before executing the write, perhaps using a technique like scoreboarding. If there does not exist such a data hazard, the processor might execute the two instructions in parallel. Additionally, some processors may be able to automatically remove superfluous data dependencies using register renaming, or by making guesses on the future state of the data using speculative execution to avoid having to wait for the results to arrive.

Of course, relying on processor smarts comes at a cost. The hardware becomes more expensive, these features come with their own overhead, and it’s hard to trust that these optimizations are really happening unless you understand the processor at a very deep level (and perhaps have professional tools for verifying it, like Intel VTune). Two instructions that can execute in parallel might also be separated by so much code that the processor is not able to see that they can be executed in parallel, and the compiler might not be allowed to perform the optimization safely either.

For example, the following addition of a print statement to “times_two” might make it too complicated for the compiler and processor to safely execute iterations of the loop in parallel, since it can’t know if the implementation of “printf” might somehow affect the contents of the “data” array.

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0f;
        printf("data[%d] = %f\n", i, data[i]);
    }
}

Manual Instruction Parallelism

In theory, we might be able to help the compiler and processor identify instructions that can run in parallel by explicitly writing them out. For example, the following code attempts (possibly in vain) to help the compiler and processor to recognize that the operations on the data array can be done in parallel, by iterating over it in steps of 4.

void times_two(float* data, int n)
{
    assert(n % 4 == 0);

    for (int i = 0; i < n; i += 4)
    {
        for (int j = 0; j < 4; j++) {
            data[i + j] = data[i + j] * 2.0f;
        }

        for (int j = 0; j < 4; j++) {
            printf("data[%d] = %f\n", i + j, data[i + j]);
        }
    }
}

Assuming the processor can actually understand the intention of this code, the situation is still not great. This setup outputs much more machine code, which may hurt the efficiency of the instruction cache, and still relies on the processor’s ability to dynamically identify instructions that can execute in parallel.

In the face of the difficulties in automatically executing instructions in parallel, hardware designers have created instructions that allow explicitly declaring operations that run in parallel. These instructions are known as SIMD instructions, meaning “Single Instruction Multiple Data”. As hinted by “multiple data”, these instructions are very well suited to exploit data-parallelism, allowing either the compiler or the programmer to assist the processor in recognizing work that can be done in parallel.

SIMD instruction sets include parallel versions of typical arithmetic instructions. For example, the ADDPS instruction on Intel processors computes 4 additions in one instruction, which allows explicit indication to the processor that these 4 additions can be executed in parallel. Since this ADDPS instruction needs quadruple the inputs and outputs, it is defined on 128-bit registers as opposed to the typical 32-bit registers. One 128-bit register is big enough to store four different 32-bit floats, so that’s how the quadrupled inputs and outputs are stored. You can experiment with SIMD instruction sets using so-called compiler intrinsics, which allow you to use these SIMD instructions from within your C/C++ code as if they were ordinary C functions. You can use these by including a compiler-supplied header like xmmintrin.h.

As an example of applying SIMD, consider the reworked “times_two” example:

void times_two(float* data, int n)
{
    assert(n % 4 == 0);

    // two = { 2, 2, 2, 2 }
    __m128 two = _mm_set1_ps(2.0f);

    for (int i = 0; i < n; i += 4)
    {
        __m128 x = _mm_load_ps(&data[i]);
        x = _mm_mul_ps(x, two);
        _mm_store_ps(&data[i], x);
    }
}

This code implements the times_two example at the start of this section, but computes 4 multiplications in every iteration. There are a few complications:

  • The syntax is ugly and verbose.
  • It requires the input array to have a multiple of 4 in size (unless we add a special case.)
  • It requires the input array to be aligned to 16 bytes (for best performance.)
  • The code is less portable.

Writing code that assumes a SIMD width of 4 is also non-ideal when you consider that newer Intel processors can do 8-wide or 16-wide vector operations. Different GPUs also execute SIMD code in a variety of widths. Clearly, there are many good reasons to want to abstract the specifics of the instruction set in order to write more portable code more easily, which can be done automatically through the design of the programming language. This is something we’ll talk about later.

I hope we can agree that SIMD adds another level to our data-parallel decomposition. Earlier, we talked about splitting work into data-parallel pieces and distributing them among cores, and now we know that we can get more parallelism by running SIMD code on each core. SIMD compounds with multi-core in a multiplicative way, meaning that a 4-core processor with 8-wide SIMD has a maximum theoretical performance improvement of 4 * 8 = 32.

Memory Access Latency Hiding

With the compute power of multi-core and SIMD, our execution time quickly becomes bound by the speed of memory access rather than the speed of computation. This means that the performance of our algorithm is mostly dictated by how fast we can access the memory of the dataset, rather than the time it takes to perform the actual computation.

In principle, we can use our newly found computation power of multi-core and SIMD to lower the overhead of memory access by investing our spare computation in compression and decompression. For example, it may be highly advantageous to store your data set using 16-bit floating point numbers rather than 32-bit floating point numbers. You might convert these floats back to 32-bit for the purpose of running your math, but it might still be a win if the bottleneck of the operation is memory access. This is an example of lossy compression.
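For example, on Intel processors with the F16C extension there are intrinsics that convert between packed 16-bit and 32-bit floats. The sketch below is a variation on the earlier times_two (my own illustration, assuming n is a multiple of 4 and that the compiler is given the F16C flag): the data set stays in half precision in memory, and is expanded to full precision only for the math.

#include <immintrin.h> // F16C conversion intrinsics (e.g. compile with -mf16c)
#include <stdint.h>

void times_two_f16(uint16_t* data, int n) // data holds 16-bit floats
{
    __m128 two = _mm_set1_ps(2.0f);

    for (int i = 0; i < n; i += 4)
    {
        __m128i packed = _mm_loadl_epi64((const __m128i*)&data[i]); // load 4 half floats
        __m128  x      = _mm_cvtph_ps(packed);                      // expand to 32-bit floats
        x = _mm_mul_ps(x, two);
        packed = _mm_cvtps_ph(x, _MM_FROUND_TO_NEAREST_INT);        // compress back to 16-bit
        _mm_storel_epi64((__m128i*)&data[i], packed);
    }
}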

Another way to deal with the overhead of memory access is to create hardware which hides the latency of memory access by doing compute work while waiting for it. On CPUs, this is done through Hyper-Threading, which allows you to efficiently create more threads than there are physical cores, by working on a second thread of work where the core would otherwise wait for the first thread’s memory access to complete.

A significant factor in improving the latency of memory access lies in the design of your algorithm and data structures. For example, if your algorithm is designed to access memory in predictable patterns, the processor is more likely to guess what you’re doing and start fetching memory ahead of your requests, which makes it more likely to be ready when the time comes to access it. Furthermore, if successive memory operations access addresses that are close to each other, it’s more likely that the memory you want is already in cache. Needless to say, cache locality is very important.

Beyond cache effects, you may also be able to hide the latency of memory access by starting many memory operations as early as possible. If you have a loop that loops 10 times and does a load from memory at the beginning of every loop, you might improve the performance of your code by doing the loads “up front”. For example:

void foo()
{
    int loads[10];

    for (int i = 0; i < 10; i++) {
        loads[i] = load_thing(i);
    }

    for (int i = 0; i < 10; i++) {
        do_thing_with_load(loads[i]);
    }
}

This optimization is more likely to be beneficial on an in-order core than on an out-of-order core. This is because in-order cores execute instructions in the order in which they are read by the program, while out-of-order cores execute instructions based on the order of data-dependencies between instructions. Since out-of-order cores execute instructions in order of data-dependency, the loads that were pulled out of the loop might not execute early despite our best efforts. Instead, they might only execute at the time when the result of the load is first used.

Optimizations based on playing around with the order of instructions are generally less useful on out-of-order cores. In my experience, these optimizations on out-of-order cores often have no noticeable effect, and are sometimes even a detriment to performance. CPU cores these days are typically out-of-order, but GPU cores are typically in-order, which makes these kinds of optimizations more interesting. An optimization like the one above might already be done by the GPU shader compiler without your help, but it might be worth the experiment to do it yourself, by either manually changing the code or by using compiler hints like HLSL’s “[unroll]” attribute.

That said, I’ve seen cases where manually tweaking memory access has been extremely beneficial even on modern Intel processors. For example, recent Intel processors support a SIMD instruction called “gather” which can, for example, take as input a pointer to an array and four 32-bit indices into the array. The gather instruction performs these multiple indexed loads from the array in a single instruction (by the way, the equivalent for parallel indexed stores to an array is called “scatter”). As expected from its memory access, the gather instruction has a relatively high latency. This can be a problem in the implementation of, for example, Perlin noise. Perlin noise implementations use a triple indirection into an array to produce seemingly random numbers, in the style of “data[data[data[i] + j] + k]”. Since each gather depends on the result of a previous gather, the three gathers need to happen completely sequentially, which means the processor basically idles while waiting for three times the memory access latency of a gather. Manually factoring out redundant gathers or tweaking the Perlin noise algorithm can get you a long way.

On the topic of dependent loads, a common target of criticism of memory access patterns is the linked list. If you’re traversing a linked list made out of nodes that are randomly scattered in memory, you can certainly expect that the access of each successive linked list node will cause a cache miss, and this is generally true. However, there’s another problem relating to linked lists, related to memory access latency. This problem comes from the fact that when iterating over a linked list, there’s no way for the processor to go to the next node until the address of the next node is loaded. This means you can bear the full brunt of memory access latency at every step of the linked list. As a solution, you might want to consider keeping more than one pointer per node, for example by using an octree rather than a binary tree.

CPUs vs GPUs?

I’ve mostly been talking about CPUs so far, since they’re generally more familiar and easier to work with in example programs. However, pretty much everything I’ve said applies to GPUs as well. GPUs are also multi-core processors, they just have smaller cores. The cores on GPUs run SIMD code similarly to CPUs, perhaps with 8-wide or 16-wide SIMD. GPUs have much smaller caches, which makes sense, since synchronizing caches between multiple processors is inherently a very ugly problem that works against data-parallelism. By having smaller caches, GPUs pay a much higher cost for memory access, which they amortize by executing spare computation while waiting for their memory accesses to complete, very similar in principle to hyper-threading.

Conclusion/To Be Continued

In the next post, we’ll talk about how the data-parallel uses of multi-core and SIMD are abstracted by current compute languages, so we can finally understand the idiosyncrasies of compute languages.

Principles of Compute (Part 1)

Introduction

The free lunch is over, and all software engineers need to understand how to write parallel code if they want to continue seeing performance improvements with future CPUs. Similarly, we have to think parallel if we want to take advantage of massively parallel computation resources like GPUs or FPGAs.

To fully exploit these hardware resources, people started designing so-called compute languages. For example, the CUDA language allows you to write massively parallel programs to run on NVIDIA GPUs, and the OpenCL language allows you to write massively parallel programs that can efficiently target CPUs, GPUs, and FPGAs. Graphics APIs like OpenGL and DirectX also allow you to write massively parallel geometry and pixel processing programs that run on GPUs, and on CPUs too with software renderers like WARP.

In recent years, OpenGL and DirectX have embraced so-called “compute shaders”, which allow general-purpose compute programs to run alongside the traditional geometry and pixel processing programs. Compute shaders brought significant innovation to renderer design, and are now an integral part of advanced rendering engines.

So Why Do We Care?

Let’s take a step back and think. Everything we’re talking about here is strictly for the purpose of optimization. Compute programs don’t let you implement logic that wasn’t possible to implement previously. At best, compute languages let you implement the same logic more efficiently, at the likely cost of more development time. Of course, performance is still very important. If we didn’t care about performance, we wouldn’t be using GPUs at all, but if you’re a gamer or a game developer, you probably know how important high-performance GPUs are for 3D gaming graphics to work at enjoyable frame rates.

The reason I’m emphasizing the importance of optimization is that writing efficient programs is highly dependent on your understanding of the relevant computer hardware. We’ve entered the domain of embedded systems, meaning that we are designing software systems with the intention of running them on a specific kind of hardware. As a software engineer, you are solving a problem (“I want to make such-and-such video game”), and you must solve this problem with a hybrid hardware-software solution. Hardware details are more than just a good conversation starter at software development conferences… Understanding the hardware is your job, and it’s the reason employers will pay top dollar to borrow your brain. It’s the argument of kings: This is embedded systems. We care.

Why Read This Article

The goal of this article is to gain an intuitive understanding of how compute works. Instead of memorizing arbitrary OpenGL/OpenCL/DirectX/CUDA terminology, we’ll put ourselves in the shoes of somebody who wants to write programs that fully exploit hardware resources, and we’ll see how this leads to the design of the compute languages we know and love today.

By understanding the principles behind the API, you can mostly spare yourself from having to memorize any API details at all. You can make your design on paper independently of whether you’re using OpenGL or DirectX or whatever, then translate it to the API-specific syntax when the time comes to implement your design.

Furthermore, if you do try to understand compute through only (eg) OpenGL’s documentation, you might find it very difficult. The specifications of compute languages are vague by nature, since they give generous flexibility to be implemented on widely different hardware. This vagueness makes it hard to understand these specifications… Unless you already have a lot of background knowledge about the underlying hardware, in which case you probably barely need to read the specification in the first place. To deal with this vagueness, this article tries to teach this required background knowledge that specifications otherwise omit.

Finally, it’s important to realize that contemporary compute APIs like OpenGL and DirectX are evolving at a fast pace. In 5 years, compute won’t look the same. In 10 years, compute capabilities will have evolved significantly in flexibility and performance. In 20 years, what we’re doing now will look primitive and uncivilized. If you’re reading this now, you’re probably going to see this evolution first-hand. Maybe you’ll be a user, and maybe you’ll be an implementer too. Either way, you have the potential to shape the future, so let’s think about how things should be rather than how things are.

Scaling with Hardware

Previously, developers could write non-parallel (scalar) code, and see their programs run significantly faster with every new generation of hardware. Unfortunately, as CPU designers run up against the physical limits of the universe, it has become more cost-efficient to, for example, create a CPU with two cores at the same (or lower) frequency, rather than trying to create a CPU with a single core at double the frequency.

There is an obvious trend of hardware designers deciding to replicate hardware resources rather than trying to speed up existing hardware resources. As before, we would like to write programs that improve in performance over time with new generations of hardware. To do this, we have to write programs that improve in performance as hardware resources become more and more replicated in the future.

Replicated hardware resources allow us to improve the performance of our programs, but only in a specific kind of way. Seeing this requires a shift in mindset. Consider, for example, using a binary search tree or a hash table to optimize a linear search. Thinking about data structure improvements to improve search time is certainly important, but we also have to consider if and how these data structures would speed up the program if the user ran it with additional hardware resources.

In contrast to single-threaded optimizations, scaling with additional hardware requires distributing the data among available hardware resources. For example, with replicated comparison circuits, we can accelerate comparisons while searching by doing more than one comparison in parallel. With replicated arithmetic hardware, we can compute the hash code for more than one search query simultaneously. With replicated hardware for memory access, we can accelerate the access of the data structure we’re searching in. What these optimizations have in common is that they distribute the data of the input domain among hardware resources, an approach formally known as data-parallelism. Identifying ways to run your code in a data-parallel way is the key to writing code that scales with hardware.

The Red Herring: System-on-a-Thread

In the earlier days of game engine programming, developers were thinking about scaling with hardware by associating software systems to cores. For example, a game programmer might have decided to exploit a dual core CPU by running their graphics system on one core and their audio system on the second core, which is done by creating one operating system thread for each of these systems. This so-called “system-on-a-thread” solution has some serious scalability problems. If the audio thread finishes processing one frame of audio before the graphics thread finishes processing one frame of video, then the audio thread has to become idle while waiting for the graphics thread to catch up. In that period of time, only one out of two CPU cores is active, which means that computation power is left on the table.

Idling cores isn’t the only problem with system-on-a-thread. What happens if you want to add a physics system to the game engine, and the physics system also wants to run on its own thread? In this case, we have 3 system threads but only 2 CPU cores. This is called over-subscription: we have more threads than cores. This is undesirable, since the CPU cores have now become contended resources between the threads.

One core can only run one thread at a time, so the operating system will need to share each core’s running time among the threads assigned to it. When the operating system switches a core from running one thread to running another thread, an expensive context-switch happens. In the worst case, your program might spend more time context-switching between threads than doing useful work.

There is an important lesson to learn from this: Additional parallelism in software can only be useful if it can be efficiently allocated to hardware resources. When it comes down to it, software systems can’t use more circuits than what physically exists on the chip. Only hardware designers have the power to add additional circuits, and as a software designer you have to work with what you’re given.

The problems with system-on-a-thread underline the importance of data-parallelism. If we identify general ways of splitting up the data we’re working on among hardware resources, then we can efficiently split up work between available hardware, independently of how much hardware we really have.

Designing Parallel Programs

To design data-parallel programs, we need to understand what can and can’t be made parallel. As implied by the term “data-parallelism”, the goal is to divide the work into pieces that can be executed independently on the data that needs to be processed, which allows us to execute these pieces of work in parallel. By designing work that can run independently from other work, we guarantee that the execution of one piece of work won’t affect the results of another piece of work. This is important, since work that executes in parallel will likely not execute in a consistent order.

The Problem with Threads

Let’s take a moment to appreciate the problems that arise when parallel code doesn’t run in a consistent order.

Consider the following example program:

/* Thread 1 */           |  /* Thread 2 */
a = 1;                   |  a = 0;
printf("a == %d\n", a);  |

This program is made up of two threads. The order in which the two threads execute their code is unpredictable, since it depends on the inner workings of the C compiler, the operating system and the hardware.

In one run of the program, the execution might happen in the following order:

a = 0;
a = 1;
printf("a == %d\n", a); // -> prints "a == 1"

In this case, Thread 2 ran its first line of code first, then Thread 1 ran to completion. As a result, the program finished with “a” set to 1.

In another run of the program, the execution might happen in this different order:

a = 1;
a = 0;
printf("a == %d\n", a); // -> prints "a == 0"

In this case, Thread 1 ran its first line of code, then Thread 2 ran its first line of code, then Thread 1 ran its second line of code. As a result, the program finished with “a” set to 0.

What this short example shows is that multi-threaded programs can produce different results based on the order in which the code executes on different threads. This unpredictability is bad news for programmers, since it makes the code much harder to understand. To quote Edward A. Lee’s “The Problem with Threads”:

a folk definition of insanity is to do the same thing over and over again and expect the results to be different. By this definition, we in fact require the programmers of multi-threaded systems be insane. Were they sane, they could not understand their programs.

When the order of execution of different threads changes the results of the program, we call that a “race condition”: the program’s results depend on whether one thread reaches a certain point of execution before another. If we want to resolve a race condition, we can identify synchronization points in the code, which can be implemented using an operating system synchronization object such as a mutex, a semaphore, a condition variable, a critical section, or an interlocked (atomic) operation.
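
As a concrete illustration (my own sketch, not from the original example), here is a synchronization point implemented with a C++11 std::mutex. Two threads increment a shared counter, and the mutex guarantees that the read-modify-write of the counter never interleaves between threads:

#include <cstdio>
#include <mutex>
#include <thread>

int counter = 0;
std::mutex counter_mutex;

void add_many(int iterations)
{
    for (int i = 0; i < iterations; i++)
    {
        std::lock_guard<std::mutex> lock(counter_mutex); // synchronization point
        counter += 1;
    }
}

int main()
{
    std::thread t1(add_many, 100000);
    std::thread t2(add_many, 100000);
    t1.join();
    t2.join();
    printf("counter == %d\n", counter); // always 200000 with the mutex held
    return 0;
}

Without the lock, the two increments could interleave and updates would be lost, which is exactly the kind of race condition described above.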

The synchronization objects I just described are very interesting to study and extremely useful, but they also all have relatively high execution costs. Instead, we should aim to avoid needing synchronization at all, and this can be achieved if we design algorithms that avoid race conditions entirely. To do this, we first need to know how to formally identify race conditions.

Identifying Race Conditions

As we saw in the previous section, running two functions in parallel may result in race conditions, which can lead to programs with indeterminate results. There are two main culprits behind these race conditions:

  • Shared memory
  • Side-effects

Shared memory was a problem in the previous example, since the race condition we observed was partly caused by the fact that the memory for the variable “a” was shared between the two threads. If the code was re-designed to make the variable “a” only exist in one thread (as opposed to being shared between the two threads), then it would be impossible for the variable to change due to the execution of a different thread, since those other threads simply don’t have access to it.

Along with shared memory, the other cause of the race condition in the earlier example was that each thread had side-effects: both threads wrote to the shared variable. For contrast, consider the following program:

/* Thread 1 */           |  /* Thread 2 */
b = a;                   |  c = a;
b += 1;                  |  c -= 1;

In this program, both threads have access to the variable “a” through shared memory. However, unlike the program with a race condition that we saw earlier, the two threads only read the value of “a” and never write to it. Since the shared memory is now read-only, there is no race condition: regardless of the order in which the lines of code of the two threads are executed, the final values of “a”, “b”, and “c” are always the same.

Based on these observations, we can make a powerful statement: Threads may share memory without race conditions and without synchronization if the memory they share is read-only.

The fact that memory can be safely shared between threads if it is read-only is very interesting from the perspective of a programming language designer. For example, the Haskell programming language is designed so that every variable is read-only after being written for the first time. Since this prevents functions from having side-effects, memory can be shared arbitrarily, and Haskell programs can therefore be automatically parallelized to a large degree. Haskell has other properties that arguably make it hard to use in real-life code, but it’s interesting and eye-opening to study for programmers of all walks of life.

Despite the elegance of languages like Haskell, our day-to-day life with programming languages like C requires us to both read and write shared memory. If we want to avoid race conditions in our programs, we should ensure that these reads and writes happen in a consistent order. To do this, we must identify and handle the data hazards that arise when reads and writes are done to shared memory. For example, if one thread writes to shared memory and another thread then reads it, that subsequent read should see the value that was just written. In other words, the reader must wait for the previous writer to finish writing. If the reader doesn’t wait, it might read the previous value of that memory (as opposed to the new one it should have read). Even worse, the reader might read data that is a Frankenstein-like mix of the previous value and the new value, also known as a “torn read”.
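
Here is a minimal sketch (my own illustration, not from the original text) of a reader waiting for a writer in C++11. The writer stores a value and then sets an atomic “ready” flag; the reader spins on the flag before reading, so it always sees the value the writer produced:

#include <atomic>
#include <cstdio>
#include <thread>

int shared_value = 0;            // written by the writer, read by the reader
std::atomic<bool> ready(false);  // tells the reader the write has finished

void writer()
{
    shared_value = 42;   // the write we care about
    ready.store(true);   // publish it (sequentially consistent by default)
}

void reader()
{
    while (!ready.load()) { } // wait for the writer to finish
    printf("shared_value == %d\n", shared_value); // always prints 42
}

int main()
{
    std::thread w(writer);
    std::thread r(reader);
    w.join();
    r.join();
    return 0;
}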

Since data hazards are concerned with the ordering of reads and writes, we can exhaustively enumerate every possible scenario:

  1. Read-After-Read (RAR)
  2. Read-After-Write (RAW)
  3. Write-After-Read (WAR)
  4. Write-After-Write (WAW)

The case of Read-After-Read is the one scenario where there is no data hazard. As described previously, there are no race conditions when shared memory is only read and never written.

In the case of Read-After-Write, it’s important for the reader to wait for the writer to finish, otherwise the reader might not see the updated value, or might get a torn read, as was explained earlier.

In the case of Write-After-Read, it’s important for the writer to wait for previous readers to finish, since otherwise the readers might see the new value instead of the old value they were supposed to see. If the writer doesn’t wait for readers to finish, it’s also possible for the writer to cause a torn read by overwriting a value while a reader is only halfway done reading it.

Finally, in the case of Write-After-Write, it’s important that the final value of the data reflects the second write instead of the first write. If this relationship isn’t enforced, successive reads will see the old value (from the first write) instead of the new value (from the second write). Therefore, a writer should wait for previous writers to finish before doing their own write.

Now that we’ve identified all possible data hazards that can cause race conditions, we can more formally identify race conditions in multi-threaded code. Awareness of these data hazards will allow us to design algorithms that avoid them when possible. Furthermore, when we can’t avoid a data hazard, we’ll know that we need to use some kind of synchronization to make sure reads and writes happen in the right order.

Designing Data-Parallel Algorithms

With the knowledge of data hazards, we can now think about designing algorithms that avoid them. This section reviews a variety of methods useful for designing data-parallel algorithms.

Independent Uniform Work

As an initial example, let’s consider the following algorithm:

void times_two(float* data, int n)
{
    for (int i = 0; i < n; i++)
    {
        data[i] = data[i] * 2;
    }
}

Running this code in parallel is fairly easy. Each iteration of the loop is independent from the other iterations, so we can simply distribute the “n” iterations equally among the hardware’s available resources, and that should give ideal hardware usage. This is possible because no memory is shared between the different iterations of the loop, which means that no synchronization is necessary and that the code can be trivially parallelized. We can parallelize the above code among the CPU’s cores trivially with an API like OpenMP, as follows:

void times_two(float* data, int n)
{
    // Runs the for loop below in parallel, distributed among CPU cores.
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        data[i] = data[i] * 2;
    }
}

By the way, OpenMP is a C/C++ compiler extension that allows you to parallelize code with minimal changes. OpenMP is built into most C++ compilers these days, including GCC, Clang, and Visual C++. You can enable it by passing a compiler flag on the command line (-fopenmp for GCC and Clang, /openmp for Visual C++) or by setting a project property in Visual Studio, so feel free to experiment with it.

Non-Independent Uniform Work

Let’s consider a slightly more challenging example: an algorithm that “blurs” an array. As a reminder, a basic box blur (e.g. on an image) is implemented by computing the average of a pixel and its neighbors. In this case, we’ll take the average of a float and its 2 neighbors in the 1D array (the float before and the float after).

If we wanted to blur a single float in the data array, we might do something as follows (ignoring bounds checking):

    data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;

The code above cannot simply be used as the body of a loop over the whole array, because the iterations are not independent from each other. The order in which the iterations are executed changes the output of the algorithm, since the “data[i]” written by one iteration is read by its neighbors, which changes the result of the neighbors’ computations. If we write to data[i] only after the reads from data[] have been done for all iterations, then the answer will be correct, since the writes will not interfere with the reads. We can see this problem as a Write-After-Read data hazard, since we need to wait for all the reads from data[] to complete before we do any writes to it.

The Write-After-Read data hazard happens in this example because we’re trying to read and write the same memory locations within a single pass over the array. To remove the data hazard, we can use an additional memory allocation: if we create a separate array to store the results, there are no more data hazards. The input array is read-only, and output[i] is not shared between iterations, so the code (shown after the notes below) can be easily distributed among available hardware resources in parallel.

Note 1: It is assumed that the “input” and “output” arrays are not overlapping in memory. For example, if both pointers pointed to the same memory, we would have the exact same problem as before.

Note 2: For readability, the bounds-checking for the beginning and end of the array is omitted. If necessary, this could be fixed by handling the first and last iterations as special cases, or by making the input[] array bigger by adding one element before the beginning and one after the end.

void blur(const float* input, float* output, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        output[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0f;
    }
}

Ping-ponging

To take the blur example one step further, let’s consider how we can make a “blurrier” result by blurring the results of the blur. In other words, we want to run the blur algorithm multiple times on the array in a loop. We can easily implement this using the blur function defined previously, by alternating between which buffer is used as input and which buffer is used as output. This could be done as follows:

void blur_4_times(float* input, int n)
{
    // note: assuming borders are handled properly inside blur()
    float* tmp = new float[n];

    float* read_buffer = input;
    float* write_buffer = tmp;

    for (int blur_i = 0; blur_i < 4; blur_i++)
    {
        blur(read_buffer, write_buffer, n);
        std::swap(read_buffer, write_buffer);
    }

    delete[] tmp;
}

This code alternates between reading and writing between two buffers. After each iteration, the buffer being read from and the buffer being written to are swapped. This is known as “ping-ponging”: the roles of the two buffers alternate back and forth, like a ping-pong ball alternating between players on a ping-pong table. 🙂 Note that because the number of blur passes is even, the final result ends up back in the input buffer; with an odd number of passes, an extra copy back into input would be needed at the end.

Reduction

In the next example, we’ll be looking at how to implement a reduction. A reduction is an operation that turns a set of values into a single value, like for example computing the minimum element in an array. Computing the minimum of an array sequentially without any multi-threading is pretty darn easy. For reference, it might look something like this:

float minimum(const float* data, int n)
{
    // result is undefined if the array passed in is empty
    assert(n >= 1);

    float curr_minimum = data[0];
    
    for (int i = 1; i < n; i++)
    {
        if (data[i] < curr_minimum) {
            curr_minimum = data[i];
        }
    }

    return curr_minimum;
}

In the algorithm above, “curr_minimum” is potentially read and written by every iteration of the loop, meaning there exists a data hazard between each iteration of the loop. This data hazard prevents us from safely running the loop in parallel. In order to find the minimum in parallel, we can implement the algorithm in multiple passes. The big idea is to split the array into many smaller arrays, then find the minimum of each of these smaller arrays in parallel. Once the minimums of the smaller arrays are computed, we can find the minimum of these minimums. This subdivision of the work can be applied repeatedly in order to implement a highly parallel minimum algorithm, as shown below.

float minimum(const float* data, int n)
{
    assert(n >= 1);
    assert(is_power_of_two(n)); // for simplicity

    float* mins = new float[n];
    memcpy(mins, data, sizeof(float) * n);

    int curr_n = n;

    for (int pass = 0; pass < log2(n); pass++)
    {
        // Each iteration reads mins[i] and mins[i + curr_n / 2] and writes only
        // mins[i]. No iteration writes an element that another iteration reads,
        // so there are no data hazards within a pass.
        #pragma omp parallel for
        for (int i = 0; i < curr_n / 2; i++)
        {
            mins[i] = std::min(mins[i], mins[i + curr_n / 2]);
        }

        curr_n = curr_n / 2;
    }

    float result = mins[0];

    delete[] mins;

    return result;
}

In the example above, the computation shrinks by half with each pass, since each pass computes the minimum of a pair of inputs from the previous pass. Note that each iteration pairs mins[i] with mins[i + curr_n / 2] rather than with its adjacent element: pairing adjacent elements (mins[2 * i] and mins[2 * i + 1]) would make some iterations write elements that other iterations of the same pass still need to read, which is exactly the Write-After-Read hazard we’re trying to avoid. The number of passes required can be lowered if each iteration of the inner loop computes the minimum of more than 2 values, so there is a tradeoff between the number of passes and the amount of serial work per pass. This is something you need to tune based on the hardware you’re working with and the size of your input.
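
As an aside, recent OpenMP versions (3.1 and later) also offer a built-in min reduction, which lets the runtime manage the per-thread partial minimums and the final combining step. A minimal sketch, assuming an OpenMP 3.1-capable compiler (the function name minimum_omp is just for illustration):

#include <assert.h>

float minimum_omp(const float* data, int n)
{
    assert(n >= 1);

    float curr_minimum = data[0];

    // Each thread keeps a private minimum; OpenMP combines them at the end.
#pragma omp parallel for reduction(min : curr_minimum)
    for (int i = 1; i < n; i++)
    {
        if (data[i] < curr_minimum)
        {
            curr_minimum = data[i];
        }
    }

    return curr_minimum;
}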

Conclusion/To Be Continued

In this post, the motivation for designing data-parallel algorithms was established, based on the idea that we need to write software that can scale in performance as more hardware (cores, ALUs, etc) is added to processors. From there, we explored the concept of data hazards, showing how they are an obstacle to the implementation of data-parallel algorithms. Finally, a few common ways to implement data-parallel algorithms have been introduced.

In the next post, I’ll talk about how hardware (CPUs, GPUs, etc) can be designed to enable the rapid execution of data-parallel algorithms. Once we understand how data-parallel programs work from the conceptual side and from the hardware side, we’ll be able to talk about designing programming languages that allow efficient implementation of data-parallel algorithms, based on principles that can be applied to a wide variety of hardware.

A Task Is Like A Function

In order to effectively use a multi-core CPU, a decomposition of code into tasks must be made. By decomposing code into tasks, the tasks can be distributed among CPU cores by a scheduler. Tasks can have dependencies on other tasks, meaning that the tasks execute in a partial order. This partial order can be represented as a graph, where each task has predecessors and successors. A task can only execute when all its predecessors have completed.

To implement this, we can explicitly create a graph of tasks, with edges between tasks to identify predecessor/successor relationships. The problem with a general task graph system is that it’s not productive to work with. Maybe it’s okay for some purposes to build a task graph through a GUI, but that’s too tedious for everyday code. Imagine how tedious it would be if you had to explicitly create nodes and edges just to express an ordinary single-threaded function call graph! So, how can a productive task graph interface be built?

Languages like C/C++ were originally designed for function decomposition. That means you split your code into functions, and call functions from other functions. eg:

void a() {
    printf("a");
    b();
}
void b() {
    printf("b");
}
int main() {
    a();
    b();
}

This is the natural way to write C/C++ code, and this is how single-threaded game engines are written. In the task-parallel world, we want to enhance function decomposition by adding task decomposition. In other words, we structure the code as tasks that spawn other tasks.

There are benefits of matching task decomposition to function decomposition:

  • It matches the natural code structure of C/C++ (tasks spawn tasks vs functions call functions)
  • It provides a clear upgrade path for single-threaded code (child functions become child tasks)
  • We can leverage C/C++ programmers’ built-up intuition on how to write code.

So, what’s the difference between tasks and functions?

  • Functions are called, then return. Tasks are spawned, then complete.
  • Tasks are a superset of functions. A function is a task that is spawned and waited on immediately.
  • Task spawning is more relaxed than a function call.
    • A spawned task may not necessarily begin until its completion is explicitly waited on.
    • Note: “enqueued” tasks (vs. “spawned”) don’t need an explicit wait, but still start asynchronously.
  • Task completion is more relaxed than a function return.
    • A task may not necessarily complete until its completion is explicitly waited on.

The relaxations of tasks allow them to run in parallel, but this relaxation must be constrained to enforce a partial order on execution. The requirement of partial order comes in 3 main forms, each more general than the last:

  • Task Path: 1 predecessor, 1 successor.
  • Task Tree: N predecessors, 1 successor.
  • Task Directed Acyclic Graph (DAG): N predecessors, N successors.

The Task Path can be implemented as an ordinary sequence of function calls. This may also be integrated in the task system as a sequence of continuation tasks, where each task defines its own continuation.

The Task Tree is implemented as a task that spawns a group of child tasks, then waits for them to complete. This is good for recursive divide-and-conquer algorithms, allowing tasks to be created dynamically based on the distribution of work.
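
For example, here is a minimal sketch (my own, assuming a recent TBB/oneTBB installation) of a task tree using tbb::task_group, a higher-level wrapper over TBB’s tasks. A parent task spawns two child tasks for the two halves of an array, then waits for both to complete:

#include <tbb/task_group.h>

long long parallel_sum(const int* data, int n)
{
    // Below a certain size, just do the work serially.
    if (n < 1024)
    {
        long long sum = 0;
        for (int i = 0; i < n; i++)
            sum += data[i];
        return sum;
    }

    long long left = 0;
    long long right = 0;

    tbb::task_group g;
    g.run([&] { left = parallel_sum(data, n / 2); });               // child task 1
    g.run([&] { right = parallel_sum(data + n / 2, n - n / 2); });  // child task 2
    g.wait(); // the parent waits for both children to complete

    return left + right;
}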

The Task DAG is implemented by spawning tasks only when all their predecessors have completed. This can be implemented by storing a counter of incomplete predecessors in the successor task. When a predecessor task completes, it decrements the counter in each successor. The predecessor that decrements the counter to zero then spawns the successor.
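
Here is a minimal generic sketch (my own, not tied to TBB’s API) of that predecessor-counting scheme, using an atomic counter per task; the “spawn” callback stands in for whatever the underlying task system uses to schedule a task:

#include <atomic>
#include <functional>
#include <vector>

struct DagTask
{
    std::function<void()> work;
    std::atomic<int> remaining_predecessors;
    std::vector<DagTask*> successors;
};

// Called when "task" finishes executing.
void on_task_complete(DagTask* task, const std::function<void(DagTask*)>& spawn)
{
    for (DagTask* successor : task->successors)
    {
        // fetch_sub returns the previous value, so a return value of 1 means
        // we just decremented the counter to zero: all predecessors are done.
        if (successor->remaining_predecessors.fetch_sub(1) == 1)
        {
            spawn(successor);
        }
    }
}

The last predecessor to finish is the one that ends up scheduling each successor, which is exactly the behavior described above.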

These three forms of task relationships can be implemented using Intel TBB’s task system.

Note: TBB’s task-parallelism achieves multi-core scaling through a work-stealing scheduler, which is described in Intel’s TBB documentation. With that in mind, here is how each of the three forms maps onto TBB:

The Task Path is implemented either as an ordinary sequence of function calls (independent of TBB), or using continuation tasks.

The Task Tree is implemented by spawning child tasks then waiting for them to finish. It’s the poster-child use case of TBB tasks.

The Task DAG is implemented by reusing the reference count built into TBB tasks as the number of incomplete predecessors. The predecessor tasks must decrement and check the reference count and possibly spawn each successor, as explained previously.

It seems like we have all the building blocks to build task graphs with ordinary TBB tasks, but the API is a bit too low-level. It may be possible to build a simpler interface on top of TBB that would also simplify common use cases. I would like this API to look as close as possible to an ordinary call graph, for the reasons stated at the start of this writeup. Designing this API is a topic for another post.

PS: TBB is designed for non-blocking tasks, meaning it is not suitable for performing I/O or waiting for asynchronous operating system events. With Intel TBB’s task scheduler, a blocking task results in an idle core in the best case, and a deadlock in the worst case. To handle asynchronous I/O and events, I would recommend using the Windows Threadpool task system, since its integration in the operating system helps it make intelligent scheduling decisions by knowing about the state of I/O operations and events. TBB tasks can be executed asynchronously in response to OS events using enqueued tasks.
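
For reference, here is a minimal sketch (my own, with error handling omitted) of submitting a single work item to the Windows Threadpool using the Win32 thread pool API:

#include <windows.h>
#include <cstdio>

VOID CALLBACK my_work_callback(PTP_CALLBACK_INSTANCE instance, PVOID context, PTP_WORK work)
{
    // This runs on a thread owned by the OS threadpool.
    printf("Hello from the Windows Threadpool! context = %p\n", context);
}

int main()
{
    PTP_WORK work = CreateThreadpoolWork(my_work_callback, nullptr, nullptr);
    SubmitThreadpoolWork(work);                  // start the work asynchronously
    WaitForThreadpoolWorkCallbacks(work, FALSE); // wait for it to complete
    CloseThreadpoolWork(work);
    return 0;
}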