Introduction
Direct3D 12 opens up a lot of potential by making it possible to write GPU programs that make use of multiple GPUs. For example, it’s possible to write programs that distribute work among multiple GPUs from linked GPUs (eg: NVIDIA SLI or AMD Crossfire), or even between GPUs from different hardware vendors.
There are many ways to make use of these multi-adapter features, but it’s not obvious yet (at least to me) how to best make use of it. In theory, we should try to make full use of all available hardware on a given computer, but there are difficult problems to solve along the way. For example:
- How can we schedule GPU tasks to minimize communication overhead between different GPUs?
- How can we distribute tasks among hardware that vary in performance?
- How can we use special hardware features? eg: “free” CPU-GPU memory sharing on integrated GPUs.
D3D12 Multi-Adapter Features Overview
To better support multiple GPUs, Direct3D 12 brings two main features:
- Cross-adapter memory, which allows one GPU to access memory of other another GPU.
- Cross-adapter fences, which allows one GPU to synchronize its execution with another GPU.
Working with multiple GPUs in D3D12 is done explicitly, meaning that sharing memory and synchronizing GPUs must be taken into consideration by the rendering engine, as opposed to being “automagically” done inside GPU drivers. This should lead to more efficient use of multiple GPUs. Furthermore, integrating shared memory and fences into the API allows you to avoid making round-trips to the CPU to interface between GPUs.
For a nice quick illustrated guide to the features described above, I recommend the following article by Nicolas Langley: Multi-Adapter Support in DirectX 12.
D3D12 supports two classes of multi-adapter setups:
- Linked Display Adapters (LDA) refers to linked GPUs (eg: NVIDIA SLI/AMD Crossfire). They are exposed as a single
ID3D12Device
with multiple “nodes”. D3D12 APIs allow you to specify a bitset of nodes when the time comes to specify which node to use, or which nodes should share a resource. - Multiple Display Adapters (MDA) refers to multiple different GPUs installed on the same system. For example, you might have both an integrated GPU and a discrete GPU in the same computer, or you might have two discrete GPUs from different vendors. In this scenario, you have a different
ID3D12Device
for each adapter.
Another neat detail of D3D12’s multi-adapter features is Standard Swizzle, which allows GPU and CPU to share swizzled textures using a convention on the swizzled format.
Central to multi-adapter code is the fact that each GPU node has its own set of command queues. From the perspective of D3D12, each GPU has a rendering engine, a compute engine, and a copy engine, and these engines are fed through command queues. Using multiple command queues can help the GPU schedule independent work, especially in the case of copy or compute queues. It’s also possible to tweak the priority of each command queue, which makes it possible to implement background tasks.
Use-Cases for Multi-Adapter
One has to wonder who can afford the luxury of owning multiple GPUs in one computer. Considering that multi-adapter wasn’t properly supported before D3D12, it was probably barely worth thinking about, other than scenarios explicitly supported by SLI/Crossfire. In this section, I’ll try to enumerate some scenarios where the user might have multiple GPUs.
“Enthusiast” users with multiple GPUs:
- Linked SLI/Crossfire adapters.
- Heterogeneous discrete GPUs.
- Integrated + discrete GPU.
“Professional” users:
- Tools for 3D artists with fancy computers.
- High-powered real-time computer vision equipment.
“Datacenter” users:
- GPU-accelerated machine-learning.
- Engineering/physics simulations (fluids, particles, erosion…)
Another potentially interesting idea is to integrate CPU compute work in DirectX by using the WARP (software renderer) adapter. It seems a bit unfortunate to tie everyday CPU work into a graphics API. I guess it might lead to better CPU-GPU interop, or it might open opportunities to experiment with moving work between CPU and GPU and see performance differences. This is similar to using OpenCL to implement compute languages on CPU.
Multi-adapter Designs
There are different ways to integrate multi-adapter into a DirectX program. Let’s consider some options.
Multi-GPU Pipelining
Pipelining with multiple GPUs comes in different flavors. For example, Alternate Frame Rendering (AFR) consists of alternating between GPUs with each frame of rendering, which allows multiple frames to be processed on-the-fly simultaneously. This kind of approach generally requires the scene you’re rendering to be duplicated on all GPUs, and requires outputs of one frame’s GPU to be copied to the inputs to the next frame’s GPU.
AFR can unfortunately limit your design. For example, dependencies between frames can be difficult to implement efficiently. To solve this problem, instead of pipelining at the granularity of frames with AFR, one might pipeline within a frame. For example, half of the frame can be processed on one GPU, then finished on another GPU. In theory, these pipelining approaches should increase throughput, while possibly increasing latency due to the extra overhead of copying data between GPUs (between stages of the pipeline.) For this reason, we have to be careful about the overhead of copies
A great overview of multi-adapter, AFR, and frame pipelining was given in Juha Sjöholm’s GDC 2016 talk: Explicit Multi GPU Programming with DirectX 12
Task-Parallelism
With a good data-parallel division of our work, we can theoretically easily split our work into tasks, then distribute them among GPUs. However, there’s fundamentally a big difference in the ideal level of granularity of parallelism between low-latency (real-time) users and high-throughput (offline) users. For example, work that can be done in parallel within one frame is not always worth running on multiple GPUs, since the overhead of communication might nullify the gains. In general:
- Real-time programs don’t have much choice outside of parallelism within one frame (or a few frames), since they want to minimize latency, and they can’t predict future user controller inputs anyways.
- Offline programs might know the entire domain of inputs ahead of time, so they can arbitrarily parallelize without needing to use parallelism within one frame.
If our goal is to render 100 frames of video for a 3D movie, we could split those 100 frames among the available GPUs and process them in parallel. Similarly, if we want to run a machine learning classification algorithm on 1000 images, we can also probably split that arbitrarily between GPUs. We can even deal with varying performance of available GPUs relatively easily: Put the 1000 tasks in a queue, and let GPUs pop them and process them as fast as they allow, perhaps using a work-stealing scheduler if you want to get fancy with load-balancing.
In the case of a real-time application, we’re motivated to use parallelism within each frame to bring content to the user’s face as fast as possible. To avoid the overhead of communication, we might be motivated to split work into coarse chunks. Allow me to elaborate.
Coarse Tasks
To minimize the overhead of communication between GPUs, we should try to run large independent portions of the task graph on the same GPU. Parts of the task graph that run serially are an obvious candidate for running on only one GPU, although you may be able to pipeline those parts.
One way to separate an engine into coarse tasks is to split them based on their purpose. For example, you might separate your project into a GUI rendering component, a fluid simulation component, a skinning component, a shadow mapping component, and a scene rendering component. From there, you can roughly allocate each component to a GPU. Splitting code among high-level components seems like an obvious solution, but I’m worried that we’ll get similar problems as the “system-on-a-thread” design for multi-threading.
With such a coarse separation of components, we have to be careful to allocate work among GPUs in a balanced way. If we split work uniformly among GPUs with varying capabilities, then we can easily be bottlenecked by the weakest GPU. Therefore, we might want to again put our tasks in a queue and distribute them among GPUs as they become available. In theory, we can further mitigate this problem with a fork/join approach. For example, if a GPU splits one of its tasks in half, then a more powerful GPU can pick up the second half of the problem while the first half is still being processed by the first GPU. This approach might work best on linked adapters, since they can theoretically share memory more efficiently.
An interesting approach to load-balancing can be found in GPU Pro 7 chapter 5.4: “Semi-static Load Balancing for Low-Latency Ray Tracing on Heterogeneous Multiple GPUs”. It works by roughly splitting the framebuffer among GPUs to ray trace a scene, and alters the distribution of the split dynamically based on results of previous frames.
One complication of distributing tasks among GPUs is that we might want to run a task on the same GPU at each frame, to avoid having to copy the input state of the task to run it on a different GPU. I’m not sure if there’s an obvious solution to this problem, maybe it’s just something to integrate into a heuristic cost model for the scheduler.
A Note On Power
One quite difficult problem with multi-adapter has to do with power. If a GPU is not used for a relatively short period of time, it’ll start clocking itself down to save power. In other words, if you have a GPU that runs a task each frame then waits for another GPU to finish, it’s possible for that first GPU to start shutting itself down. This becomes a problem on the next frame, since the GPU will have to spin up once again, which takes a non-trivial amount of time. As a final result, the code ends up running slower on multi-adapter than it does in single-adapter, despite even the most obvious opportunities for parallelism.
One might suggest to force the GPU to keep running at full power to solve this problem. It’s not so obvious, since drawing power from idle cores takes away power from the cores that need it. This is especially an issue on integrated GPUs, since the GPU would steal juice from the CPU, despite the CPU probably needing that power to run non-GPU code during the rest of the frame. Of course, power-hungry applications are also generally not welcome on battery-operated devices like laptops or phones.
Does this problem have a solution? Hard to say! As a guideline, it might be important to use GPUs only if you plan to utilize them well, and be careful about CPU-GPU tradeoffs on integrated GPUs. We might need help from hardware and OS people to figure this out properly.
NUMA-aware Task Scheduling
An important challenge of multi-adapter code is that memory allocations have an affinity to a given processor, which means that the cost of memory access increases dramatically when the memory does not belong to the processor accessing it. This scenario is known as “Non-uniform memory access”, aka. “NUMA”. It’s a common problem in heterogeneous and distributed systems, and is also a well-known problem in server computers that have more CPU cores than a single motherboard socket can support, which result in multi-socket CPU configurations where each socket/CPU has a set of RAM chips closer to it than others.
There exist some strategies to deal with scheduling tasks in a NUMA-aware manner. I’ll list some from the literature.
Deferred allocation is a way to guarantee that output memory is local to the NUMA node. It simply consists of allocating the output memory only at the time of the task being scheduled, which allows the processor that was scheduled to perform the allocation right-then-and-there in its local memory, thus guaranteeing locality.
Work-pushing is a method to select a worker to which a task should be sent. In other words, it’s the opposite of work-stealing. The target worker is picked based on a choice of heuristic. For example, the heuristic might try to push tasks to the node that owns the task’s inputs, or it might try to push work to the node that own’s the task’s outputs, or the heuristic might combine ownership of inputs and outputs in its decision.
Work-stealing can also be tweaked for NUMA purposes, by tweaking the work-stealing algorithm to first steal work from nearby NUMA nodes first. This might apply itself naturally to the case of sharing work between linked adapters.
Conclusion
Direct3D 12 enables much more fine-grained control over use of multiple GPUs, whether though linked adapters or through heterogeneous hardware components. Enthusiast gamers, professional users, and GPU compute datacenters stand to benefit from good use of this tech, which motivated a search for designs that use multi-adapter effectively. On this front, we discussed Alternate-Frame-Rendering (AFR), and discussed the design of more general task-parallel systems. The design of a task-parallel engine depends a lot on your use case, and there are many unsolved and non-obvious areas of this design space. For now, we can draw inspiration from existing research on NUMA systems and think about how it applies to the design of our GPU programs.