Simple Live C++ Reloading in Visual Studio

I often work on very small projects that have a focus on getting results quickly.

For example:

  • Trying to make a proof-of-concept of an idea.
  • Implementing an algorithm from a paper or an article.
  • Doing something “artistic” where rapid iteration is key.

In these kinds of projects, the effort is often focused on a few important functions. For example, the function that renders the scene, or the function that transforms some geometry. My projects are usually focused on real-time rendering, so I render a scene at an interactive frame rate, while applying a special algorithm every frame to transform the geometry or render the image.

It’s important for me to be able to design and debug the key function of the project as productively as possible. I want to try new things quickly, and I want to be able to visualize my results to better understand my code. To do this, I take many approaches.

In this article, I describe another approach to improve iteration speed:

Live C++ reloading.

Live C++ reloading means that you can build and reload C++ code without needing to restart your program. This allows you to quickly see the effects of changes to your code, which greatly helps productivity.

Live reloading C++ code can also be useful if your program takes a long time to initially load its data. For example, if you have to load a large 3D scene every time you restart your program, that will probably hinder your productivity. If you can update C++ code without needing to restart your program, then you don’t have to reload the scene to see your changes. Also, it allows you to keep the 3D camera in one place while you change your code, so you can more easily debug a problem that only happens from a specific viewpoint.

Of course, there are many different ways to reload C++ code. In this article, I describe a simple and stupid way to do it, which is easy to set up in a Visual Studio project.

How To Live-Reload C++

For completeness, let’s start from scratch. If you want to add this to an existing project, you can skip some steps.

If you just want to see the completed project, please visit the GitHub page:

Step 1. Setup a new solution and main project

Create a new solution and project with Visual Studio.

This project will serve as the “main” project for your application.



Create a main.cpp for this project. We will add the live-reloading code to it later.



Here’s some dummy code for a “real-time” update loop:

#include <cstdio>

#include <Windows.h>

int main()
{
    while (true)
    {
        printf("Hello, world!\n");
        Sleep(1000);
    }
}

If you run it, this should display “Hello, world!” every second in a loop.


Step 2. Setup a project for live-reloaded code.

Right click on your solution and add a new project to it.

This project will be used for the code we want to live-reload.



Configure this project to build as a DLL using its properties.


By the way, use “All Configurations” and “All Platforms” when using the Property Pages. Otherwise your changes might not update all the Debug/Release/x86/x64 builds.


Now set a dependency from the main project to the DLL project, to make sure that building the main project also builds the DLL project.



Now add a header to the DLL project that defines the interface to your live-reloaded code.



Here’s some placeholder code for live_reloaded_code.h that you can extend:

#pragma once

#ifdef LIVE_RELOADED_CODE_EXPORTS
#define LIVE_RELOADED_CODE_API __declspec(dllexport)
#else
#define LIVE_RELOADED_CODE_API __declspec(dllimport)
#endif

extern "C" LIVE_RELOADED_CODE_API void live_reloaded_code();

Now repeat the process to add a cpp source file to the DLL project.



Here’s some placeholder code for live_reloaded_code.cpp that you can extend:

#include "live_reloaded_code.h"

#include <cstdio>

void live_reloaded_code()
{
    printf("Hello, Live Reloaded World!\n");
}

Next, add the preprocessor definition LIVE_RELOADED_CODE_EXPORTS to the DLL project. This makes the DLL project “export” the DLL functions, while the main project will “import” them. Not all of this is strictly necessary for live-reloading, but it’s the proper setup for a DLL.



Step 3. The live-reloading mechanism

Go back to the original main.cpp. We now add the code that does the live-reloading. It’s around 100 lines of code, so I won’t paste it in this blog post. Instead, please go to the GitHub repository and copy the code from there.

Here is the link, for your convenience:

I tried to write descriptive comments, so hopefully the code is self-explanatory. Here’s a summary of how it works:

The code will poll the timestamp of the DLL file that contains the reloadable functions at every update. When the DLL’s timestamp changes, the DLL will be reloaded and the functions in it will be reloaded too. We make a copy of the DLL before loading it, because if we don’t then the DLL will fail to rebuild, because its file can’t be overwritten while we are using it. Finally, before calling a live-reloaded function, we cast it to a function pointer type that has the correct function signature. In the sample code, C++11’s auto and decltype features are used to do this cast, which avoids redundantly re-defining the function’s type.
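To illustrate the casting trick without the Win32 plumbing, here’s a small portable sketch. The names are hypothetical, and a toy int(int) signature is used instead of the real void() so the result is checkable; the idea is the same as in the sample code:

```cpp
#include <cassert>

// Toy stand-in for the DLL export (the real one is declared in
// live_reloaded_code.h).
extern "C" int live_reloaded_code(int x) { return x + 1; }

// Simulate GetProcAddress, which hands back an untyped pointer.
void* GetProcSimulated() {
    return reinterpret_cast<void*>(&live_reloaded_code);
}

int CallThroughReload() {
    void* raw = GetProcSimulated();
    // decltype(&live_reloaded_code) recovers the full function pointer
    // type from the declaration, so the cast never drifts out of sync
    // with the header.
    auto fn = reinterpret_cast<decltype(&live_reloaded_code)>(raw);
    return fn(41);
}
```

Because the function pointer type is derived from the declaration itself, changing the function’s signature in the header automatically fixes the cast, or turns a mismatch into a compile error instead of a silent corruption.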

Step 4. Using the live-reloading

Now that all the code is ready, we can have fun with live-reloading. To do this, we launch the program “without debugging”, because that allows Visual Studio to build code even if the program is running.

First, make sure the main project is the startup project:


Next, start the project “without debugging”. You can do this by pressing “Ctrl+F5”, or you can go to the following menu:


Now, the program should be running, and displaying the message we wrote earlier:


While this program is running, go to live_reloaded_code.cpp, and modify the message in the printf. After that, save the file, and build the project by pressing “Ctrl+Shift+B”, or by going to the following menu:


After the build, you should see the output of your program change:


That’s it! Have fun!

Known Issues

It might rarely crash, depending on how you have it set up. I don’t know why, but it has been reliable enough for me.


D3D12 Shader Live-Reloading


I previously wrote about ShaderSet, which was my attempt at making a clean, efficient, and simple shader live-reloading interface for OpenGL 4.

Since ShaderSet was so fun to use, I wanted to have the same thing in my D3D12 coding. As a result, I came up with PipelineSet. This class makes it easy to live-reload shaders, while encapsulating the complexity of compiling pipeline state in a multi-threaded fashion, and also allowing advanced usage to fit your rendering engine’s needs.

Show Me The Code

In summary, the interface looks something like what follows. I tried to show how it fits into the design of a component-based renderer.

// Example component of the renderer
class MyRenderComponent
{
    ID3D12RootSignature** mppRS;
    ID3D12PipelineState** mppPSO;

public:
    void Init(IPipelineSet* pPipeSet)
    {
        // set up your PSO desc
        D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = { ... };

        // associate the compiled shader file names to shader stages
        GraphicsPipelineFiles files;
        // note: scene.vs.cso also contains root signature
        files.RSFile = L"scene.vs.cso";
        files.VSFile = L"scene.vs.cso";
        files.PSFile = L"scene.ps.cso";

        std::tie(mppRS, mppPSO) = pPipeSet->AddPipeline(desc, files);
    }

    void WriteCmds(ID3D12GraphicsCommandList* pCmdList)
    {
        if (!*mppRS || !*mppPSO)
            return; // not compiled yet, or failed to compile

        // TODO: Set root parameters and etc
    }
};

std::shared_ptr<IPipelineSet> pPipeSet;

void RendererInit()
{
    pPipeSet = IPipelineSet::Create(pDevice, kMaximumFrameLatency);

    // let each component add its pipelines
    foreach (component in renderer)
        component.Init(pPipeSet.get());

    // Kick off building the pipelines.
    // Can no longer add pipelines after this point.
    HANDLE hBuild = pPipeSet->BuildAllAsync();

    // wait for pipelines to finish building
    if (WaitForSingleObject(hBuild, INFINITE) != WAIT_OBJECT_0) {
        fprintf(stderr, "BuildAllAsync fatal error\n");
    }
}

void RendererUpdate()
{
    // updates pipelines that have reloaded since last update
    // also garbage-collects unused pipelines after kMaximumFrameLatency updates
    pPipeSet->UpdatePipelines();

    foreach (component in renderer)
        component.WriteCmds(pCmdList);
}


The big idea is to add pipeline descs to the PipelineSet, and those descs don’t need to specify bytecode for their shader stages. Instead, the names of the compiled shader objects for each shader stage are passed through the “GraphicsPipelineFiles” or “ComputePipelineFiles” struct.

Each added pipeline returns a double-pointer to its root signature and pipeline state. This indirection allows the root signature and pipeline state to be reloaded, and also allows code to deal with the PipelineSet in an abstract manner. (It’s “just a double pointer”, not a PipelineSet-specific class.)

From there, BuildAllAsync() will build all the pipelines in the PipelineSet in a multi-threaded fashion, using the Windows Threadpool. When the returned handle is signaled, that means the compilation has finished.

Finally, you must call UpdatePipelines() every frame. This does two things: First, it updates any pipelines and root signatures that have been reloaded since the last update. Second, it garbage-collects any root signatures and pipelines that are no longer used (i.e. because they have been replaced by their newly reloaded versions). This garbage collection is done by deleting the resources only after kMaximumFrameLatency updates have passed. This works because that delay exceeds the depth of your CPU-to-GPU pipeline, so it’s guaranteed that no in-flight frame on the GPU still references the old pipeline state.
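The deferred-deletion idea above can be sketched in isolation (this is a hypothetical illustration, not the actual PipelineSet code; the int “ids” stand in for Release() calls on real D3D12 objects):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Objects retired on frame F are destroyed only after frameLatency
// further updates, when no in-flight GPU frame can still reference them.
struct DeferredDeleter {
    std::size_t frameLatency = 0;
    std::size_t currentFrame = 0;
    std::vector<std::pair<std::size_t, int>> pending; // (retire frame, id)
    std::vector<int> destroyed;                       // stands in for Release()

    void Retire(int id) { pending.push_back({currentFrame, id}); }

    void Update() {
        ++currentFrame;
        for (auto it = pending.begin(); it != pending.end();) {
            if (currentFrame - it->first >= frameLatency) {
                destroyed.push_back(it->second); // safe to release now
                it = pending.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```

With frameLatency = 3, an object retired now survives two more Update() calls and is destroyed on the third, mirroring how kMaximumFrameLatency is used.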

The Workflow

IPipelineSet is designed to work along with Visual Studio’s built-in HLSL compiler. The big idea is to rebuild your shaders from Visual Studio while your program is running. This works quite conveniently, since Visual Studio’s default behavior for .hlsl files is to compile them to .cso (“compiled shader object”) files that can be loaded directly as bytecode by D3D12.

Normally, Visual Studio will force you to stop debugging if you want to rebuild your solution. However, if you “Start Without Debugging” (or hit Ctrl+F5 instead of just F5), then you can still build while your program is running. From there, you can make changes to your HLSL shaders while your program is running, and hit Ctrl+Shift+B to rebuild them live. The IPipelineSet will then detect a change in your cso files, and live-reload any affected root signatures and pipeline state objects.

To maintain bindings between shaders and C++, I used a so-called “preamble file” in ShaderSet. This preamble is not necessary with HLSL, since we can use its native #include functionality. Using this feature, I create a .hlsli file (the HLSL equivalent of a C header) for the shaders I use. For example, if I have two shaders “scene.vs.hlsl” and “scene.ps.hlsl”, I create a third file “scene.rs.hlsli”, which contains two things:

  1. The Root signature, as #define SCENE_RS “RootFlags(0), etc”
  2. The root parameter locations, like #define SCENE_CAMERA_CBV_PARAM 0

I include this rs.hlsli file from my vertex/pixel shaders, then put [RootSignature(SCENE_RS)] before their main. From there, I pick registers for buffers/textures/etc using the conventions specified in the root signature.

I also include this rs.hlsli file from my C++ code, which lets me directly refer to the root parameter slots in my code that sets root signature parameters.

As an example, let’s suppose I want to render a 3D model in a typical 3D scene. The vertex shader transforms each vertex by the MVP matrix, and the pixel shader reads from a texture to color the model. I might have a scene.rs.hlsli as follows:


#ifndef SCENE_RS_HLSLI
#define SCENE_RS_HLSLI

#define SCENE_RS \
"RootFlags(ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT)," \
"CBV(b0, visibility=SHADER_VISIBILITY_VERTEX)," \
"DescriptorTable(SRV(t0), visibility=SHADER_VISIBILITY_PIXEL)," \
"StaticSampler(s0, visibility=SHADER_VISIBILITY_PIXEL)"

#define SCENE_RS_MVP_CBV_PARAM 0
#define SCENE_RS_TEX0_SRV_PARAM 1

#endif // SCENE_RS_HLSLI

This code defines the root signature for use in HLSL. (See: Specifying Root Signatures in HLSL) The defines at the bottom correspond to root parameter slots, and they match the order of root parameters specified in the root signature string.

The vertex shader scene.vs.hlsl would then be something like:

#include "scene.rs.hlsli"

cbuffer MVPCBV : register(b0) {
    float4x4 MVP;
};

struct VS_INPUT {
    float3 Position : POSITION;
    float2 TexCoord : TEXCOORD;
};

struct VS_OUTPUT {
    float4 Position : SV_Position;
    float2 TexCoord : TEXCOORD;
};

[RootSignature(SCENE_RS)]
VS_OUTPUT main(VS_INPUT input)
{
    VS_OUTPUT output;
    output.Position = mul(float4(input.Position, 1.0), MVP);
    output.TexCoord = input.TexCoord;
    return output;
}
Notice that the register b0 is chosen so it matches what was specified in the root signature in scene.rs.hlsli. Also notice the [RootSignature(SCENE_RS)] attribute above the main.

From there, the pixel shader might look like this:

#include "scene.rs.hlsli"

Texture2D Tex0 : register(t0);
SamplerState Smp0 : register(s0);

struct PS_INPUT {
    float4 Position : SV_Position;
    float2 TexCoord : TEXCOORD;
};

struct PS_OUTPUT {
    float4 Color : SV_Target;
};

[RootSignature(SCENE_RS)]
PS_OUTPUT main(PS_INPUT input)
{
    PS_OUTPUT output;
    output.Color = Tex0.Sample(Smp0, input.TexCoord);
    return output;
}
Again notice that the registers for the texture and sampler match those specified in the root signature, and notice the RootSignature attribute above the main.

Finally, I call this shader from my C++ code. I include the header from the source file of the corresponding renderer component, I set the root signature parameters, and make the call. It might be something similar to this:

#include "scene.rs.hlsli"

class SceneRenderer
{
    ID3D12RootSignature** mppRS;
    ID3D12PipelineState** mppPSO;

public:
    void Init(IPipelineSet* pPipeSet)
    {
        D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = { ... };

        GraphicsPipelineFiles files;
        files.RSFile = L"scene.vs.cso";
        files.VSFile = L"scene.vs.cso";
        files.PSFile = L"scene.ps.cso";

        std::tie(mppRS, mppPSO) = pPipeSet->AddPipeline(desc, files);
    }

    void WriteCmds(
        BufferAllocator* pPerFrameAlloc,
        ID3D12GraphicsCommandList* pCmdList)
    {
        if (!*mppRS || !*mppPSO)
            return; // not compiled yet, or failed to compile

        float4x4* pCPUMVP;
        D3D12_GPU_VIRTUAL_ADDRESS pGPUMVP;
        std::tie(pCPUMVP, pGPUMVP) = pPerFrameAlloc->allocate(
            sizeof(float4x4), D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT);

        *pCPUMVP = MVP;

        pCmdList->SetGraphicsRootSignature(*mppRS);
        pCmdList->SetPipelineState(*mppPSO);
        pCmdList->SetGraphicsRootConstantBufferView(
            SCENE_RS_MVP_CBV_PARAM, pGPUMVP);

        /* TODO: Set other GPU state */
    }
};

There are things going on here that aren’t strictly the topic of this article, but I’ll explain them anyway, because I think they’re very useful for writing D3D12 code.

I use a big upload buffer each frame to write all my CBV allocations to; that’s the purpose of pPerFrameAlloc. Its allocate() function returns both a CPU (mapped) pointer and the corresponding GPU virtual address for the allocation, which allows me to write to the allocation from the CPU, then pass the GPU VA while writing commands.

In this case, the per-frame allocation is an upload buffer, so I don’t need to explicitly copy from CPU to GPU (the shader will just read from host memory.) An alternate implementation could use an additional allocator for a default heap, and explicitly make a copy from the upload heap to the default heap.

The per-frame allocator is a simple lock-free linear allocator, so I can use it to make allocations from multiple threads, if I’m recording commands from multiple threads.
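Such a lock-free linear allocator might look like the following sketch. The names and details are hypothetical; in the real thing the buffer would be a mapped D3D12 upload heap, and the returned offset would also yield a GPU virtual address:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lock-free per-frame linear allocator: a single atomic bump pointer,
// so any number of threads can allocate concurrently.
class LinearAllocator {
    uint8_t* mBase;
    size_t mCapacity;
    std::atomic<size_t> mOffset{0};

public:
    LinearAllocator(uint8_t* base, size_t capacity)
        : mBase(base), mCapacity(capacity) {}

    // alignment must be a power of two.
    // Returns nullptr when the frame's budget is exhausted.
    void* allocate(size_t size, size_t alignment) {
        for (;;) {
            size_t old = mOffset.load(std::memory_order_relaxed);
            size_t aligned = (old + alignment - 1) & ~(alignment - 1);
            if (aligned + size > mCapacity)
                return nullptr;
            // claim [aligned, aligned + size); retry if another thread won
            if (mOffset.compare_exchange_weak(old, aligned + size))
                return mBase + aligned;
        }
    }

    // call once the frame is retired, to recycle the whole buffer
    void reset() { mOffset.store(0); }
};
```

The compare-exchange loop is what makes it safe to call from multiple command-recording threads without a lock.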

I could do something similar to the per-frame allocator for descriptors (such as Tex0’s SRV), or I could create the descriptor once up-front in Init(). It’s really up to you.

When the time comes to finally specify the root parameters, I do it using the defines from the included scene.rs.hlsli, such as SCENE_RS_MVP_CBV_PARAM. This makes sure my C++ code stays synchronized with the HLSL code.

In Summary

IPipelineSet implements D3D12 shader live-reloading. It encapsulates the concurrent code used to reload shaders, and encapsulates the parallel code that accelerates PSO compilation through multi-threading. It integrates with code without that code needing to be aware of PipelineSet (it’s “just a double-pointer”), and garbage collection is handled efficiently and automatically. Finally, PipelineSet is designed for a workflow using Visual Studio that makes it easy to rebuild shaders while your program is running, and allows you to easily share resource bindings between HLSL and C++.

There are a bunch more advanced features. For example, it’s possible to supply an externally created root signature or shader bytecode, and it’s possible to “steal” signature/pipeline objects from the live-reloader by manipulating the reference count. See the comments in pipelineset.h for details.

You can download PipelineSet from GitHub:

You can integrate it into your codebase by just adding pipelineset.h and pipelineset.cpp to your project. It should “just work”, assuming you have D3D12 and DXGI linked up already.

Comments, critique, pull requests, all welcome.

Intel GPU Assembly with PIX Beta

This is a short tutorial on how you can disassemble your HLSL shaders into Intel GPU (aka Gen) assembly using the newly released PIX tool.

I suspect many of these steps will be simpler in the future. If you’re reading this guide long after its publishing date, you can probably ignore most of the steps.

Step 1: Installing PIX

Download the PIX Beta:

Should be nothing surprising here.

Step 2: Installing Beta GPU Drivers

Install Beta Intel GPU drivers.

First, you must disable automatic driver updates, otherwise Windows will automatically uninstall your Beta drivers. There are instructions on how to do this here:

  • Side note: Disabling driver updates seems to be a really convoluted process. Windows is relentless in trying to stop me from using Beta drivers. It’s annoying, but hopefully this won’t be a problem in the future when mainstream drivers have the required features for PIX.

Next, uninstall your current graphics drivers. Open up the Device Manager, go under “Display adapters”, right click your drivers and choose “Uninstall”.

Next, install the Beta graphics drivers.

  1. Download the Beta Intel GPU drivers from here:
  2. Download the zip version of the drivers (not the exe), and unzip them.
  3. Back in the Device Manager, click “Action>Add legacy hardware”.
  4. Choose “Install the hardware that I manually select from a list (Advanced)”.
  5. Choose “Display adapters”.
  6. Click “Have Disk”.
  7. In the “Install From Disk” window, click “Browse”.
  8. Pick the inf file (eg: “igdlh64.inf”) from the “Graphics” folder of the drivers.
  9. Click “OK”, then pick the GPU model that corresponds to your computer.
  10. Keep clicking Next until it’s done installing.

Step 3: Disassembly in PIX

Run your program with PIX by setting the executable path and working directory. Unless you have a UWP app, you probably want to “Launch Win32”.


From there, click “Launch”.

Next, press Print Screen, or click the camera button in PIX, to capture a frame of rendering. Double-click on the small picture of your capture that appears in PIX.

If you see an error popup here, it’s probably because your drivers are not updated enough (or, more likely, Windows automatically reverted your Beta update behind your back).

Once your capture is open, click on the “Pipeline” tab. Then, click the “Click here to start analysis” text that appears in the window in the bottom half of PIX.


Next, in the Events window at top left (seen in the previous screenshot), click on the “Dispatch” or “Draw” event whose disassembled shaders you want to see. Click on the shader stage you want in the bottom left of the window, then click on “Disassembly” (as below). And voila! Gen assembly for your shader!


D3D12 Multi-Adapter Survey & Thoughts


Direct3D 12 opens up a lot of potential by making it possible to write GPU programs that make use of multiple GPUs. For example, it’s possible to write programs that distribute work among multiple GPUs from linked GPUs (eg: NVIDIA SLI or AMD Crossfire), or even between GPUs from different hardware vendors.

There are many ways to make use of these multi-adapter features, but it’s not obvious yet (at least to me) how to best make use of it. In theory, we should try to make full use of all available hardware on a given computer, but there are difficult problems to solve along the way. For example:

  • How can we schedule GPU tasks to minimize communication overhead between different GPUs?
  • How can we distribute tasks among hardware that vary in performance?
  • How can we use special hardware features? eg: “free” CPU-GPU memory sharing on integrated GPUs.

D3D12 Multi-Adapter Features Overview

To better support multiple GPUs, Direct3D 12 brings two main features:

  1. Cross-adapter memory, which allows one GPU to access the memory of another GPU.
  2. Cross-adapter fences, which allow one GPU to synchronize its execution with another GPU.

Working with multiple GPUs in D3D12 is done explicitly, meaning that sharing memory and synchronizing GPUs must be taken into consideration by the rendering engine, as opposed to being “automagically” done inside GPU drivers. This should lead to more efficient use of multiple GPUs. Furthermore, integrating shared memory and fences into the API allows you to avoid making round-trips to the CPU to interface between GPUs.

For a nice quick illustrated guide to the features described above, I recommend the following article by Nicolas Langley: Multi-Adapter Support in DirectX 12.

D3D12 supports two classes of multi-adapter setups:

  1. Linked Display Adapters (LDA) refers to linked GPUs (eg: NVIDIA SLI/AMD Crossfire). They are exposed as a single ID3D12Device with multiple “nodes”. D3D12 APIs allow you to specify a bitset of nodes when the time comes to specify which node to use, or which nodes should share a resource.
  2. Multiple Display Adapters (MDA) refers to multiple different GPUs installed on the same system. For example, you might have both an integrated GPU and a discrete GPU in the same computer, or you might have two discrete GPUs from different vendors. In this scenario, you have a different ID3D12Device for each adapter.

Another neat detail of D3D12’s multi-adapter features is Standard Swizzle, which allows GPU and CPU to share swizzled textures using a convention on the swizzled format.

Central to multi-adapter code is the fact that each GPU node has its own set of command queues. From the perspective of D3D12, each GPU has a rendering engine, a compute engine, and a copy engine, and these engines are fed through command queues. Using multiple command queues can help the GPU schedule independent work, especially in the case of copy or compute queues. It’s also possible to tweak the priority of each command queue, which makes it possible to implement background tasks.

Use-Cases for Multi-Adapter

One has to wonder who can afford the luxury of owning multiple GPUs in one computer. Considering that multi-adapter wasn’t properly supported before D3D12, it was probably barely worth thinking about, other than scenarios explicitly supported by SLI/Crossfire. In this section, I’ll try to enumerate some scenarios where the user might have multiple GPUs.

“Enthusiast” users with multiple GPUs:

  • Linked SLI/Crossfire adapters.
  • Heterogeneous discrete GPUs.
  • Integrated + discrete GPU.

“Professional” users:

  • Tools for 3D artists with fancy computers.
  • High-powered real-time computer vision equipment.

“Datacenter” users:

  • GPU-accelerated machine-learning.
  • Engineering/physics simulations (fluids, particles, erosion…)

Another potentially interesting idea is to integrate CPU compute work in DirectX by using the WARP (software renderer) adapter. It seems a bit unfortunate to tie everyday CPU work into a graphics API. I guess it might lead to better CPU-GPU interop, or it might open opportunities to experiment with moving work between CPU and GPU and see performance differences. This is similar to using OpenCL to implement compute languages on CPU.

Multi-adapter Designs

There are different ways to integrate multi-adapter into a DirectX program. Let’s consider some options.

Multi-GPU Pipelining

Pipelining with multiple GPUs comes in different flavors. For example, Alternate Frame Rendering (AFR) consists of alternating between GPUs with each frame of rendering, which allows multiple frames to be processed on-the-fly simultaneously. This kind of approach generally requires the scene you’re rendering to be duplicated on all GPUs, and requires outputs of one frame’s GPU to be copied to the inputs to the next frame’s GPU.

AFR can unfortunately limit your design. For example, dependencies between frames can be difficult to implement efficiently. To solve this problem, instead of pipelining at the granularity of frames with AFR, one might pipeline within a frame. For example, half of the frame can be processed on one GPU, then finished on another GPU. In theory, these pipelining approaches should increase throughput, while possibly increasing latency due to the extra overhead of copying data between GPUs (between stages of the pipeline.) For this reason, we have to be careful about the overhead of copies.

A great overview of multi-adapter, AFR, and frame pipelining was given in Juha Sjöholm’s GDC 2016 talk: Explicit Multi GPU Programming with DirectX 12


With a good data-parallel division of our work, we can in theory easily split our work into tasks, then distribute them among GPUs. However, there’s fundamentally a big difference in the ideal level of granularity of parallelism between low-latency (real-time) users and high-throughput (offline) users. For example, work that can be done in parallel within one frame is not always worth running on multiple GPUs, since the overhead of communication might nullify the gains. In general:

  • Real-time programs don’t have much choice outside of parallelism within one frame (or a few frames), since they want to minimize latency, and they can’t predict future user controller inputs anyways.
  • Offline programs might know the entire domain of inputs ahead of time, so they can arbitrarily parallelize without needing to use parallelism within one frame.

If our goal is to render 100 frames of video for a 3D movie, we could split those 100 frames among the available GPUs and process them in parallel. Similarly, if we want to run a machine learning classification algorithm on 1000 images, we can also probably split that arbitrarily between GPUs. We can even deal with varying performance of available GPUs relatively easily: Put the 1000 tasks in a queue, and let GPUs pop them and process them as fast as they allow, perhaps using a work-stealing scheduler if you want to get fancy with load-balancing.
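The queue-of-tasks idea can be sketched with a shared atomic counter standing in for the queue. The “GPUs” here are just threads (all names hypothetical), but the load-balancing behavior is the same: each worker pops tasks as fast as it can, so a faster device naturally ends up doing more of the work.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Distribute numTasks independent "frames" between two workers.
// Returns which worker processed each task.
std::vector<int> DistributeFrames(int numTasks) {
    std::atomic<int> next{0};           // the shared task queue
    std::vector<int> doneBy(numTasks, -1);

    auto worker = [&](int gpuIndex) {
        for (;;) {
            int task = next.fetch_add(1); // pop the next task
            if (task >= numTasks)
                break;
            doneBy[task] = gpuIndex;      // "render" frame #task
        }
    };

    std::thread gpu0(worker, 0), gpu1(worker, 1);
    gpu0.join();
    gpu1.join();
    return doneBy;
}
```

A work-stealing scheduler refines this by giving each worker its own deque and stealing only when idle, which reduces contention on the shared counter.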

In the case of a real-time application, we’re motivated to use parallelism within each frame to bring content to the user’s face as fast as possible. To avoid the overhead of communication, we might be motivated to split work into coarse chunks. Allow me to elaborate.

Coarse Tasks

To minimize the overhead of communication between GPUs, we should try to run large independent portions of the task graph on the same GPU. Parts of the task graph that run serially are an obvious candidate for running on only one GPU, although you may be able to pipeline those parts.

One way to separate an engine into coarse tasks is to split them based on their purpose. For example, you might separate your project into a GUI rendering component, a fluid simulation component, a skinning component, a shadow mapping component, and a scene rendering component. From there, you can roughly allocate each component to a GPU. Splitting code among high-level components seems like an obvious solution, but I’m worried that we’ll get similar problems as the “system-on-a-thread” design for multi-threading.

With such a coarse separation of components, we have to be careful to allocate work among GPUs in a balanced way. If we split work uniformly among GPUs with varying capabilities, then we can easily be bottlenecked by the weakest GPU. Therefore, we might want to again put our tasks in a queue and distribute them among GPUs as they become available. In theory, we can further mitigate this problem with a fork/join approach. For example, if a GPU splits one of its tasks in half, then a more powerful GPU can pick up the second half of the problem while the first half is still being processed by the first GPU. This approach might work best on linked adapters, since they can theoretically share memory more efficiently.

An interesting approach to load-balancing can be found in GPU Pro 7 chapter 5.4: “Semi-static Load Balancing for Low-Latency Ray Tracing on Heterogeneous Multiple GPUs”. It works by roughly splitting the framebuffer among GPUs to ray trace a scene, and alters the distribution of the split dynamically based on results of previous frames.

One complication of distributing tasks among GPUs is that we might want to run a task on the same GPU at each frame, to avoid having to copy the input state of the task to run it on a different GPU. I’m not sure if there’s an obvious solution to this problem, maybe it’s just something to integrate into a heuristic cost model for the scheduler.

A Note On Power

One quite difficult problem with multi-adapter has to do with power. If a GPU is not used for a relatively short period of time, it’ll start clocking itself down to save power. In other words, if you have a GPU that runs a task each frame then waits for another GPU to finish, it’s possible for that first GPU to start shutting itself down. This becomes a problem on the next frame, since the GPU will have to spin up once again, which takes a non-trivial amount of time. As a final result, the code ends up running slower on multi-adapter than it does in single-adapter, despite even the most obvious opportunities for parallelism.

One might suggest to force the GPU to keep running at full power to solve this problem. It’s not so obvious, since drawing power from idle cores takes away power from the cores that need it. This is especially an issue on integrated GPUs, since the GPU would steal juice from the CPU, despite the CPU probably needing that power to run non-GPU code during the rest of the frame. Of course, power-hungry applications are also generally not welcome on battery-operated devices like laptops or phones.

Does this problem have a solution? Hard to say! As a guideline, it might be important to use GPUs only if you plan to utilize them well, and be careful about CPU-GPU tradeoffs on integrated GPUs. We might need help from hardware and OS people to figure this out properly.

NUMA-aware Task Scheduling

An important challenge of multi-adapter code is that memory allocations have an affinity to a given processor, which means that the cost of memory access increases dramatically when the memory does not belong to the processor accessing it. This scenario is known as “Non-uniform memory access”, aka. “NUMA”. It’s a common problem in heterogeneous and distributed systems, and is also a well-known problem in server computers that have more CPU cores than a single motherboard socket can support, which results in multi-socket CPU configurations where each socket/CPU has a set of RAM chips closer to it than the others.

There exist some strategies to deal with scheduling tasks in a NUMA-aware manner. I’ll list some from the literature.

Deferred allocation is a way to guarantee that output memory is local to the NUMA node. It simply consists of allocating the output memory only at the time of the task being scheduled, which allows the processor that was scheduled to perform the allocation right-then-and-there in its local memory, thus guaranteeing locality.
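A minimal sketch of deferred allocation (all names hypothetical): the task only describes its output, and whichever node ends up running it performs the allocation locally at schedule time.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stands in for a NUMA node's local memory.
struct Node {
    std::vector<std::vector<float>> localPool;
};

// A task declares its output size but never allocates it.
struct Task {
    std::size_t outputSize;
    std::function<void(float*)> run;
};

void Execute(Node& node, const Task& task) {
    // Deferred allocation: happens only now, on the executing node,
    // so the output is guaranteed to be local to it.
    node.localPool.emplace_back(task.outputSize);
    task.run(node.localPool.back().data());
}
```

The key point is that the allocation call sits inside Execute() rather than at task-creation time, so locality follows the scheduler’s decision automatically.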

Work-pushing is a method to select a worker to which a task should be sent. In other words, it’s the opposite of work-stealing. The target worker is picked based on a choice of heuristic. For example, the heuristic might try to push tasks to the node that owns the task’s inputs, or it might try to push work to the node that owns the task’s outputs, or the heuristic might combine ownership of inputs and outputs in its decision.

Work-stealing can also be adapted for NUMA purposes, by tweaking the algorithm to steal work from nearby NUMA nodes first. This might apply naturally to the case of sharing work between linked adapters.


Direct3D 12 enables much more fine-grained control over use of multiple GPUs, whether through linked adapters or through heterogeneous hardware components. Enthusiast gamers, professional users, and GPU compute datacenters stand to benefit from good use of this tech, which motivated a search for designs that use multi-adapter effectively. On this front, we discussed Alternate-Frame-Rendering (AFR), and discussed the design of more general task-parallel systems. The design of a task-parallel engine depends a lot on your use case, and there are many unsolved and non-obvious areas of this design space. For now, we can draw inspiration from existing research on NUMA systems and think about how it applies to the design of our GPU programs.

Using cont with tbb::task_group

Note: Previous post on this topic:

In the last post, I showed a proof of concept to implement a “cont” object that allows creating dependencies between TBB tasks in a dynamic and deferred way. What I mean by “dynamic” is that successors can be added at runtime (instead of requiring the task graph to be specified statically). What I mean by “deferred” is that the successor can be added even after the predecessor was created and spawned, in contrast to interfaces where successors need to be created first and only then hooked into their predecessor.

The Goal

The goal of this post was to create an interface for cont that abstracts TBB details from everyday task code. TBB’s task interface is low level and verbose, so I wanted to have something productive and concise on top of it.

Extending tbb::task_group

tbb::task_group is a pretty easy way to spawn a bunch of tasks and let them run. An example use is as follows:

int Fib(int n) {
    if( n<2 ) {
        return n;
    } else {
        int x, y;
        task_group g;
        g.run([&]{x=Fib(n-1);}); // spawn a task
        g.run([&]{y=Fib(n-2);}); // spawn another task
        g.wait();                // wait for both tasks to complete
        return x+y;
    }
}

I wanted to reuse this interface, but also be able to spawn tasks that depend on conts. To do this, I made a derived class from task_group called cont_task_group. It supports the following additional syntax:

cont<int> c1, c2;
cont_task_group g;
g.run([&]{ foo(&c1); });
g.run([&]{ bar(&c2); });
g.with(c1, c2).run([&] { baz(*c1, *c2); });

The with(c...).run(f) syntax spawns a task to run the function f only when all conts in c... are set.

A full example is as follows:

void TaskA(cont<int>* c, int x)
{
    tbb::task_group g;
    g.run([&] {
        // A Subtask 1
    });
    g.run_and_wait([&] {
        // A Subtask 2
    });
    // ... (eventually sets *c)
}

void TaskB(int y)
{
    // ...
}

void TaskC(int z)
{
    std::stringstream ss;
    ss << "TaskC received " << z << "\n";
    std::cout << ss.rdbuf();
}

int main()
{
    cont<int> c;
    cont_task_group g;
    g.run([&] { TaskA(&c, 3); });
    g.run([&] { TaskB(2); });
    g.with(c).run([&] { TaskC(*c); });
    g.wait();
}

This builds the following task dependency graph:

task graph

Sample implementation here: GitHub

TBB Task DAG with Deferred Successors

Note: Previous post on this topic:

I’m thinking about how to implement DAGs of tasks that can be used in a natural way. The problem is that a DAG is a more general construct than what task systems usually allow, due to valid performance concerns. Therefore, if I want to implement a DAG more generally, I need to come up with custom hacks.

Trees vs DAGs

The major difference between a tree and a DAG of tasks is that a DAG allows one task to have an arbitrary number of successor tasks. Also, with a DAG, a task from one subtree of tasks can send data to a task from a different subtree of tasks, which allows you to start tasks when all their inputs are ready rather than when all the previous tasks have run and finished. (Hmm… That makes me think of out-of-order versus in-order processors.) This expands on the functionality of a tree of tasks, since trees only allow outputs to be passed to their immediate parent, whereas DAGs can pass data to grandparents or great-grandparents, or to tasks in subsequent trees.

By default, tasks in task systems like TBB and Cilk are designed to have only one successor: a parent task, or a continuation task. Having a parent task makes it possible to have nested tasks, which is useful for naturally spawning tasks from tasks, similarly to how functions can call other functions. Continuation tasks make it potentially more efficient to spawn a follow-up task to handle the results of a task, and they can do so without affecting the reference count of the parent task.

DAG Implementation

To implement a general DAG, you need to (in one way or another) keep track of the connections between the outputs of some tasks and the inputs of other tasks in a more general way. There are two aspects to this connection:

  1. How is memory allocated for the data passed from predecessor task to successor task?
  2. How is the successor task spawned when its inputs are all satisfied?

According to Intel’s documentation (See: General Acyclic Graphs of Tasks), it’s suggested that the memory for the passed data is stored within the successor task itself. Each task object contains the memory for its inputs, as well as a counter that keeps track of how many inputs need to be received before the task can be spawned. In the TBB example, each task also keeps a list of successors, which allows predecessors to write their outputs to their successor’s inputs, and allows the predecessor to decrement their successor’s count of missing arguments (and can finally spawn the successor task if the predecessor finds that it just gave the successor its final missing input.)

The Problem with DAGs

In the TBB DAG example, all tasks are spawned up-front, which is easy to do in their example since the task graph is structured in a formal way. In my case, I want to use a DAG to implement something that looks like a sequence of function calls, except using tasks instead of functions, to allow different parts of the code to be executed in parallel. I want to use a DAG to make it possible to establish dependencies between these sequential tasks, to allow the programmer to create tasks that start when their inputs are available. In short, instead of creating the tasks up front, I want to create the tasks in roughly sequential order, for the purpose of readability.

The problem with what I want to do is that I can’t directly store the outputs of predecessors inside the successor. Since the predecessors need to be passed a pointer to where their outputs should be stored, the successor (which stores the inputs) needs to be allocated before the predecessors. This means you roughly need to allocate your tasks backwards (successor before predecessor), but spawn the tasks forward (predecessors before successors). I don’t like this pattern, since I’d rather have everything in order (from predecessor to successor). It might not be absolutely as efficient, but I’m hoping that the productivity and readability improvement is worth it.

The Interface

Instead of spawning successor tasks up front, I’d like to allocate the storage for the data up front, similarly to Cilk. Cilk has a “cont” qualifier for variables, which can be used to communicate data from one task to another. For example, the fibonacci example from the Cilk paper contains the following code:

cont int x, y;
spawn_next sum(k, ?x, ?y);
spawn fib (x, n-1);
spawn fib (y, n-2);

This code computes fib(n-1) and fib(n-2), then passes the results of those two computations to a sum continuation task, which implements the addition within fib(n) = fib(n-1) + fib(n-2). The data is passed through the x and y variables, which are marked cont. I don’t know why the Cilk authors put the call to spawn sum before the calls to fib in this example, but perhaps this alternate arrangement of the code is possible:

cont int x, y;
spawn fib (x, n-1);
spawn fib (y, n-2);
spawn_next sum(k, ?x, ?y);

With this alternative arrangement of the code, the order of the tasks being spawned mirrors the equivalent code in plain sequential C:

int x, y;
fib(&x, n-1);
fib(&y, n-2);
sum(&k, x, y);

Implementing cont

If the successor task is allocated before the predecessors, the predecessor that supplies that final missing input to the successor can also spawn the successor. However, if the successor is allocated after the predecessors, it’s possible that all predecessors finish their work before the successor is allocated. If we allocate the successor and find that all its inputs are already available, we can simply immediately spawn it. However, if some inputs of the successor are still not available, then we need to find a way to spawn the successor when those inputs become available.

To spawn successors when a cont becomes available, each cont can keep a list of successors that are waiting for it. When the cont is finally set, it can pass that input to its successors, and potentially also spawn successors if this cont was their final missing input. The difficulty with implementing this system lies in resolving the race condition between the predecessor, the cont, and successors.

Here’s an example of a race condition between predecessor/cont/successor. Suppose a successor task is spawned with some cont inputs. The successor might see that a cont input has not yet been set, so the successor adds itself to the list of successors in the cont. However, it might be possible that the cont suddenly becomes set in a different thread while the successor is adding itself to the cont’s successor list. The thread that is setting the cont might run before it sees the new successor added, so it might not notify the successor that the input is now complete (by decrementing its counter and possibly spawning it.)

cont successor linked list

It might be possible to solve the problem described above if it’s possible for the successor to atomically check that the cont is not satisfied and if so add itself to the cont’s list of successors. This might be possible if the cont uses a linked list of successors, since the successor could do a compare-and-swap that sets a new head for the list only if the input is unsatisfied.

If that compare-and-swap fails because the cont became set just before the CAS, the successor can just treat the input as satisfied and move on to checking the next input. If the CAS fails because another successor registered themselves concurrently, then the registration needs to be tried again in a loop. On the other side of that race condition, if the cont’s compare-and-swap to set its completion fails because a successor added themselves to the list concurrently, then the cont just tries again in a loop, which allows the cont to notify the successor that interrupted it of completion.

If the cont becomes set while the successor is hooking up other inputs, the cont will pass the input and decrement the counter, which itself is not a problem. However, the last cont might finish and spawn the successor before the function hooking the successor finishes, which shouldn’t be a problem as long as nothing else needs to happen after the successor finishes hooking itself up to its inputs. If something does need to happen immediately after the hookups, the successor can initialize its counter with an extra 1, which it only decrements at the end of the hookup (and potentially then spawns the task.)

A detail of this implementation is the allocation of the nodes of the linked list. The successor task needs at least one linked-list node per cont, allocated along with the task itself. Since this is a fixed number, the allocation can be done efficiently.

There’s a difficulty with the compare-and-swap, which is that two different compare-and-swaps need to be done. First, the successor should only be added to the list if the cont is not yet set. Second, appending to a linked list in a CAS can fail if two successors try to add themselves to the list simultaneously. To solve this problem, I propose that the cont’s “is set” boolean is stored as the least significant bit of the head of the linked list. This allows operations on the cont to both switch the head of the list and compare and set completeness simultaneously. If pointers are aligned then the least significant bit is always 0, so no information is lost by reusing that bit. We just need to make sure to mask out that bit before dereferencing any pointers in the linked list.


I tried scratching up an implementation here: (see tbbtest/main.cpp)

The main additions are as follows:

  • cont_base: Base class for “cont” objects, manages an atomic linked list of successor tasks.
  • cont: Derived from cont_base, stores data in “std::optional-style” to pass data from predecessor to successors.
  • spawn_when_ready: Spawns a task when a list of conts are all set. Interface still a bit low-level.

I’ve only done a few tests, so I don’t know if it’s 100% correct. I only really have one test case, which I’ve debugged in a basic way by adding random calls to sleep() to test different orders of execution. I wouldn’t consider using this in production without a thorough analysis. It’s honestly not that much code, but I’m not an expert on the intricacies of TBB.

Also, I was lazy and used sequential consistency for my atomic operations, which is probably overkill. Any lock-free experts in the house? 🙂 (Update: I’ve replaced the seq_cst with acquires and releases. Fingers crossed.)

I’ve also not sugar-coated the syntax much, so there’s still lots of low-level task management. I’d like to come up with syntax to spawn tasks in a much simpler way than manually setting up tbb::task derivatives, setting reference counts, allocating children, etc. This is a topic for another post.

With this implementation, I was able to build the following task dependency graph, with each edge annotated with its type of dependency.


Cilk Syntax Study

Note: Previous post on this topic:

I’m thinking more about how one can use TBB to write task code that looks similar to existing C code. Of course, people have tried to do this before, and made languages that integrate task parallelism naturally. This article takes a look at these existing solutions, looking for inspiration.


Cilk

Probably the most well-known task-based programming language is Cilk.

Here’s an example Cilk procedure (from the paper above):

thread fib (cont int k, int n)
{
  if (n < 2)
    send_argument (k, n);
  else {
    cont int x, y;
    spawn_next sum (k, ?x, ?y);
    spawn fib (x, n-1);
    spawn fib (y, n-2);
  }
}

thread sum (cont int k, int x, int y)
{
  send_argument (k, x+y);
}

There’s a certain amount of syntactical sugar here:

  • functions that act as tasks have a “thread” qualifier
  • a “spawn” keyword differentiates spawning child tasks from calling functions
  • a “spawn_next” keyword spawns a continuation task (which waits to run until its arguments arrive)
  • “cont”-qualified variables allow passing data from predecessor to successor task.
  • a built-in “send_argument” sets “cont” variables, and spawns tasks with fully satisfied arguments.
  • a built-in “?” operator allows declaring the dependency of a successor on a predecessor.

This is some pretty cute syntax. My main worry is that there might be overhead in the automation of passing around continuation variables. In contrast, TBB also allows creating continuation tasks, but it requires you to pass continuation arguments by reference manually. For example, TBB users can create a continuation task with inputs as member variables, and the addresses of these member variables are used as destination addresses for the children’s computations. See Continuation Passing. Still, the TBB continuation syntax is pretty tedious to type (and probably error-prone), and I wonder if we can do some C++ magic to simplify it.

The “spawn” and “spawn_next” syntax makes spawning tasks look a lot like calling functions, which is consistent with the goals I described in the previous post. The “cont” variables might be possible to implement by wrapping them in a C++ type, which could implement operator= (or a similar function) for the purpose of implementing an equivalent to “send_argument”. Cilk allows cont variables to be passed as arguments that are declared non-cont (such as sum’s x/y above), and automatically unwraps them when the time comes to actually call the function. In a C++ implementation, this automatic unwrapping might be possible to implement with a variadic function template that unwraps its arguments before passing them to the desired function. If that’s too difficult, we can fall back to defining the continuation function with explicit “std::future”-like arguments, requiring using a special function to unwrap them at the usage site.

I think one of the best things about Cilk is that dependencies between tasks are generated implicitly by passing arguments. This is much less work and more maintainable than explicitly declaring dependencies. It does not deal with running two tasks in a given order based on side-effects, like if you want printf() calls in two tasks to always happen in the same order. It might be possible to mitigate this in your design by factoring out side-effects. Alternatively, we could create a useless empty struct and use it to indicate dependencies while reusing the syntax used to pass meaningful data. This is very similar to the tbb::flow::continue_msg object used in TBB flow graph.

By the way, Cilk’s dependencies are implemented by keeping a counter of missing arguments for tasks. When the counter reaches 0, the task can be executed. This is very similar to how TBB tasks implement child-parent dependencies. The awkwardness is that TBB normally only supports a single parent task, so a task with multiple parents need to be handled specially. See General Acyclic Graphs of Tasks.

Cilk Plus

Cilk Plus is an extension of C/C++ available in some compilers. It enables features similar to Cilk in a way that interops with C/C++. However, instead of any continuation passing, it defines a keyword “cilk_sync”, which waits for all child tasks to finish executing before proceeding. This is probably perfect for fork-join parallelism (a tree of tasks), but I’m not sure if it’s possible to implement a general directed acyclic graph with these features.


ISPC

The ISPC language is mainly useful for writing high-performance SIMD code, but it also defines some built-in syntax for task parallelism. Namely, it supports built-in “task”, “launch”, and “sync” keywords. Again, this seems limited to fork-join parallelism.


I’ve seen a few other languages with task-parallelism, but they usually seem to stop at fork/join parallelism, without talking about how continuations or DAGs might be implemented. If you know about a programming language interface that improves on the syntax of Cilk for creating continuations, please tell me about it.


I like Cilk’s ideas for passing data from child task to parent task. Implementing a similar interface in C++ using TBB might allow a pretty natural way of implementing task parallelism, both for fork/join task trees and for more general task DAGs. My main concern is designing an interface that makes common operations easy.

I think that continuation passing might be an elegant way to implement sequential-looking code that actually executes in a DAG-like fashion, which would make it easy to reuse the average programmer’s intuition of single-threaded programming. I want the DAG dependencies to be built naturally and implicitly, similar to how Cilk implements “cont” variables. I want to make it easy to create placeholder “cont” variables that are used only to build dependencies between tasks with side-effects that need to operate in a specific order, similarly to tbb::flow::continue_msg. I also want a way to have a node with multiple parents (to implement general DAGs), and I’d like to minimize the overhead of doing that.

One of my main concerns is how to encapsulate the reference count of tasks. TBB sample programs (in the TBB documentation) all work with reference counts in a pretty low-level way, which may be suitable for when you want to carefully accelerate a specific algorithm, but seems error-prone for code that evolves over time. I hope that this logic can be encapsulated in objects similar to Cilk’s “closure” objects. I think these closure objects could be implemented by creating a derived class from tbb::task, and some C++ syntactical sugar (and maybe macros) could be used to simplify common operations of the task system. From there, I’m worried about the potential overhead of these closures. How can they be allocated efficiently? Will they have a lot of atomic counter overhead? Will their syntax be weird? I’ll have to do some experimentation.