
How To Find Answers About Japanese

Teach a man to fish…

I’ve been studying Japanese as a hobby for a bit more than a year now, and over time I’ve become better and better at finding answers to my own questions about the language. For this blog post, I made a list of some of the resources and methods you can use to find answers to your own questions too.

First of all

Have you done a basic Google search? That is often enough.

Kanji

To find general information about a kanji, search for it on jisho.org.

To find words that use a kanji, search for it on jisho.org using wildcards.

To find basic etymology of kanji, search on wiktionary.org.

Words and Sentences

For English-Japanese or Japanese-English word lookup, search on jisho.org.

You can also find idioms and proverbs on jisho.org.

  • eg: Search for 初心忘るべからず to find its meaning.

To search Japanese dictionaries, I suggest kotobank or weblio.

  • Japanese dictionaries have way more detail. Use them!!!!
  • Seriously!!

Note that many words have self-explanatory definitions from their kanji.

  • eg: 人工 = “human-craft” = artificial
  • You can find the meanings of kanji on jisho.org (as explained earlier.)
  • See Appendix 4 of “The Kodansha Kanji Learner’s Course” for details.

You can search for sentences using jisho.org, ejje.weblio.jp, and Google.

Also, if you Google search for a word/expression, you usually find information about it.

I recommend Yomichan for quick dictionary lookups, especially with a J-J dictionary.

Pronunciation

Dogen’s lectures on Japanese phonetics are awesome. You can find them on his Patreon.

  • The first few lessons are free on YouTube.

You can find clips of native speakers saying words using forvo.com.

JapanesePod101 also has a database of native speakers saying words.

  • You can conveniently access this database through Yomichan.

Note that Japanese dictionaries often include pitch accent information.

  • This is normally written as [N], where N is the mora on which the pitch drops.
  • If it says [0], then the word has no pitch drop (ie. it has the 平板 pitch pattern).
  • eg: Searching for 食べる on weblio shows [2], meaning the pitch drops on the 2nd mora.

You can also see pitch accent information using “Suzuki-kun: Prosody Tutor”.

  • Note that Suzuki-kun is not always right.

Grammar

For basic grammar, have a decent textbook at hand.

The “A Dictionary of Japanese Grammar” books are a good reference in English.

  • The books mainly cover grammar forms, but the appendices are also interesting.

As a formal reference for grammar, I recommend 庭三郎 (niwasaburoo)’s guide.

  • This website is amazingly detailed considering that it is totally free.
  • If the text looks broken, set your web browser’s text encoding to EUC-JP.
  • The author also made an amazing dictionary of verbs and their usage.

The YouTube channel 日本語の森 (nihongo no mori) has many free grammar lectures.

  • Use YouTube’s search feature if you’re looking for a specific grammar form.

Many grammar forms and “grammar words” are also explained in Japanese dictionaries.

Of course, if you Google search a grammar form, you usually find information about it.

Still can’t find an answer?

If you still can’t find an answer, ask on Japanese Language Stack Exchange.

  • Search the site first to see if your question has already been answered.
  • Do not post translation/transcription requests without showing prior work.
  • Avoid asking overly opinion-based questions.
  • Read the rest of the rules too 🙂

Data-Driven C++ Code using Lua

Introduction

In this article, I would like to share a C++/Lua programming technique that I enjoyed using in a past project. This technique allows you to define functions in a data-driven way using Lua scripts. These functions are converted into a domain-specific “bytecode” at initialization time, which can then be evaluated in C++ without using a Lua runtime. By evaluating functions as bytecode in C++, we can simplify and optimize Lua use by minimizing garbage collection and inter-language function calls.

This article is split into two main parts. In the first part, I describe the motivating use case that inspired me to use this technique. In the second part, I describe the details of how the technique was implemented for the aforementioned use case.

Several code samples are taken from a project on GitHub, which you can see in full here: https://github.com/nlguillemot/robdd

Motivating Use Case

The project where I used this technique was an implementation of the Binary Decision Diagram data structure. The input to the program is a Boolean function, and the output of the program is the number of possible solutions that make the Boolean function true. More specifically, I implemented the Multicore Binary Decision Diagram as described by Yuxiong He.

For this project, I wanted to specify the inputted Boolean functions through scripts, mainly because I thought it would be cute and fun. As a final result, I was able to specify Boolean functions as follows:

title = 'test'
display = true

a = input.a
b = input.b
c = input.c

r1 = a*b + a*c + b*c
r2 = b*c

output.r1 = r1
output.r2 = r2

This script is broken down into four parts: the options, the inputs, the Boolean function itself, and its outputs.

The script begins by setting some options. The “title” is the name of the function, and “display = true” tells the program to display the Boolean function as a dot graph. Following these options, some inputs to the Boolean function are declared. Inputs are declared through the “input” table, and those inputs are assigned to variables with shorter names for convenience. After declaring the inputs, they are used in some Boolean operations. Here, the “+” operator means logical OR, and the “*” operator means logical AND. Finally, the results of these Boolean operations are marked as outputs, by putting them in the “output” table.

The result of running the script is the following diagram, which shows all the ways in which the outputs of the Boolean function can evaluate to either false (0) or true (1). By tracing a path from the top of the diagram to the bottom, following dotted lines when a variable is false and solid lines when a variable is true, you can see the result of the Boolean function for a specific assignment of values to its inputs.

Simple Boolean function

By specifying the Boolean function as a Lua script, I was able to programmatically express more complicated Boolean functions. For example, an n-bit ripple carry adder and the n-queens puzzle were specified with generic functions in terms of “n”. I also made a general Lua function to generate Boolean functions for 3-colorings or 4-colorings of graphs, and I was able to conveniently import and reuse this function from other scripts to define Boolean functions for graph-colorings of the Petersen graph, the states of the United States, and the prefectures of Japan. Since I used Lua as a configuration language, creating all these complicated Boolean functions was easy, which wouldn’t have been so convenient if I were using JSON or XML as the input to my program.

If you’re curious about more details of my Binary Decision Diagram project, please consult this report pdf.

Implementation

At a high-level, the implementation is split into two parts:

  • Running the Lua script to generate bytecode.
  • Executing the bytecode.

These two steps are described separately in this section.

Bytecode Generation

The generated bytecode has two main purposes: declaring new variables, and describing Boolean operations with them. Each instruction of this bytecode has an “opcode” that defines what type of instruction it is, and stores references to its inputs and outputs. The bytecode is built by appending an “instruction” to a std::vector every time an operation happens in the Lua script.

The following code sample shows the structure of the instructions used in the bytecode, and the std::vector that stores the instruction bytecode:

struct bdd_instr
{
    enum {
        opcode_newinput,
        opcode_and,
        opcode_or,
        opcode_xor,
        opcode_not
    };

    int opcode;

    union
    {
        struct {
            int operand_newinput_ast_id;
            int operand_newinput_var_id;
            const std::string* operand_newinput_name;
        };

        struct {
            int operand_and_dst_id;
            int operand_and_src1_id;
            int operand_and_src2_id;
        };

        struct {
            int operand_or_dst_id;
            int operand_or_src1_id;
            int operand_or_src2_id;
        };

        struct {
            int operand_xor_dst_id;
            int operand_xor_src1_id;
            int operand_xor_src2_id;
        };

        struct {
            int operand_not_dst_id;
            int operand_not_src_id;
        };
    };
};

std::vector<bdd_instr> g_bdd_instructions;

The inputs and outputs to each instruction are specified by Abstract Syntax Tree IDs (“AST IDs”). Initially, the IDs “0” and “1” are specially assigned to “false” and “true”, respectively. From there, monotonically increasing AST IDs are assigned to newly created variables, and to results of Boolean operations. Since a new AST ID is created for every intermediate result, the bytecode effectively represents a program where the result of every operation is immutable, even if the Lua code that created it might have syntactically reused a variable name.

The monotonically increasing AST ID numbers (and the specially reserved values for “false” and “true”) are represented by the following code:

enum {
    ast_id_false,
    ast_id_true,
    ast_id_user
};

int g_next_ast_id = ast_id_user;

Besides the AST IDs, a different type of monotonically increasing ID is used to uniquely identify the input variables of the Boolean function specified by the script. These variable IDs keep track of the function’s inputs (such as “input.a” and “input.b”), and they are used to record associations between AST IDs and input variables.

The monotonically increasing variable ID counter (starting at 0) is simply declared as follows:

int g_num_variables = 0;

The code that connects the Lua script to these C++ data structures is implemented mostly through Lua metatables: one metatable for the “input” table, and one for each “ast” node of the computation. The implementation of these metatables is described next.

The Input Table

The “input” table has two defined metamethods. The first one is “__newindex”, which is simply defined to forbid assignment to inputs. Obviously, inputs should only be read, not written. It is defined as follows:

int l_input_newindex(lua_State* L)
{
    luaL_error(L, "Cannot write to inputs table");
    return 0;
}

The second metamethod, “__index”, is much more interesting:

int l_input_index(lua_State* L)
{
    int* ast_id = (int*)lua_newuserdata(L, sizeof(int));
    *ast_id = g_next_ast_id;
    g_next_ast_id += 1;

    auto varid2name = g_varid2name.emplace(g_num_variables, luaL_checkstring(L, 2)).first;
    
    g_num_variables += 1;

    bdd_instr new_instr;
    new_instr.opcode = bdd_instr::opcode_newinput;
    new_instr.operand_newinput_ast_id = *ast_id;
    new_instr.operand_newinput_var_id = varid2name->first;
    new_instr.operand_newinput_name = &varid2name->second;
    g_bdd_instructions.push_back(new_instr);

    luaL_newmetatable(L, "ast");
    lua_setmetatable(L, -2);

    lua_pushvalue(L, -2);
    lua_pushvalue(L, -2);
    lua_rawset(L, -5);

    return 1;
}

When the “__index” metamethod is called, a new input variable is created. Each variable created this way is given a monotonically increasing variable ID, and an AST node (with its own AST ID) is created to represent this variable in the computation. From there, an instruction that creates a new input is appended to the bytecode. This “new input” instruction stores the AST ID, the variable ID, and the name of the variable (for display purposes).

The Lua API calls at the end of the function above do a few different things. First, they associate the “ast” metatable with a Lua object that contains the AST ID. After that, “rawset” is used to insert the newly created “ast” object into the “input” table. Finally, the newly created “ast” object is returned.

Note that the step with the “rawset” is crucial. The “__index” metamethod is called only if the key used to access the “input” table does not exist in the table itself. Therefore, inserting the object into the “input” table guarantees that each variable will only be created once. In other words, an expression like “input.a + input.a” will not create two duplicate variables or duplicate AST nodes.

AST Nodes

After a Lua object with the “ast” metatable is returned by accessing the “input” table, this object will be used in various operations. These operations generally take “ast” objects as input, and return a new “ast” object as output. For example, a logical OR (indicated by “+”) appends a corresponding OR instruction to the bytecode, and the result of the operation returns a new Lua “ast” object to the script. This newly returned “ast” object stores its own unique AST ID, and it can be used as input to further operations.

Note that these Boolean operations are deferred: Rather than immediately computing the result of these operations, the inputs and outputs of the operations are recorded, which allows the C++ code to replay the sequence of operations later.

As an example of a Boolean operator, consider the following code that implements logical OR:

int l_or(lua_State* L)
{
    int ast1_id = arg_to_ast(L, 1);
    int ast2_id = arg_to_ast(L, 2);

    int* ast_id = (int*)lua_newuserdata(L, sizeof(int));
    *ast_id = g_next_ast_id;
    g_next_ast_id += 1;

    bdd_instr or_instr;
    or_instr.opcode = bdd_instr::opcode_or;
    or_instr.operand_or_dst_id = *ast_id;
    or_instr.operand_or_src1_id = ast1_id;
    or_instr.operand_or_src2_id = ast2_id;
    g_bdd_instructions.push_back(or_instr);

    luaL_newmetatable(L, "ast");
    lua_setmetatable(L, -2);

    return 1;
}

The code above gets the AST IDs of its inputs, and creates a new AST ID for its output. A logical OR instruction is then added to the bytecode. The output’s AST ID is stored in the Lua object with the “ast” metatable, and this newly created “ast” object is returned from the function.

Note that the “arg_to_ast” function exists to allow Lua’s built-in “true” and “false” values to be used in Boolean operations with “ast” objects. It simply returns the specially pre-allocated IDs if the input is a “true” or “false” value, and otherwise just returns the ast node’s ID, as follows:

int arg_to_ast(lua_State* L, int argidx)
{
    if (lua_isboolean(L, argidx))
        return lua_toboolean(L, argidx) ? ast_id_true : ast_id_false;
    else
        return *(int*)luaL_checkudata(L, argidx, "ast");
}

Putting it together

At the program’s initialization time, the metatables for the “ast” nodes and the “input” table are created, as well as the “input” and “output” tables themselves. After everything is set up, the Lua program itself is executed. This is done as follows:

lua_State* L = luaL_newstate();
luaL_openlibs(L);

luaL_newmetatable(L, "ast");
{
    lua_pushcfunction(L, l_and);
    lua_setfield(L, -2, "__mul");
    
    lua_pushcfunction(L, l_or);
    lua_setfield(L, -2, "__add");

    lua_pushcfunction(L, l_xor);
    lua_setfield(L, -2, "__pow");

    lua_pushcfunction(L, l_not);
    lua_setfield(L, -2, "__unm");
}
lua_pop(L, 1);

luaL_newmetatable(L, "input_mt");
{
    lua_pushcfunction(L, l_input_newindex);
    lua_setfield(L, -2, "__newindex");

    lua_pushcfunction(L, l_input_index);
    lua_setfield(L, -2, "__index");
}
lua_pop(L, 1);

lua_newtable(L);
luaL_newmetatable(L, "input_mt");
lua_setmetatable(L, -2);
lua_setglobal(L, "input");

lua_newtable(L);
lua_setglobal(L, "output");

if (luaL_dofile(L, infile))
{
    printf("%s\n", lua_tostring(L, -1));
    return 1;
}

After the program finishes running, the C++ code iterates over the contents of the “output” table in order to keep track of the AST IDs that correspond to the output variables. I think this code is not interesting enough to explain in detail, but feel free to look at it yourself.

Bytecode Execution

After the bytecode list has been fully recorded, it is passed as input to a “decode” function. At its core, this “decode” function is just a simple loop with a “switch” statement inside, which executes the Boolean operations one by one based on their opcodes. Each operation is implemented by making the equivalent call to the API of the Binary Decision Diagram data structure. Namely, the “new input” instruction is implemented by calling a “make_node” function, and the other instructions are implemented using an “apply” function. The details of these functions are strictly related to the implementation of the binary decision diagram, which is outside the scope of this article.

As a more visual example of what the bytecode looks like, consider the following debug output produced by the simple Boolean function that was shown near the start of this article. In this debug output, each number represents an AST node ID. Therefore, a statement like “7 = 5 OR 6” means that the AST nodes with IDs 5 and 6 are OR-ed together to produce a new AST node with ID 7. Also, a statement like “2 = new 0 (a)” means that a variable called “a” was created with variable ID 0, and stored in AST node ID 2.

2 = new 0 (a)
3 = new 1 (b)
4 = new 2 (c)
5 = 2 AND 3
6 = 2 AND 4
7 = 5 OR 6
8 = 3 AND 4
9 = 7 OR 8
10 = 3 AND 4

When the bytecode has been fully executed, the “decode” function simply outputs the binary decision diagram nodes that correspond to the outputs of the Boolean function, which are used to display the output of the program.

Conclusions and Possibilities

In this article, I discussed the implementation details of a programming technique that allows functions to be specified in Lua and converted into a bytecode that can be conveniently evaluated from C++. Although this article focused on a particular implementation for the sake of explanation, I hope this is also helpful to solve similar problems.

If this technique is implemented for a different purpose, the details of the code will likely change a lot. For example, the bytecode’s representation, the “ast” nodes, and the bytecode interpretation would likely be different. With such tweaks, it may be possible to adapt this technique to various applications.

Potential applications include: SQL-like queries, particle effect specifications, render pass specifications, AI behaviors, and RPG battle system damage calculations.

As a further step, the “bytecode” could be used as a source of optimizations. For example, dead code elimination could be applied by traversing the list of instructions, or independent data structure lookups could be batched. The kinds of optimizations that can be applied depends highly on the application of the technique.

I hope you found this useful or interesting. Thanks for reading!

Simple Live C++ Reloading in Visual Studio

I often work on very small projects that have a focus on getting results quickly.

For example:

  • Trying to make a proof-of-concept of an idea.
  • Implementing an algorithm from a paper or an article.
  • Doing something “artistic” where rapid iteration is key.

In these kinds of projects, the effort is often focused on a few important functions. For example, the function that renders the scene, or the function that transforms some geometry. My projects are usually focused on real-time rendering, so I render a scene at an interactive frame rate, while applying a special algorithm every frame to transform the geometry or render the image.

It’s important for me to be able to design and debug the key function of the project as productively as possible. I want to try new things quickly, and I want to be able to visualize my results to better understand my code. To do this, I take many approaches.

In this article, I describe another approach to improve iteration speed:

Live C++ reloading.

Live C++ reloading means that you can build and reload C++ code without needing to restart your program. This allows you to quickly see the effects of changes to your code, which greatly helps productivity.

Live reloading C++ code can also be useful if your program takes a long time to initially load its data. For example, if you have to load a large 3D scene every time you restart your program, that will probably hinder your productivity. If you can update C++ code without needing to restart your program, then you don’t have to reload the scene to see your changes. Also, it allows you to keep the 3D camera in one place while you change your code, so you can more easily debug a problem that only happens from a specific viewpoint.

Of course, there are many different ways to reload C++ code. In this article, I describe a simple and stupid way to do it, which is easy to set up in a Visual Studio project.

How To Live-Reload C++

For completeness, let’s start from scratch. If you want to add this to an existing project, you can skip some steps.

If you just want to see the completed project, please visit the GitHub page:

https://github.com/nlguillemot/live_reload_test

Step 1. Set up a new solution and main project

Create a new solution and project with Visual Studio.

This project will serve as the “main” project for your application.


Create a main.cpp for this project. We will add the live-reloading code to it later.


Here’s some dummy code for a “real-time” update loop:

#include <cstdio>

#include <Windows.h>

int main()
{
  while (true)
  {
    printf("Hello, world!\n");
    Sleep(1000);
  }
}

If you run it, this should display “Hello, world!” every second in a loop.


Step 2. Set up a project for live-reloaded code

Right click on your solution and add a new project to it.

This project will be used for the code we want to live-reload.


Configure this project to build as a DLL using its properties.


By the way, select “All Configurations” and “All Platforms” when using the Property Pages. Otherwise, your changes might not apply to all the Debug/Release/x86/x64 builds.

Now set a dependency from the main project to the DLL project, to make sure that building the main project also builds the DLL project.


Now add a header to the DLL project that defines the interface to your live-reloaded code.


Here’s some placeholder code for live_reloaded_code.h that you can extend:

#pragma once

#ifdef LIVE_RELOADED_CODE_EXPORTS
#define LIVE_RELOADED_CODE_API __declspec(dllexport)
#else
#define LIVE_RELOADED_CODE_API __declspec(dllimport)
#endif

extern "C" LIVE_RELOADED_CODE_API void live_reloaded_code();

Now repeat the process to add a cpp source file to the DLL project.


Here’s some placeholder code for live_reloaded_code.cpp that you can extend:

#include "live_reloaded_code.h"

#include <cstdio>

void live_reloaded_code()
{
  printf("Hello, Live Reloaded World!\n");
}

Next, add the preprocessor definition LIVE_RELOADED_CODE_EXPORTS to the DLL project. This makes the DLL project “export” the DLL functions, while the main project “imports” them. This isn’t strictly necessary for live-reloading, but it’s the proper way to set up a DLL.


Step 3. The live-reloading mechanism

Go back to the original main.cpp. We now add the code that does the live-reloading. It’s around 100 lines of code, so I won’t paste it in this blog post. Instead, please go to the GitHub repository and copy the code from there.

Here is the link, for your convenience:

https://github.com/nlguillemot/live_reload_test/blob/master/live_reload_test/main.cpp

I tried to write descriptive comments, so hopefully the code is self-explanatory. Here’s a summary of how it works:

The code polls the timestamp of the DLL file that contains the reloadable functions on every update. When the timestamp changes, the DLL is reloaded, and the function pointers into it are refreshed. We load a copy of the DLL rather than the original, because otherwise the DLL would fail to rebuild: its file can’t be overwritten while we are using it. Finally, before calling a live-reloaded function, we cast it to a function pointer type with the correct signature. The sample code uses C++11’s auto and decltype to do this cast, which avoids redundantly restating the function’s type.

Step 4. Using the live-reloading

Now that all the code is ready, we can have fun with live-reloading. To do this, we launch the program “without debugging”, because that allows Visual Studio to build code even if the program is running.

First, make sure the main project is the startup project:


Next, start the project “without debugging”. You can do this by pressing “Ctrl+F5”, or through the Debug menu.

Now, the program should be running, and displaying the message we wrote earlier.

While this program is running, go to live_reloaded_code.cpp, and modify the message in the printf. After that, save the file, and build the project by pressing “Ctrl+Shift+B”, or through the Build menu.

After the build, you should see the output of your program change.

That’s it! Have fun!

Known Issues

It might crash rarely depending on how you have it set up. I don’t know why. It has been reliable enough for me.

D3D12 Shader Live-Reloading

Introduction

I previously wrote about ShaderSet, which was my attempt at making a clean, efficient, and simple shader live-reloading interface for OpenGL 4.

Since ShaderSet was so fun to use, I wanted to have the same thing in my D3D12 coding. As a result, I came up with PipelineSet. This class makes it easy to live-reload shaders, while encapsulating the complexity of compiling pipeline state in a multi-threaded fashion, and also allowing advanced usage to fit your rendering engine’s needs.

Show Me The Code

In summary, the interface looks something like the following. I tried to show how it fits into the design of a component-based renderer.

// Example component of the renderer
class MyRenderComponent
{
  ID3D12RootSignature** mppRS;
  ID3D12PipelineState** mppPSO;

public:
  void Init(IPipelineSet* pPipeSet)
  {
    // set up your PSO desc
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = { ... };

    // associate the compiled shader file names to shader stages
    GraphicsPipelineFiles files;
    // note: scene.vs.cso also contains root signature
    files.RSFile = L"scene.vs.cso";
    files.VSFile = L"scene.vs.cso";
    files.PSFile = L"scene.ps.cso";

    std::tie(mppRS, mppPSO) = pPipeSet->AddPipeline(desc, files);
  }

  void WriteCmds(ID3D12GraphicsCommandList* pCmdList)
  {
    if (!*mppRS || !*mppPSO)
    {
      // not compiled yet, or failed to compile
      return;
    }

    pCmdList->SetGraphicsRootSignature(*mppRS);
    pCmdList->SetPipelineState(*mppPSO);
    // TODO: Set root parameters and etc
    pCmdList->DrawInstanced(...);
  }
};

std::shared_ptr<IPipelineSet> pPipeSet;

void RendererInit()
{
  pPipeSet = IPipelineSet::Create(pDevice, kMaximumFrameLatency);

  // let each component add its pipelines
  foreach (component in renderer)
  {
      component->Init(pPipeSet.get());
  }

  // Kick-off building the pipelines.
  // Can no longer add pipelines after this point.
  HANDLE hBuild = pPipeSet->BuildAllAsync();

  // wait for pipelines to finish building
  if (WaitForSingleObject(hBuild, INFINITE) != WAIT_OBJECT_0) {
    fprintf(stderr, "BuildAllAsync fatal error\n");
    exit(1);
  }
}

void RendererUpdate()
{
  // updates pipelines that have reloaded since last update
  // also garbage-collects unused pipelines after kMaximumFrameLatency updates
  pPipeSet->UpdatePipelines();

  foreach (component in renderer)
  {
    component->WriteCmds(pCmdList);
  }

  SubmitCmds();
}

The big idea is to add pipeline descs to the PipelineSet, and those descs don’t need to specify bytecode for their shader stages. Instead, the names of the compiled shader objects for each shader stage are passed through the “GraphicsPipelineFiles” or “ComputePipelineFiles” struct.

Each call to AddPipeline returns a double pointer to the root signature and pipeline state. This indirection allows the root signature and pipeline state to be swapped out when they are reloaded, and also allows code to deal with the PipelineSet in an abstract manner. (It’s “just a double pointer”, not a PipelineSet-specific class.)

From there, BuildAllAsync() will build all the pipelines in the PipelineSet in a multi-threaded fashion, using the Windows Threadpool. When the returned handle is signaled, that means the compilation has finished.

Finally, you must call UpdatePipelines() every frame. This does two things: First, it updates any pipelines and root signatures that have been reloaded since the last update. Second, it garbage-collects any root signatures and pipelines that are no longer used (ie. because they have been replaced by their newly reloaded versions). This garbage collection is done by deleting the resources only after kMaximumFrameLatency further updates have passed. That delay guarantees that no frames still in flight on the GPU can reference the old pipeline state, since it exceeds the depth of the CPU-to-GPU pipeline.

The Workflow

IPipelineSet is designed to work along with Visual Studio’s built-in HLSL compiler. The big idea is to rebuild your shaders from Visual Studio while your program is running. This works quite conveniently, since Visual Studio’s default behavior for .hlsl files is to compile them to .cso (“compiled shader object”) files that can be loaded directly as bytecode by D3D12.

Normally, Visual Studio will force you to stop debugging if you want to rebuild your solution. However, if you “Start Without Debugging” (or hit Ctrl+F5 instead of just F5), then you can still build while your program is running. From there, you can make changes to your HLSL shaders while your program is running, and hit Ctrl+Shift+B to rebuild them live. The IPipelineSet will then detect a change in your cso files, and live-reload any affected root signatures and pipeline state objects.

To maintain bindings between shaders and C++, I used a so-called “preamble file” in ShaderSet. This preamble is not necessary with HLSL, since we can use its native #include functionality. Using this feature, I create a hlsli file (the HLSL equivalent of a C header) for the shaders I use. For example, if I have two shaders “scene.vs.hlsl” and “scene.ps.hlsl”, I create a third file “scene.rs.hlsli”, which contains two things:

  1. The root signature, as #define SCENE_RS "RootFlags(0), etc"
  2. The root parameter locations, like #define SCENE_CAMERA_CBV_PARAM 0

I include this rs.hlsli file from my vertex/pixel shaders, then put [RootSignature(SCENE_RS)] before their main. From there, I pick registers for buffers/textures/etc using the conventions specified in the root signature.

I also include this rs.hlsli file from my C++ code, which lets me directly refer to the root parameter slots in my code that sets root signature parameters.

As an example, let’s suppose I want to render a 3D model in a typical 3D scene. The vertex shader transforms each vertex by the MVP matrix, and the pixel shader reads from a texture to color the model. I might have a scene.rs.hlsli as follows:

#ifndef SCENE_RS_HLSLI
#define SCENE_RS_HLSLI

#define SCENE_RS \
"RootFlags(ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT)," \
"CBV(b0, visibility=SHADER_VISIBILITY_VERTEX)," \
"DescriptorTable(SRV(t0), visibility=SHADER_VISIBILITY_PIXEL)," \
"StaticSampler(s0, visibility=SHADER_VISIBILITY_PIXEL)"

#define SCENE_RS_MVP_CBV_PARAM 0
#define SCENE_RS_TEX0_DESCRIPTOR_TABLE_PARAM 1

#endif // SCENE_RS_HLSLI

This code defines the root signature for use in HLSL. (See: Specifying Root Signatures in HLSL) The defines at the bottom correspond to root parameter slots, and they match the order of root parameters specified in the root signature string.

The vertex shader scene.vs.hlsl would then be something like:

#include "scene.rs.hlsli"

cbuffer MVPCBV : register(b0) {
    float4x4 MVP;
};

struct VS_INPUT {
    float3 Position : POSITION;
    float2 TexCoord : TEXCOORD;
};

struct VS_OUTPUT {
    float4 Position : SV_Position;
    float2 TexCoord : TEXCOORD;
};

[RootSignature(SCENE_RS)]
VS_OUTPUT VSmain(VS_INPUT input)
{
    VS_OUTPUT output;
    output.Position = mul(float4(input.Position,1.0), MVP);
    output.TexCoord = input.TexCoord;
    return output;
}

Notice that the register b0 is chosen so it matches what was specified in the root signature in scene.rs.hlsli. Also notice the [RootSignature(SCENE_RS)] attribute above the main.

From there, the pixel shader scene.ps.hlsl might look like this:

#include "scene.rs.hlsli"

Texture2D Tex0 : register(t0);
SamplerState Smp0 : register(s0);

struct PS_INPUT {
    float4 Position : SV_Position;
    float2 TexCoord : TEXCOORD;
};

struct PS_OUTPUT {
    float4 Color : SV_Target;
};

[RootSignature(SCENE_RS)]
PS_OUTPUT PSmain(PS_INPUT input)
{
    PS_OUTPUT output;
    output.Color = Tex0.Sample(Smp0, input.TexCoord);
    return output;
}

Again notice that the registers for the texture and sampler match those specified in the root signature, and notice the RootSignature attribute above the main.

Finally, I call this shader from my C++ code. I include the header from the source file of the corresponding renderer component, set the root signature parameters, and make the draw call. It might look something like this:

#include "scene.rs.hlsli"

class SceneRenderer
{
    ID3D12RootSignature** mppRS;
    ID3D12PipelineState** mppPSO;

public:
    void Init(IPipelineSet* pPipeSet)
    {
        D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = { ... };

        GraphicsPipelineFiles files;
        files.RSFile = L"scene.vs.cso"; // root signature is embedded in the VS bytecode
        files.VSFile = L"scene.vs.cso";
        files.PSFile = L"scene.ps.cso";

        std::tie(mppRS, mppPSO) = pPipeSet->AddPipeline(desc, files);
    }

    void WriteCmds(
        BufferAllocator* pPerFrameAlloc,
        ID3D12GraphicsCommandList* pCmdList)
    {
        if (!*mppRS || !*mppPSO)
        {
            // not compiled yet, or failed to compile
            return;
        }

        float4x4* pCPUMVP;
        D3D12_GPU_VIRTUAL_ADDRESS pGPUMVP;
        std::tie(pCPUMVP, pGPUMVP) = pPerFrameAlloc->allocate(
            sizeof(float4x4), D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT);

        *pCPUMVP = MVP; 

        pCmdList->SetGraphicsRootSignature(*mppRS);
        pCmdList->SetPipelineState(*mppPSO);

        pCmdList->SetGraphicsRootConstantBufferView(
            SCENE_RS_MVP_CBV_PARAM, pGPUMVP);

        pCmdList->SetGraphicsRootDescriptorTable(
            SCENE_RS_TEX0_DESCRIPTOR_TABLE_PARAM, Tex0SRV_GPU);

        /* TODO: Set other GPU state */
        pCmdList->DrawIndexedInstanced(...);
    }
};

There are a few things going on here that aren’t strictly the topic of this article, but I’ll explain them anyway, because I think they’re very useful for writing D3D12 code.

I use a big upload buffer each frame to write all my CBV allocations to; that’s the purpose of pPerFrameAlloc. Its allocate() function returns both a CPU (mapped) pointer and the corresponding GPU virtual address for the allocation, which allows me to write to the allocation from the CPU, then pass the GPU VA while writing commands.

In this case, the per-frame allocation is an upload buffer, so I don’t need to explicitly copy from CPU to GPU (the shader will just read from host memory.) An alternate implementation could use an additional allocator for a default heap, and explicitly make a copy from the upload heap to the default heap.

The per-frame allocator is a simple lock-free linear allocator, so I can use it to make allocations from multiple threads, if I’m recording commands from multiple threads.
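As a rough sketch of such an allocator (an illustration, not the article’s actual code; the LinearAllocator name and layout are made up), the lock-free bump allocation could look like this, with the D3D12 upload resource replaced by a plain byte array and a fake GPU base address so the logic stands alone:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical lock-free linear allocator over a pre-mapped upload buffer.
// Threads race on an atomic offset; each allocation claims an aligned range
// with a compare-and-swap, so recording threads never block each other.
class LinearAllocator {
    uint8_t* mBase;     // CPU-mapped base pointer of the upload buffer
    uint64_t mGpuBase;  // GPU virtual address of the same buffer
    size_t mCapacity;
    std::atomic<size_t> mOffset{0};
public:
    LinearAllocator(uint8_t* base, uint64_t gpuBase, size_t capacity)
        : mBase(base), mGpuBase(gpuBase), mCapacity(capacity) {}

    // Returns {CPU pointer, GPU VA}, or {nullptr, 0} if the buffer is full.
    std::pair<uint8_t*, uint64_t> allocate(size_t size, size_t alignment) {
        for (;;) {
            size_t old = mOffset.load(std::memory_order_relaxed);
            size_t aligned = (old + alignment - 1) & ~(alignment - 1);
            if (aligned + size > mCapacity)
                return { nullptr, 0 };
            // Claim [aligned, aligned + size); retry if another thread raced us.
            if (mOffset.compare_exchange_weak(old, aligned + size))
                return { mBase + aligned, mGpuBase + aligned };
        }
    }

    // Call once per frame, after the GPU has finished reading the buffer.
    void reset() { mOffset.store(0); }
};
```

The constant buffer alignment passed in would be D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT (256), as in the WriteCmds code above.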

I could do something similar to the per-frame allocator for the descriptors behind Tex0SRV_GPU, or I could create the descriptor once up-front in Init(). It’s really up to you.

When the time comes to finally specify the root parameters, I do it using the defines from the included scene.rs.hlsli, such as SCENE_RS_MVP_CBV_PARAM. This makes sure my C++ code stays synchronized with the HLSL code.

In Summary

IPipelineSet implements D3D12 shader live-reloading. It encapsulates the concurrent code used to reload shaders and the parallel code that accelerates PSO compilation through multi-threading. It integrates with existing code without that code needing to be aware of PipelineSet (it’s “just a double-pointer”), and garbage collection is handled efficiently and automatically. Finally, PipelineSet is designed around a Visual Studio workflow that makes it easy to rebuild shaders while your program is running, and allows you to easily share resource bindings between HLSL and C++.

There are a bunch more advanced features. For example, it’s possible to supply an externally created root signature or shader bytecode, and it’s possible to “steal” signature/pipeline objects from the live-reloader by manipulating the reference count. See the comments in pipelineset.h for details.

You can download PipelineSet from GitHub: https://github.com/nlguillemot/PipelineSet

You can integrate it into your codebase by just adding pipelineset.h and pipelineset.cpp to your project. It should “just work”, assuming you already have D3D12 and DXGI linked up.

Comments, critique, pull requests, all welcome.

Intel GPU Assembly with PIX Beta

This is a short tutorial on how you can disassemble your HLSL shaders into Intel GPU (aka Gen) assembly using the newly released PIX tool.

I suspect many of these steps will be simpler in the future. If you’re reading this guide long after its publishing date, you can probably ignore most of the steps.

Step 1: Installing PIX

Download the PIX Beta: https://blogs.msdn.microsoft.com/pix/download/

Should be nothing surprising here.

Step 2: Installing Beta GPU Drivers

Install Beta Intel GPU drivers.

First, you must disable automatic driver updates, otherwise Windows will automatically uninstall your Beta drivers. There are instructions on how to do this here: http://superuser.com/questions/964475/how-do-i-stop-windows-10-from-updating-my-graphics-driver

  • Side note: Disabling driver updates seems to be a really convoluted process. Windows is relentless in trying to stop me from using Beta drivers. It’s annoying, but hopefully this won’t be a problem in the future when mainstream drivers have the required features for PIX.

Next, uninstall your current graphics drivers. Open up the Device Manager, go under “Display adapters”, right click your drivers and choose “Uninstall”.

Next, install the Beta graphics drivers.

  1. Download the Beta Intel GPU drivers from here: https://downloadcenter.intel.com/product/80939/Graphics-Drivers
  2. Download the zip version of the drivers (not the exe), and unzip them.
  3. Back in the Device Manager, click “Action>Add legacy hardware”.
  4. Choose “Install the hardware that I manually select from a list (Advanced)”.
  5. Choose “Display adapters”.
  6. Click “Have Disk”.
  7. In the “Install From Disk” window, click “Browse”.
  8. Pick the inf file (eg: “igdlh64.inf”) from the “Graphics” folder of the drivers.
  9. Click “OK”, then pick the GPU model that corresponds to your computer.
  10. Keep clicking Next until it’s done installing.

Step 3: Disassembly in PIX

Run your program with PIX by setting the executable path and working directory. Unless you have a UWP app, you probably want to “Launch Win32”.

launch.PNG

From there, click “Launch”.

Next, press print screen, or click the camera button in PIX, to capture a frame of rendering. Double-click on the small picture of your capture that appears in PIX.

If you see an error popup here, it’s probably because your drivers are not new enough (or, more likely, because Windows automatically reverted your Beta update behind your back.)

Once your capture is open, click on the “Pipeline” tab. Then, click the “Click here to start analysis” text that appears in the window in the bottom half of PIX.

clickhere.PNG

Next, in the Events window at the top left (seen in the previous screenshot), click on the “Dispatch” or “Draw” event whose disassembled shaders you want to see. Click on the shader stage you want in the bottom left of the window, then click on “Disassembly” (as below). And voila! Gen assembly for your shader!

disasm.PNG

D3D12 Multi-Adapter Survey & Thoughts


Introduction

Direct3D 12 opens up a lot of potential by making it possible to write GPU programs that make use of multiple GPUs. For example, it’s possible to write programs that distribute work among multiple GPUs from linked GPUs (eg: NVIDIA SLI or AMD Crossfire), or even between GPUs from different hardware vendors.

There are many ways to make use of these multi-adapter features, but it’s not obvious yet (at least to me) how to best make use of it. In theory, we should try to make full use of all available hardware on a given computer, but there are difficult problems to solve along the way. For example:

  • How can we schedule GPU tasks to minimize communication overhead between different GPUs?
  • How can we distribute tasks among hardware that vary in performance?
  • How can we use special hardware features? eg: “free” CPU-GPU memory sharing on integrated GPUs.

D3D12 Multi-Adapter Features Overview

To better support multiple GPUs, Direct3D 12 brings two main features:

  1. Cross-adapter memory, which allows one GPU to access the memory of another GPU.
  2. Cross-adapter fences, which allow one GPU to synchronize its execution with another GPU.

Working with multiple GPUs in D3D12 is done explicitly, meaning that sharing memory and synchronizing GPUs must be taken into consideration by the rendering engine, as opposed to being “automagically” done inside GPU drivers. This should lead to more efficient use of multiple GPUs. Furthermore, integrating shared memory and fences into the API allows you to avoid making round-trips to the CPU to interface between GPUs.

For a nice quick illustrated guide to the features described above, I recommend the following article by Nicolas Langley: Multi-Adapter Support in DirectX 12.

D3D12 supports two classes of multi-adapter setups:

  1. Linked Display Adapters (LDA) refers to linked GPUs (eg: NVIDIA SLI/AMD Crossfire). They are exposed as a single ID3D12Device with multiple “nodes”. D3D12 APIs allow you to specify a bitset of nodes when the time comes to specify which node to use, or which nodes should share a resource.
  2. Multiple Display Adapters (MDA) refers to multiple different GPUs installed on the same system. For example, you might have both an integrated GPU and a discrete GPU in the same computer, or you might have two discrete GPUs from different vendors. In this scenario, you have a different ID3D12Device for each adapter.

Another neat detail of D3D12’s multi-adapter features is Standard Swizzle, which allows GPU and CPU to share swizzled textures using a convention on the swizzled format.

Central to multi-adapter code is the fact that each GPU node has its own set of command queues. From the perspective of D3D12, each GPU has a rendering engine, a compute engine, and a copy engine, and these engines are fed through command queues. Using multiple command queues can help the GPU schedule independent work, especially in the case of copy or compute queues. It’s also possible to tweak the priority of each command queue, which makes it possible to implement background tasks.

Use-Cases for Multi-Adapter

One has to wonder who can afford the luxury of owning multiple GPUs in one computer. Considering that multi-adapter wasn’t properly supported before D3D12, it was probably barely worth thinking about, other than scenarios explicitly supported by SLI/Crossfire. In this section, I’ll try to enumerate some scenarios where the user might have multiple GPUs.

“Enthusiast” users with multiple GPUs:

  • Linked SLI/Crossfire adapters.
  • Heterogeneous discrete GPUs.
  • Integrated + discrete GPU.

“Professional” users:

  • Tools for 3D artists with fancy computers.
  • High-powered real-time computer vision equipment.

“Datacenter” users:

  • GPU-accelerated machine-learning.
  • Engineering/physics simulations (fluids, particles, erosion…)

Another potentially interesting idea is to integrate CPU compute work into DirectX by using the WARP (software renderer) adapter. It seems a bit unfortunate to tie everyday CPU work into a graphics API. I guess it might lead to better CPU-GPU interop, or it might open opportunities to experiment with moving work between CPU and GPU and observe the performance differences. This is similar to using OpenCL to implement compute languages on the CPU.

Multi-adapter Designs

There are different ways to integrate multi-adapter into a DirectX program. Let’s consider some options.

Multi-GPU Pipelining

Pipelining with multiple GPUs comes in different flavors. For example, Alternate Frame Rendering (AFR) consists of alternating between GPUs with each frame of rendering, which allows multiple frames to be processed on-the-fly simultaneously. This kind of approach generally requires the scene you’re rendering to be duplicated on all GPUs, and requires outputs of one frame’s GPU to be copied to the inputs to the next frame’s GPU.
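The bookkeeping for AFR itself is simple. As a minimal sketch (the AfrScheduler name is made up, and a real renderer would also manage the cross-GPU copies and fences mentioned above):

```cpp
#include <cstdint>

// Hypothetical AFR bookkeeping for gpuCount linked nodes: each frame goes
// to the next GPU in round-robin order, and the app tracks which GPU
// produced the previous frame, since that's where inter-frame dependencies
// (e.g. last frame's output textures) live and must be copied from.
struct AfrScheduler {
    int gpuCount;

    int GpuForFrame(uint64_t frameIndex) const {
        return (int)(frameIndex % gpuCount);
    }
    // The GPU that rendered the previous frame (frame 0 has no predecessor,
    // so we just report its own GPU).
    int PreviousFrameGpu(uint64_t frameIndex) const {
        return GpuForFrame(frameIndex == 0 ? 0 : frameIndex - 1);
    }
};
```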

AFR can unfortunately limit your design. For example, dependencies between frames can be difficult to implement efficiently. To solve this problem, instead of pipelining at the granularity of frames with AFR, one might pipeline within a frame. For example, half of the frame can be processed on one GPU, then finished on another GPU. In theory, these pipelining approaches should increase throughput, while possibly increasing latency due to the extra overhead of copying data between GPUs (between stages of the pipeline.) For this reason, we have to be careful about the overhead of copies.

A great overview of multi-adapter, AFR, and frame pipelining was given in Juha Sjöholm’s GDC 2016 talk: Explicit Multi GPU Programming with DirectX 12.

Task-Parallelism

With a good data-parallel division of our work, we can in theory split our work into tasks, then distribute them among GPUs. However, there’s fundamentally a big difference in the ideal granularity of parallelism between low-latency (real-time) users and high-throughput (offline) users. For example, work that can be done in parallel within one frame is not always worth running on multiple GPUs, since the overhead of communication might nullify the gains. In general:

  • Real-time programs don’t have much choice outside of parallelism within one frame (or a few frames), since they want to minimize latency, and they can’t predict future user controller inputs anyways.
  • Offline programs might know the entire domain of inputs ahead of time, so they can arbitrarily parallelize without needing to use parallelism within one frame.

If our goal is to render 100 frames of video for a 3D movie, we could split those 100 frames among the available GPUs and process them in parallel. Similarly, if we want to run a machine learning classification algorithm on 1000 images, we can probably also split those arbitrarily between GPUs. We can even deal with the varying performance of the available GPUs relatively easily: put the 1000 tasks in a queue, and let the GPUs pop and process them as fast as they can, perhaps using a work-stealing scheduler if you want to get fancy with load-balancing.
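A minimal sketch of that shared-queue idea, with plain threads standing in for GPUs of varying speed (illustrative only; the DistributeTasks name and the simulated per-task costs are made up):

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical task queue shared by "GPUs" of varying speed. The queue is
// just an atomic counter of the next task index; each worker pops tasks
// until none remain, so faster workers naturally process more of them.
std::vector<int> DistributeTasks(int numTasks, const std::vector<int>& costPerTaskUs)
{
    std::atomic<int> next{0};
    std::vector<int> tasksDone(costPerTaskUs.size(), 0);
    std::vector<std::thread> workers;
    for (size_t w = 0; w < costPerTaskUs.size(); ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                int task = next.fetch_add(1); // pop the next task index
                if (task >= numTasks)
                    return;
                // Simulate this worker's processing time for one task.
                std::this_thread::sleep_for(
                    std::chrono::microseconds(costPerTaskUs[w]));
                ++tasksDone[w];
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return tasksDone; // how many tasks each worker completed
}
```

The same shape works whether the "tasks" are movie frames or images to classify; a work-stealing scheduler refines this by giving each worker its own deque and letting idle workers steal.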

In the case of a real-time application, we’re motivated to use parallelism within each frame to bring content to the user’s face as fast as possible. To avoid the overhead of communication, we might be motivated to split work into coarse chunks. Allow me to elaborate.

Coarse Tasks

To minimize the overhead of communication between GPUs, we should try to run large independent portions of the task graph on the same GPU. Parts of the task graph that run serially are an obvious candidate for running on only one GPU, although you may be able to pipeline those parts.

One way to separate an engine into coarse tasks is to split them based on their purpose. For example, you might separate your project into a GUI rendering component, a fluid simulation component, a skinning component, a shadow mapping component, and a scene rendering component. From there, you can roughly allocate each component to a GPU. Splitting code among high-level components seems like an obvious solution, but I’m worried that we’ll run into problems similar to those of the “system-on-a-thread” design for multi-threading.

With such a coarse separation of components, we have to be careful to allocate work among GPUs in a balanced way. If we split work uniformly among GPUs with varying capabilities, then we can easily be bottlenecked by the weakest GPU. Therefore, we might want to again put our tasks in a queue and distribute them among GPUs as they become available. In theory, we can further mitigate this problem with a fork/join approach. For example, if a GPU splits one of its tasks in half, then a more powerful GPU can pick up the second half of the problem while the first half is still being processed by the first GPU. This approach might work best on linked adapters, since they can theoretically share memory more efficiently.

An interesting approach to load-balancing can be found in GPU Pro 7 chapter 5.4: “Semi-static Load Balancing for Low-Latency Ray Tracing on Heterogeneous Multiple GPUs”. It works by roughly splitting the framebuffer among GPUs to ray trace a scene, and alters the distribution of the split dynamically based on results of previous frames.
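The core of that semi-static scheme can be sketched as a tiny rebalancing function (my own illustration of the idea, not the chapter’s code): given last frame’s timings, pick the new split so both GPUs are estimated to take equal time.

```cpp
// Hypothetical rebalancing step: two GPUs each render a horizontal slice of
// the framebuffer, and the split moves each frame so that, based on the
// previous frame's timings, both slices are estimated to take equal time.
// Returns the new number of rows assigned to GPU 0.
int RebalanceSplit(int totalRows, int rowsGpu0, double msGpu0, double msGpu1)
{
    // Per-row cost on each GPU, estimated from last frame.
    double costPerRow0 = msGpu0 / rowsGpu0;
    double costPerRow1 = msGpu1 / (totalRows - rowsGpu0);
    // Solve rows0 * cost0 == (totalRows - rows0) * cost1 for rows0.
    double rows0 = totalRows * costPerRow1 / (costPerRow0 + costPerRow1);
    if (rows0 < 1.0) rows0 = 1.0;
    if (rows0 > totalRows - 1.0) rows0 = totalRows - 1.0;
    return (int)(rows0 + 0.5);
}
```

In practice one would smooth the timings over a few frames to avoid oscillation when per-frame measurements are noisy.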

One complication of distributing tasks among GPUs is that we might want to run a task on the same GPU each frame, to avoid having to copy the task’s input state to run it on a different GPU. I’m not sure there’s an obvious solution to this problem; maybe it’s just something to integrate into a heuristic cost model for the scheduler.

A Note On Power

One quite difficult problem with multi-adapter has to do with power. If a GPU is not used for a relatively short period of time, it’ll start clocking itself down to save power. In other words, if you have a GPU that runs a task each frame and then waits for another GPU to finish, it’s possible for that first GPU to start shutting itself down. This becomes a problem on the next frame, since the GPU has to spin up once again, which takes a non-trivial amount of time. The end result is that the code runs slower with multi-adapter than it does with a single adapter, despite even the most obvious opportunities for parallelism.

One might suggest forcing the GPU to keep running at full power to solve this problem, but it’s not so simple, since drawing power from idle cores takes away power from the cores that need it. This is especially an issue on integrated GPUs, since the GPU would steal juice from the CPU, despite the CPU probably needing that power to run non-GPU code during the rest of the frame. Of course, power-hungry applications are also generally not welcome on battery-operated devices like laptops or phones.

Does this problem have a solution? Hard to say! As a guideline, it might be important to use GPUs only if you plan to utilize them well, and be careful about CPU-GPU tradeoffs on integrated GPUs. We might need help from hardware and OS people to figure this out properly.

NUMA-aware Task Scheduling

An important challenge of multi-adapter code is that memory allocations have an affinity to a given processor, which means the cost of memory access increases dramatically when the memory does not belong to the processor accessing it. This scenario is known as “non-uniform memory access”, aka NUMA. It’s a common problem in heterogeneous and distributed systems, and also a well-known problem in servers that have more CPU cores than a single motherboard socket can support, resulting in multi-socket CPU configurations where each socket/CPU has a set of RAM chips closer to it than the others.

There exist some strategies to deal with scheduling tasks in a NUMA-aware manner. I’ll list some from the literature.

Deferred allocation is a way to guarantee that output memory is local to the NUMA node. It simply consists of allocating a task’s output memory only when the task is scheduled, which lets the processor it was scheduled on perform the allocation right then and there in its local memory, thus guaranteeing locality.

Work-pushing is a method to select the worker to which a task should be sent; in other words, it’s the opposite of work-stealing. The target worker is picked based on a heuristic. For example, the heuristic might try to push tasks to the node that owns the task’s inputs, or to the node that owns the task’s outputs, or it might combine ownership of inputs and outputs in its decision.

Work-stealing can also be tweaked for NUMA purposes, by altering the work-stealing algorithm to steal work from nearby NUMA nodes first. This might apply naturally to the case of sharing work between linked adapters.
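As a small sketch of a work-pushing heuristic (illustrative; the names are made up), one could pick the target node that already owns the most input bytes:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical work-pushing heuristic: each of a task's inputs lives on some
// NUMA node (here, a GPU); push the task to the node that already owns the
// most input bytes, minimizing how much data must cross between nodes.
struct TaskInput { int ownerNode; size_t bytes; };

int PickTargetNode(int numNodes, const std::vector<TaskInput>& inputs)
{
    std::vector<size_t> bytesOnNode(numNodes, 0);
    for (const TaskInput& in : inputs)
        bytesOnNode[in.ownerNode] += in.bytes;
    int best = 0;
    for (int n = 1; n < numNodes; ++n)
        if (bytesOnNode[n] > bytesOnNode[best])
            best = n;
    return best;
}
```

A fancier version could also weigh output ownership, current queue depth per node, and link bandwidth between nodes.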

Conclusion

Direct3D 12 enables much more fine-grained control over the use of multiple GPUs, whether through linked adapters or through heterogeneous hardware components. Enthusiast gamers, professional users, and GPU compute datacenters stand to benefit from good use of this tech, which motivated a search for designs that use multi-adapter effectively. On this front, we discussed Alternate-Frame-Rendering (AFR), and discussed the design of more general task-parallel systems. The design of a task-parallel engine depends a lot on your use case, and there are many unsolved and non-obvious areas of this design space. For now, we can draw inspiration from existing research on NUMA systems and think about how it applies to the design of our GPU programs.

Using cont with tbb::task_group

Note: Previous post on this topic: https://nlguillemot.wordpress.com/2017/01/12/tbb-task-dag-with-deferred-successors/

In the last post, I showed a proof of concept to implement a “cont” object that allows creating dependencies between TBB tasks in a dynamic and deferred way. What I mean by “dynamic” is that successors can be added at runtime (instead of requiring the task graph to be specified statically). What I mean by “deferred” is that the successor can be added even after the predecessor was created and spawned, in contrast to interfaces where successors need to be created first and hooked into their predecessor secondly.

The Goal

The goal of this post is to create an interface for cont that abstracts TBB details away from everyday task code. TBB’s task interface is low-level and verbose, so I wanted something productive and concise on top of it.

Extending tbb::task_group

tbb::task_group is a pretty easy way to spawn a bunch of tasks and let them run. An example use is as follows:

int Fib(int n) {
    if( n<2 ) {
        return n;
    } else {
        int x, y;
        task_group g;
        g.run([&]{x=Fib(n-1);}); // spawn a task
        g.run([&]{y=Fib(n-2);}); // spawn another task
        g.wait();                // wait for both tasks to complete
        return x+y;
    }
}

I wanted to reuse this interface, but also be able to spawn tasks that depend on conts. To do this, I made a derived class from task_group called cont_task_group. It supports the following additional syntax:

cont<int> c1, c2;
cont_task_group g;
g.run([&]{ foo(&c1); });
g.run([&]{ bar(&c2); });
g.with(c1, c2).run([&] { baz(*c1, *c2); });
g.wait();

The with(c...).run(f) syntax spawns a task to run the function f only when all conts in c... are set.
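One way such a mechanism can be implemented (this is an illustration, not the actual cont_task_group internals) is with an atomic countdown: the deferred task stores the number of unset conts, and whichever cont is set last fires the function.

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Hypothetical internals: the deferred task stores how many conts it still
// waits on; each cont decrements the counter when it becomes ready, and
// whoever performs the final decrement runs the function. (A real version
// would spawn a TBB task instead of calling fn() inline.)
struct PendingTask {
    std::atomic<int> remaining;
    std::function<void()> fn;

    PendingTask(int numDependencies, std::function<void()> f)
        : remaining(numDependencies), fn(std::move(f)) {}

    // Called exactly once per dependency when that dependency is set.
    void DependencySet() {
        if (remaining.fetch_sub(1) == 1)
            fn(); // last dependency just arrived: the task is ready
    }
};
```

The fetch_sub makes this safe even when conts are set concurrently from different worker threads, since exactly one caller observes the counter reaching zero.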

A full example is as follows:

void TaskA(cont<int>* c, int x)
{
    tbb::task_group g;
    g.run([&] {
        // A Subtask 1
        c->emplace(1337);
        c->set_ready();
    });
    g.run_and_wait([&] {
        // A Subtask 2
    });
}

void TaskB(int y)
{
}

void TaskC(int z)
{
    std::stringstream ss;
    ss << "TaskC received " << z << "\n";
    std::cout << ss.rdbuf();
}

int main()
{
    cont<int> c;
    cont_task_group g;
    g.run([&] { TaskA(&c, 3); });
    g.run([&] { TaskB(2); });
    g.with(c).run([&] { TaskC(*c); });
    g.wait();
}

This builds the following task dependency graph:

task graph

Sample implementation here: GitHub