Rasterizer Notes

Non peer-reviewed personal notes on rasterizer implementation.

Overall Rasterizer Design

These notes are based on the rasterizer designs discussed in “Rasterization on Larrabee” by Michael Abrash [1], and Fabian Giesen’s “Optimizing Software Occlusion Culling” [2] and “A Trip through the Graphics Pipeline 2011” [3]. Consider these materials are required pre-reading for this stuff to make sense. Other very useful materials are Tomas Akenine-Möller’s “Some notes on Graphics Hardware” [4], Russ Brown’s “Barycentric  Coordinates  as Interpolants” [5], Mik Bry’s “Coping with Fixed Point” [6], and Kurt Akeley and Pat Hanrahan’s “Real-Time Graphics Architecture” course slides [7].

Fixed Point

Two main fixed point formats are useful here:

• 16.8: Used for all window coordinates, and for edge equations (since those are a product of window coordinates).
• The product of two 16.8 requires 48 bits, which means you can safely do them with 64 bits.
• Various tricks are used to get away with storing the product of two 16.8 with 32 bits (eg: guaranteeing small enough values.)
• 16.16 aka s15.16: Generally good format for doing math like modeling and viewing transformations. Clip space coordinates should be in this format. This is the format used by OpenGL’s $\texttt{OES\_fixed\_point}$ extension, which makes it somewhat industry-standard.
• Note the “s” means “sign bit”, but the number itself is still using twos’ complement as usual. It’s basically just a reminder that you can’t safely shift out that last bit, otherwise sign extension won’t work when eg. doing a right shift.
• Sometimes you end up with weird intermediate formats. For example, you might take two 16.8s and multiply them to get a 32.16 value (which might be safely stored in a 32-bit integer anyways by being careful about the bounds of your inputs.)
• Those high-precision intermediate values generally get right-shifted back into the 2 standard representations above when the math is done.
• Beware of getting the rounding right, it’s a good idea to make reusable functions so you don’t have to remember to manually do proper rounding every time you do fixed point math.
• However, be aware that special rounding methods are used in rasterization, such as the top-left rule (also discussed in this article.)

Clipping

• For clipping edges against the near plane with a $0 < z < w$ clip space, compute the vertex intersecting the near plane as follows:
• $a = \dfrac{z_1}{(z_1 - z_2)}$
• $x = (1 - a) * x_1 + a * x_2$
• $y = (1 - a) * y_1 + a * y_2$
• $z = 0$
• $w = (1 - a) * w_1 + a * w_2$
• For the far plane, as follows (note: assuming vertex 2 is the far clipped one):
• $a = \dfrac{(z_1 - w_1)}{(z_1 - w_1) - (z_2 - w_2)}$
• $x = (1 - a) * x_1 + a * x_2$
• $y = (1 - a) * y_1 + a * y_2$
• $\textbf{w} = (1 - a) * w_1 + a * w_2$
• $\textbf{z} = w - 0.000...0001$
• subtract smallest fixed point difference from lerped w.
• this ensures $\dfrac{z}{w} < 1$.
• Not when computing $\dfrac{z}{w}$ you should still clamp the result to the valid range, since rounding can make it go out of bounds again.
• See “Clipping Using Homogeneous Coordinates” (Blinn, 1978) for derivation of clip plane intersections.
• If 3 vertices are behind the near plane, drop the whole triangle.
• If 2 vertices are behind the near plane, compute the 2 near plane intersections and output that triangle instead.
• If 1 vertex is behind the near plane, compute the 2 near plane intersections and output the 2 triangulated visible triangles. (eg: recursive call or loop the whole function to rasterize the second triangle).

Tiles

• Each tile is morton-code swizzled. The framebuffer’s tiles themselves are row major, but their inside is swizzled.
• It’s ugly to make operations on morton-code work for scalar code, but it’s elegant with SIMD code if you pre-prepare some eg. 2×2 stamps. For example you can process a 2×2 quad in parallel while simultaneously avoiding having to decode the morton-code back into individual x/y components as would be necessary for scalar execution.

Edge Equations

• Use round-to-infinity or round-to-negative-infinity for the initial evaluation of the edge equations. We round to negative infinity because it gives us a more conservative coverage.
• Evaluate the edge equation at pixel (0.5, 0.5) relative to the top-left of the tile.
• This means you’ll test for coverage at the center of each pixel, which is consistent with DirectX rules.
• Offset the result of that edge equation with every step of the rasterizer to evaluate other pixels’ centers.

Small Triangles

• Small triangles can cover up to 2×2 tiles (worst case: triangle in the middle of 2×2 tiles).
• Some of these “tiles” may be outside the framebuffer, if a triangle is at the edge of the screen. Clamp judiciously.
• Small triangles can also cover 1×1 tiles, 2×1 tiles, or 1×2 tiles.
• In all cases (including 2×2), you can set the bottom right tile’s top-left corner to be (0,0) and translate your vertices relative to that.
• If your tile sizes are <= 128 by 128, then the individual multiplicands of your edge equation cross products for small triangles should look like:
• $\texttt{0x0000YXXX}$ (positive numbers)
• or…
• $\texttt{0xFFFFZXXX}$ (negative numbers)
• where:
• X is a hex digit
• Y is a hex digit with a MSB of 0 (twos’ complement “sign bit”).
• Z is a hex digit with a MSB of 1 (twos’ complement “sign bit”).
• This situation means you can multiply those two numbers in 32 bits without overflow.
• Since small triangles are at most 128 in size, their area has a limited range. This helps us in computing $\dfrac{1}{2\Delta Area}$, which is required for barycentric interpolation. I’ve explored 2 ways to compute this recriprocal, one using fixed point and one using floating point.

Small Triangle Barycentrics – Fixed Point

• In 16.8 fixed point, the maximum value for $2\Delta Area$ is $\texttt{0x7F.FF * 0x7F.FF}$ (similar to 127.99999…. * 127.99999….).
• The result of the above expression, multiplying two 8.8 numbers, requires at most 16.16 bits, which conveniently fits in a 32 bit integer.
• However, notice that both those 8.8 numbers have their MSB set to 0, to indicate a positive signed value. Thus, the result of that multiplication is $\texttt{0x3FFF.0001}$, which, when converted to 16.8, gets rounded down to $\texttt{0x3FFF.00}$. Notice the first two MSBs are 0, which represents the bits allocated for the two sign bits of the multiplicands.
• If a triangle has zero area, we discard it. If a triangle has negative area, that means it’s back-facing, so we flip two of its vertices (and negate its area) to make it front-facing again. Now, in all cases, we only have to consider triangles with positive area.
• When computing $\dfrac{1}{2\Delta Area}$, the result will be 0 if not enough precision is used. This happens if the triangle area is too big, making the fraction too small to represent. In that case, all barycentric interpolation will always give a result of 0, since barycentrics are evaluated as $bary_i * \dfrac{1}{2\Delta Area} =$. Therefore, we should perform that division with enough precision that the result does not go to zero for the range of possible triangle areas.
• To do this, we can compute $\texttt{0x400000 / triarea2}$. Since the numerator is always bigger than the denominator, and since the denominator is always greater than zero, the result of this expression ranges from 1 (with $\texttt{triarea2 == 0x3FFF00}$), to $\texttt{0x400000}$ (with $\texttt{triarea2 == 1}$). This means we always have enough precision to store the result in 24 bits, while leaving the sign bit intact. To be specific, this is a signed 0.24 fixed point encoding.
• Problem: look at a plot of $\dfrac{1}{x}$:
• As you can see, a very small range of values is allocated to triangles with an area greater than 1 (which represents the large majority of triangles).
• Because of this, you get dreadfully poor precision if you try to use this value for barycentric coordinates, and if you try to invert the graph, then you’ll still get bad precision for small triangles instead. Could opt to just ignore small triangles, aka “slivers”, since they don’t have much detail anyways…
• Inherently it seems we’re trying to map an exponentially varying function to a uniform precision, which is bound not to work.
• One solution I haven’t tried is to treat <1 and >1 differently, since they’re both linear-ish after the inflection point.

Small Triangle Barycentrics – Denormalized Floating Point

• Since a tile can only be as big as 128×128, that means the longest edge possible is the length of the diagonal of such a tile. By triangle inequality we know the length of that edge is less than 256, which means we can always represent unit steps along a small triangle’s edge with 8 bits.  Therefore, we can represent this fraction as a 1.16 fixed point integer, where the 1 integer bit serves the case of $2\Delta Area = \texttt{1.0x0000}$, whose reciprocal is also $\texttt{1.0x0000}$.
• When normalizing the barycentric coordinates as $e_i * \dfrac{1}{2\Delta Area}$, the edge equations have the format 16.8. This means the result of this multiplication is 17.16. Oops, no longer fits in 32 bits. We need to lower the number of bits in the edge equation for this multiplication to work, but the ideal number of bits to reduce depends on the size of the triangle! For example, a triangle smaller than 1 pixel encoded as 16.8 already is effectively 0.8 due to the integer bits being 0.
• In order to handle both cases of large and small triangles, a floating-point style exponent is computed based on the number of leading zeros in $2\Delta Area$ (using an instruction such as Haswell’s $\texttt{lzcnt}$.) This exponent is used to shift the 16.8 $2\Delta Area$ into a normalized 1.16 representation (“normalized” means the msb is 1).
• From there, we compute the reciprocal of this lower precision area (touching only the mantissa) this gives us a reciprocal with decent behavior no matter the scale of the numbers! Alas, with only 16 bits of precision.
• Note most values have a msb of 0, which means they are “denormal” (or “subnormal”) floating point numbers. In that case, the reciprocal value only requires 16 bits, which will happen for triangles with an area greater than 1.
• If the value has a msb of 1, that means the triangle’s area is smaller than one. Since interpolation errors are harder to notice in such small triangles, we can actually just denormalize the mantissa again (making the msb 0). This reduces the memory requirement from 17 bits (1.16) to 16 bits. Note this shift needs to be taken into account by the exponent.
• Now, in both cases, we have a denormal value, and only 16 bits are required.
• When computing the barycentric, we shift the edge equation by the same exponent used to originally shift $2\Delta Area$. This makes the 16.8 edge equation into a denormal 0.16 value, similarly to the area. By multiplying this denormal edge equation with the reciprocal double triangle area, we can get a value that fits in 16 bits.
• From there, you can use these barycentrics to compute a per-pixel depth value by interpolating the vertices’ $\dfrac{z}{w}$ depths. Since $\dfrac{z}{w}$ is 16 bits, the result of that multiplication fits neatly in 32 bits.
• It also slightly helps to rotate the triangle’s vertices (during setup) so that vertex 0 (in the computation of $d = d_0 + u (d_1 - d_0) + v (d_2 - d_0)$) has the biggest slope. This reduces the values of $u$ and $v$, making it easier to not run out of precision.

Small Triangle Performance

• If you take a model like the teapot, the dragon, the Buddha… those models are made 100% out of small triangles for anything but the most extreme close-up viewpoints. You better have very efficient processing of small triangles, many of which cover only a few pixels on the screen. Consider setting up small triangles in parallel (SIMD?) and skipping right to rasterizing a single fine block or a single coarse block if the triangle is very small.
• Doing trivial reject on small triangles at the coarse block/fine block level seems to make a big difference in performance
• Vectorizing small triangle rasterization is a very helpful. observed 2x speedup from AVX2 implementation.

Large Triangles

• When implementing trivial accept, you can “rotate” the triangle’s vertices and edges to put the non-trivially accepted edges first in the triple of vertices/edges.
• Later, when you need to test against the N non-trivially-accepted edges, you can then test edges 0 to N-1 (inclusive) without having to store the first non-accepted edge or doing modulos.
• When doing edge tests, you MUST have already done the trivial reject and trivial accept tests.
• By only testing non-rejected non-trivially-accepted edges, it means you’re only testing edges which intersect the tile. That means the edge is at most 128 pixels away from any point being tested. That means you can do your edge tests with 32 bits, similar to the small triangle case. You still do the initial setup with 64 bits, but the per-coarse block and per-pixel tests are done with 32 bits.

Large Triangle Barycentrics

• If an edge is not trivially accepted, then its edge equation can be safely computed over the whole tile with no risk of overflow. From there, the barycentric coordinate of that edge equation can be computed by shifting it by the exponent and multiplying by reciprocal 2area, as described earlier in “Small Triangle Barycentrics”.
• If an edge is trivially accepted, then it’s possible for the edge equation’s value to overflow while being evaluated in the tile. This would break barycentric coordinates.
• To fix this, compute some exponent-shifted edge equations per-tile during triangle setup, and set the initial value for the edge equation to 0.
• By initializing the edge equation t0 0 at the corner of the tile, you’re guaranteed it won’t overflow, since the total area of a tile fits in 32 bits. This makes the edge equation a value relative to the corner of the tile, rather than an absolute value.
• To retrieve the absolute value, compute the exponent-shifted relative edge equation per pixel, then add to it the exponent-shifted edge equation that was computed per-tile during setup.
• Note that this is only necessary for edges that are trivially accepted. Non trivially accepted edges are guaranteed not to overflow anyways.

Large Triangle Performance

• Large triangles are very rare, except maybe architecture like floor and walls, or close-up objects. All the cost is in filling in large areas of pixels.

Depth Testing

• Depth is stored in the depth buffer as $\dfrac{z}{w}$, which means you don’t have to convert the depth back to eye space before storing it (unlike most vertex attributes).
• This means you don’t need perspective-correct interpolation of depth, so you can directly treat the edge equations (divided by 2x triangle area) as the barycentric coordinates.
• Also, if you properly implement near/far plane clipping, then you’re guaranteed that all post-clip values will satisfy $0 \leq \dfrac{z}{w} \leq 1$. This fact is useful for reasoning about the required number of fixed point bits to represent some computations.

Visualization of depth for the Stanford Dragon model

Perspective-Correct Barycentric Coordinates

• The raw edge equations aren’t perspective correct barycentric coordinates, because they need to be divided by w to be put back into eye space.
• If your barycentrics are perspective-correct, then you don’t need to divide by $\dfrac{1}{w}$ to bring attributes back into eye space. This is useful if you have many attributes and can reuse the work of computing perspective-correct barycentric coordinates.

Command Buffers

When implementing a command ring buffer, consider the following invariants (good for writing asserts):

• Command ring buffer has 4 members:
1. start ptr: address of first memory location in buffer.
2. end ptr: past-the-end address of buffer.
4. write ptr: address where the next command will be written.
• A command is a contiguous array of dwords in the command buffer.
• That way you can cast the memory address to a struct when interpreting it.
• Initially, read ptr and write ptr are at the same location in memory.
• The write ptr cannot “catch up” to the read ptr. The read ptr can “catch up” to the write ptr.
• Otherwise there is an ambiguity about the meaning of read/write ptr being at the same location.
• The read ptr should never be at past-the-end of the buffer.
• It’s impossible for there to be something for it to read at that point anyways.
• The write ptr should never stay at the past-the-end location.
• Otherwise the read ptr will “catch up” to it there, violating the previous invariant.
• This might be allowed in an intermediate state, in order to allow fully using the buffer without making write ptr catch up to read ptr.
• Whenever setting the write ptr back to the start ptr, consider if the read ptr is already set to start ptr.
• Since write ptr is not allowed to catch up to read, this requires a flush.
• This is generally the case where you filled up a whole buffer of commands without flushing.
• There are three things that need to be fixed before being able to write a command:
1. The command must be able to fit in the command buffer in the first place (can be statically assured).
2. If the read ptr is in front of the write ptr, there must be enough room in between them to write the command (+1, so the write ptr doesn’t catch up to the read ptr)
3. The end of the buffer is not too close to the write ptr.
• You can give up on the insufficiently-sized “slop” at the end of the buffer by dropping a 1-dword meta-command that resets the read ptr to the start ptr when interpreted.
• When giving up slop to go back to start ptr, consider if the read ptr is already at start ptr. After setting the write ptr to start ptr, consider if read ptr is in the way again (as in condition 2).
• Have some code available (eg: commented out) to visualize the buffer and where the read/write ptrs are. You can print this out while reading/writing commands to see where things are crashing or getting stuck in infinite loops.