
To be precise, this is running in a compute shader (rasterizeSwPass.wgsl.ts for the curious). You can think of that as running the GPU in a mode where it's a type of computer with some frustrating limitations, but also the ability to efficiently run thousands of threads in parallel.

This is in contrast to hardware rasterization, where dedicated hardware on the GPU decides which pixels are covered by a given triangle and assigns those pixels to a fragment shader, where the color (and potentially other outputs) is computed and finally written to the render target by the raster op stage (also a bit of specialized hardware).
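
To make the contrast concrete, here is a minimal sketch of what the compute path has to do by hand, in WGSL. To be clear, this is not the project's actual rasterizeSwPass.wgsl.ts: the buffer names, the assumption that triangles arrive already in screen space, and the 16-bit depth / 16-bit id packing are all just illustrative.

  // Illustrative only, not the project's rasterizeSwPass.wgsl.ts.
  // One thread per triangle; triangles assumed already in screen space
  // (x, y in pixels, z in [0, 1)), no clipping or fill rule for brevity.
  struct Tri { v0 : vec3f, v1 : vec3f, v2 : vec3f }

  @group(0) @binding(0) var<storage, read> tris : array<Tri>;
  @group(0) @binding(1) var<storage, read_write> visbuf : array<atomic<u32>>;
  @group(0) @binding(2) var<uniform> screen : vec2u;

  fn edge(a : vec2f, b : vec2f, p : vec2f) -> f32 {
    return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
  }

  @compute @workgroup_size(64)
  fn rasterizeSw(@builtin(global_invocation_id) gid : vec3u) {
    if (gid.x >= arrayLength(&tris)) { return; }
    let t = tris[gid.x];
    let a = t.v0.xy;
    let b = t.v1.xy;
    let c = t.v2.xy;

    let area = edge(a, b, c);
    if (area <= 0.0) { return; }  // back-facing or degenerate

    // Screen-clamped bounding box; for a pixel-sized triangle this is
    // only a handful of candidate pixels.
    let lo = vec2u(clamp(floor(min(a, min(b, c))), vec2f(0.0), vec2f(screen) - 1.0));
    let hi = vec2u(clamp(ceil(max(a, max(b, c))), vec2f(0.0), vec2f(screen) - 1.0));

    for (var y = lo.y; y <= hi.y; y++) {
      for (var x = lo.x; x <= hi.x; x++) {
        let p = vec2f(f32(x) + 0.5, f32(y) + 0.5);
        let w0 = edge(b, c, p);
        let w1 = edge(c, a, p);
        let w2 = edge(a, b, p);
        if (w0 < 0.0 || w1 < 0.0 || w2 < 0.0) { continue; }  // pixel outside

        // Barycentric depth, then pack (1 - z) into the high 16 bits and
        // the triangle id into the low 16, so atomicMax keeps the closest
        // surface. This stands in for the depth test + ROP of the hardware path.
        let z = (w0 * t.v0.z + w1 * t.v1.z + w2 * t.v2.z) / area;
        let packed = (u32(clamp(1.0 - z, 0.0, 1.0) * 65535.0) << 16u) | (gid.x & 0xffffu);
        atomicMax(&visbuf[y * screen.x + x], packed);
      }
    }
  }

The packed atomicMax is a common trick on this path: a compute shader has no depth-test or ROP hardware to lean on, so a single atomic does double duty as the depth test and the visibility write.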

The seminal paper on this is cudaraster [1], which implemented basic 3D rendering in CUDA (the CUDA of 13 years ago is roughly comparable in power to compute shaders today), and basically posed the question: how much does using the specialized rasterization hardware help, compared with just using compute? The answer is roughly 2x, though it depends a lot on the details.

And those details are important. One of the assumptions hardware rasterization relies on for efficiency is that a triangle covers dozens of pixels (among other things, fragments are shaded in 2x2 quads, so a one-pixel triangle wastes most of its fragment shader invocations). In Nanite that assumption is not valid: a great many triangles cover approximately a single pixel, and at that point software/compute approaches actually start beating the hardware.

Nanite, like this project, therefore uses a hybrid approach: hardware rasterization for medium to large triangles, and compute for smaller ones. Both can share the same render target.
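
As a sketch of how such a split can be made (again illustrative, not this project's code; the 16-pixel threshold is made up), a classification pass in WGSL might look like:

  // Illustrative only. Splits triangles by screen-space extent; the
  // 16 pixel threshold is invented here, real engines tune it.
  struct Tri { v0 : vec3f, v1 : vec3f, v2 : vec3f }

  struct Counters { sw : atomic<u32>, hw : atomic<u32> }

  @group(0) @binding(0) var<storage, read> tris : array<Tri>;
  @group(0) @binding(1) var<storage, read_write> counters : Counters;
  @group(0) @binding(2) var<storage, read_write> swList : array<u32>;
  @group(0) @binding(3) var<storage, read_write> hwList : array<u32>;

  @compute @workgroup_size(64)
  fn classify(@builtin(global_invocation_id) gid : vec3u) {
    if (gid.x >= arrayLength(&tris)) { return; }
    let t = tris[gid.x];

    let lo = min(t.v0.xy, min(t.v1.xy, t.v2.xy));
    let hi = max(t.v0.xy, max(t.v1.xy, t.v2.xy));
    let extent = max(hi.x - lo.x, hi.y - lo.y);

    if (extent <= 16.0) {
      // Small: consumed by a compute rasterizer like the one above.
      swList[atomicAdd(&counters.sw, 1u)] = gid.x;
    } else {
      // Big: fed to an (indirect) draw through the hardware rasterizer.
      hwList[atomicAdd(&counters.hw, 1u)] = gid.x;
    }
  }

The hw list would then drive an ordinary (or indirect) draw through the fixed-function rasterizer, while the sw list goes to the compute pass; both can write into the same visibility buffer.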

[1]: Laine & Karras, "High-Performance Software Rasterization on GPUs", High-Performance Graphics 2011. https://research.nvidia.com/publication/2011-08_high-perform...


