Pretty general question, but what has your approach been for coupling ThreeJS + React w/ a Rust/Wasm kernel for mesh generation? E.g. do you have Wasm own the memory and you give ThreeJS views of the memory to upload to GPU?
We don't need to update the whole scene at once, so when a piece of geometry changes we rebuild its buffer and send it over to TypeScript, where the three.js mesh owns the buffer. By being careful about what actually needs to be updated, we can keep things interactive.
Currently we simply copy a slice of the heap's ArrayBuffer from WASM to JS. In the past we exposed the heap slice directly, but that was technically unsafe (views into the heap are invalidated when the Wasm memory grows), and doing the copy did not hurt performance in any measurable way.
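For anyone wondering what that hand-off can look like in practice, here's a rough TypeScript sketch. The module and export names (init, generate_mesh, vertex_ptr, vertex_count) are made up for illustration, assuming a wasm-bindgen style package; it's not the actual API of the project being discussed:

  import * as THREE from "three";
  // Hypothetical wasm-bindgen package; init() resolves with the wasm exports,
  // including the linear memory.
  import init, * as kernel from "./mesh_kernel";

  async function rebuildMesh(mesh: THREE.Mesh): Promise<void> {
    const wasm = await init();
    kernel.generate_mesh();

    // A view into the Wasm heap: byte offset + number of floats.
    const ptr = kernel.vertex_ptr();
    const count = kernel.vertex_count();
    const heapView = new Float32Array(wasm.memory.buffer, ptr, count);

    // Copy out of the heap; the view above is invalidated if the Wasm
    // memory grows, so the copy is what the three.js geometry owns.
    const vertices = heapView.slice();

    const geometry = new THREE.BufferGeometry();
    geometry.setAttribute("position", new THREE.BufferAttribute(vertices, 3));
    mesh.geometry.dispose();
    mesh.geometry = geometry;
  }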
Author of the WebGL volume rendering tutorial [0] you mentioned in the readme here, great work!
Working in WebGL/JS is nice since you can deploy it everywhere, but as you've found it can be really hard for graphics programming, because there are very few tools for doing real GPU/graphics debugging in WebGL. The only one I know of is [1], and I've had limited success with it.
WebGPU is a great next step, it provides a modern GPU API (so if you want to learn Metal, DX12, Vulkan, they're more familiar), and modern GPU functionality like storage buffers and compute shaders, not to mention lower overhead and better performance. The WebGPU inspector [2] also looks to provide a GPU profiler/debugger for web that aims to be on par with native options. I just tried it out on a small project I have and it looks really useful. Another benefit of WebGPU is that it maps more clearly to Metal/DX12/Vulkan, so you can use native tools to profile it through Chrome [3].
I think it would be worth learning C++ and a native graphics API, you'll get access to the much more powerful graphics debugging & profiling features provided by native tools (PIX, RenderDoc, Nvidia Nsight, Xcode, etc.) and functionality beyond what even WebGPU exposes.
Personally, I have come "full circle": I started with C++ and OpenGL, then DX12/Vulkan/Metal, then started doing more WebGL/WebGPU and JS/TS to "run everywhere", and now I'm back writing C++ but using WebGL/WebGPU and compiling to WebAssembly to still run everywhere (and native for tools).
With WebGPU, you could program in C++ (or Rust) and compile to both native (for access to debuggers and tools) and Wasm (for wide deployment on the web). This is one of the aspects of WebGPU that is most exciting to me. There's a great tutorial on developing WebGPU w/ C++ [4], and one on using it from JS/TS [5].
Wow! First of all, thank you for your amazing blog posts and tutorials! I wouldn't have been able to make it this far without them. Seriously, I was stuck for so long until a random Google search linked me to that WebGL ray-casting article you wrote. (I'd pin your comment if I could.)
The funny thing is that I was getting more confident about using the JS + WebGL/WebGPU ecosystem for graphics programming after having read your posts. Very interesting to hear that you've come full circle back to C++ + WebGL/WebGPU + WebAssembly. I'll look at the options more closely as I head down this journey. Thank you for your tips and advice!
Edit: Perhaps you'd find my "What is WebGPU" video on YouTube interesting. I'd love to get it fact-checked by someone who's been doing WebGL/WebGPU way longer than most people! I only got into this field ~2 years ago.
Sure I'd be happy to check it out, my email's in my profile (or Github/website).
There are some tradeoffs with WebAssembly too (not sharing the same memory as JS/TS is the biggest one), and debugging can be a bit tough, though there's now a good VS Code plugin for it [0]. Part of the reason I moved back to C++ -> Wasm was the performance improvement of Wasm over JS/TS, but the cross-compilation to native/web was the main motivator.
It's interesting to hear that C++ is faster even with the overhead of moving data between WASM and JS/TS. I'm not yet ready to "take the leap" and learn C++ + Metal + Xcode + WASM, because those are some big hurdles to jump through (especially just in my free time), but you do raise some good points.
I'm certain you could turn this knowledge into a blog post and help many more engineers who are silently struggling through this path. Self-studying graphics programming is tough!
It should pop up first on YouTube if you search "What is WebGPU Suboptimal Engineer", but I'll link it here[0] in case anyone else wants to watch it. (No need for you to actually fact-check it. I didn't mean to put random work on your plate on a Sunday haha.)
Yeah! I've started to look at some WebGPU compute applications with other students in my group, and I think there could be some cool use cases, like the ones you mention. It sounds a bit odd, but yeah, WebGPU on native (by directly using Dawn or wgpu-rs) is actually pretty compelling as a cross-platform low-level graphics API.
What's really cool is that compute and rendering using WebGPU can get near-native performance. So a lot of scientific applications (which typically rely on more FLOPs/parallel processing) can be implemented in WebGPU compute without sacrificing much performance. I'm not sure how many simulations would be ported to WebGPU, since they usually end up targeting large scale HPC systems and CUDA, but for visualization applications I think the use case is pretty compelling, especially for portability and ease of distribution.

On the compute side, I implemented a data-parallel Marching Cubes example: https://github.com/Twinklebear/webgpu-experiments , and found the performance is on par with my native Vulkan version. You can try it out here: https://www.willusher.io/webgpu-experiments/marching_cubes.h... . There is a pretty high first-run overhead, but try moving the slider around some to see the extraction performance after that.

WebGPU for parallel compute, combined with WebASM for serial code (or just easily porting older native libs), will make the browser a lot more capable for compute-heavy applications. You could also combine these more capable browser clients with a remote compute server, where the server does some heavier processing while the client does medium-scale work to reduce latency, or works on representative subsets of the data.
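For anyone curious what WebGPU compute looks like from JS/TS, here's a minimal sketch of a storage-buffer kernel dispatch using the current WGSL syntax. It's a generic "double every element" kernel I made up, not code from the marching cubes demo above:

  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter!.requestDevice();

  // WGSL compute shader: doubles every element of a storage buffer.
  const shader = device.createShaderModule({
    code: `
      @group(0) @binding(0) var<storage, read_write> data: array<f32>;
      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
        data[gid.x] = data[gid.x] * 2.0;
      }`,
  });

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: shader, entryPoint: "main" },
  });

  const n = 1024;
  const dataBuffer = device.createBuffer({
    size: n * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(dataBuffer, 0, new Float32Array(n).fill(1));

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: dataBuffer } }],
  });

  // Staging buffer so the results can be mapped and read on the CPU.
  const readback = device.createBuffer({
    size: n * 4,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(n / 64);
  pass.end();
  encoder.copyBufferToBuffer(dataBuffer, 0, readback, 0, n * 4);
  device.queue.submit([encoder.finish()]);

  await readback.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readback.getMappedRange()); // all 2s

A marching cubes pass builds on the same storage buffer + dispatch pattern, just with more buffers and multiple passes.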
SIMD is used a ton in rendering applications and starting to see more use in games too (through ISPC for example).
I'd add to the list:
- Embree: https://www.embree.org/ Open source high-performance ray tracing kernels for CPUs using SIMD.
- OpenVKL: https://www.openvkl.org/ Similar to Embree (high-performance ray tracing kernels), but for volume traversal and sampling.
- ISPC: https://ispc.github.io/ An open-source compiler for an SPMD language which compiles to efficient SIMD code.
- OSPRay: http://www.ospray.org/ A large project using SIMD throughout (via ISPC) for real time ray tracing for scientific visualization and physically based rendering.
- Open Image Denoise: https://openimagedenoise.github.io/ An open-source image denoiser using SIMD (via ISPC) for some image processing and denoising.
- (my own project) ChameleonRT: https://github.com/Twinklebear/ChameleonRT has an Embree + ISPC backend, using Embree for SIMD ray traversal and ISPC for vectorizing the rest of the path tracer (shading, texture sampling).
Starting to see? Back in Ye Olde 586 Days of the late 1990s, MMX was added to the Pentium architecture pretty much exclusively for 3D games and real-time audio/video decoding. (This was back when the act of playing an MP3 was no small chore for the average consumer CPU.) Intel made quite a big deal over MMX including millions of dollars in TV ads aimed at the general population, despite the fact that software had to be built specifically to use MMX and that only certain kinds of software could benefit from it.
"MMX was useless for games. MMX is Integer math only, good for DSP, things like audio filters, or making a softmodem out of your sound card. Unsuitable for accelerating 3D games. Whats worse MMX has no dedicated registers, and instead reuses/shares FPU ones, this means you cant use MMX and FPU (all 3D code pre Direct3D 7 Hardware T&L) at the same time.
...
Funnily enough, AMD's 1998 3DNow! did actually add floating-point support to MMX and was useful for 3D acceleration until hardware T&L came along two years later.
Intel paid a few dev houses to release make-believe MMX enhancements, like POD (1997).
A sixth of the box was covered with Intel MMX advertising while the game used MMX only for some sound effects. Intel repeated this trick in '99 when introducing the Pentium 3 with SSE: it commissioned Rage Software to build a demo piece showcasing the P3 during Comdex Fall. It worked... by cheating on graphics detail ;-) Quoting hardware.fr: "But looking closely at the demo, we notice - as you can see on the screenshots - that the SSE version is less detailed than the non-SSE version (see the ground). Intel, are you trying to pull the wool over the journalists' eyes?" Of course, AnandTech used this cheating demo, never publicly released and only pretending to be a game, in all of their Pentium 3 tests for over a year.
MMX was one of Intel's many Native Signal Processing (NSP) initiatives. They had plenty of ideas for making PCs dependent on Intel hardware, something Nvidia is really good at these days (PhysX, CUDA, HairWorks, GameWorks). Thankfully Microsoft was quick to kill their other fancy plans: https://www.theregister.co.uk/1998/11/11/microsoft_said_drop... Microsoft did the same thing to Creative with Vista killing DirectAudio, out of fear that one company had a grip on a positional-audio monopoly on their platform.
> ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code
I've been learning ISPC lately and it does seem like a wonderful solution: you avoid having to build separate implementations for every instruction set, and you avoid per-compiler massaging to get the vectorisation opportunities recognised. The arguments for having a domain-specific language variant, and the account of why it was written (https://pharr.org/matt/blog/2018/04/30/ispc-all.html is a good read), are persuasive.
However, outside of the projects in the above list, it doesn't seem to have very wide usage. There are still commits coming in and some issues getting responses, so it doesn't seem dead, but many issues are untouched or untriaged. There isn't much discussion about using it, or people asking for advice. The mailing list gets about a message a month.
Is it just an extremely specialised domain? Is CUDA/OpenCL a more efficient solution for most cases where one would otherwise consider it? Are there too many ASM/intrinsics experts out there to bother learning it?
ISPC is really awesome, but you're right it is much less known than CUDA/OpenCL. Part of that might just be lack of marketing effort and focus (you don't hear much about it compared to e.g. CUDA) and the team working on it is far smaller than that on CUDA. There has been some wider adoption, like Unreal Engine 4 using it now: https://devmesh.intel.com/projects/intel-ispc-in-unreal-engi... which is super cool, so hopefully we'll see more of that.
As far as support from other languages goes, I did write this wrapper for using ISPC from Rust: https://github.com/Twinklebear/ispc-rs (but that's just me again), and there has been work on a WebASM+SIMD backend, which is really exciting. Intel also has an ISPC-based texture compressor (https://github.com/GameTechDev/ISPCTextureCompressor) which I think has some popularity.
However, the domain is pretty specialized, and I think the fraction of people who really care about CPU performance and are willing to port or write part of their code in another language is smaller still. It's also possible that a lot of those who would do so have their own hand written intrinsics wrappers already. Migrating to ISPC would reduce a lot of maintenance effort on such projects, but when they already have momentum in the other direction it can be harder to switch. I think that on the CPU ISPC is easier and better than OpenCL for performance and tight integration with the "host" language, since you can directly share pointers and even call back and forth between the "host" and "kernel".
At work, I had a project involving a DSL for Monte Carlo simulations. The DSL was an internal DSL in Scala, our interpreter was in Scala, and we transpiled to ISPC (for servers/VMs that didn't expose a GPU) and OpenCL.
I generally liked ISPC, but I really didn't like that it tried to look as close as possible to C while departing from C in unnecessary ways. With Monte Carlo simulations, we deal with a lot of probabilities represented as doubles in the range [0.0, 1.0]. The biggest pain is that operations between a double and any integral type cast the double to the integral type, whereas in C the integral type gets implicitly cast to a double. I understand the implicit casting rules were changed to give the fastest speed rather than to minimize worst-case rounding error. I could understand getting rid of implicit casts entirely, or maybe changing the rules to improve accuracy and trusting the user to find any resulting performance problems with a profiler. However, in our case, uint32_t * double = (uint32_t) 0, which then gets implicitly cast back to a double when assigned to a variable. My intern was beating his head against the wall for the better part of an afternoon before I gave him a bit of debugging help: all of his probabilities were coming out as 0% or 100% for his component.
I actually emailed the authors with a bug report when I found the implicit casting rules differed so radically from C and were in the direction away from accuracy. (Note there's no rounding error when converting uint32_t to a 64-bit IEEE-754 double.) They were very nice, and pointed us to where this behavior was documented.
If you're going out of your way to make your language look like C and interoperate seamlessly with C, you should have really strong justifications for the places where you radically depart from C's semantics.
Is it? I haven't heard about it actually being popular anywhere. It definitely works well, but I haven't seen it talked about much except in the case of Embree, Intel's ray tracing library. It doesn't seem like there is any funding for it, though it already works so well that it doesn't seem to need big leaps in progress to be valuable.
Not necessarily. There are implementations which don't even take advantage of 4/8-byte copying. We wanted to have something uniform. But yes, you are right about glibc or macOS.
Also, from the strncpy man page:
strlcpy()

    Some systems (the BSDs, Solaris, and others) provide the following function:

        size_t strlcpy(char *dest, const char *src, size_t size);

    This function is similar to strncpy(), but it copies at most size-1 bytes to dest, always adds a terminating null byte, and does not pad the target with (further) null bytes. This function fixes some of the problems of strcpy() and strncpy(), but the caller must still handle the possibility of data loss if size is too small. The return value of the function is the length of src, which allows truncation to be easily detected: if the return value is greater than or equal to size, truncation occurred. If loss of data matters, the caller must either check the arguments before the call, or test the function return value. strlcpy() is not present in glibc and is not standardized by POSIX, but is available on Linux via the libbsd library.
I'm not sure which specific excerpt you're referring to, but I have a good idea of the many functions that libraries have come up with to sling characters from one buffer to another, plus I read your implementation and the man page snippet you linked above. I'm still not seeing why you can't replace the code between lines 881 and 902 with one of the appropriate copying routines; you quite literally have a source, destination, and length and you can fix up the last NUL byte right after the call. The standard library's function will be vectorized regardless of how your compiler was feeling that day, and it's probably smarter than yours (glibc, for example, does a "small copy" up to alignment before it launches into the vectorized stuff, rather than skipping it entirely if the buffers aren't aligned). And your function does have undefined behavior: you pun a char * to a ulong *.
> The standard library's function will be vectorized regardless of how your compiler was feeling that day
We were talking about glibc. I mentioned there are libraries which do not do _any_ optimization other than a byte-by-byte copy.
> I'm still not seeing why you can't replace the code between lines 881 and 902
Because you are considering only glibc.
And yes, we can do a lot of things. But the function and copying buffers are not the top priority for us ATM. I shared it as an example in the context of the current topic. Not all code is supposed to match your preferences.
> char * to a ulong *
Both the source and destination are guarded by length and alignment requirement checks.
> Not all code is supposed to match your preferences.
Yikes, sorry if I came off as trying to force my opinion on your project. I'm just trying to understand the rationale behind the choices you made, since I've (clearly) never seen anything like it. (If I was genuinely interested in trying to modify your project to my desires, I hope you can believe I'd be kind enough to dig through the project to see if I could figure this out myself, then send a patch with rationale for you to decide whether you wanted it or not, rather than yell at you on Hacker News to fix it.) But to your points:
> We talked glibc. I mentioned there are libraries which do not do _any_ optimization other than a byte by byte copy.
I haven't actually seen one for quite a while–most of the libcs that I'm familiar with (glibc, macOS's libc, musl, libroot, the various BSD libcs, Bionic) have some sort of vectorized code. I'm curious if the project can run on some obscure system that I'm not considering ;)
> Both the source and destination guarded by length and alignment requirement checks.
Perhaps we have a misunderstanding here: I'm saying it's undefined by the C standard, as the pointer cast is a strict aliasing violation regardless of the checks. It will generally compile correctly as char * can alias any type, so the compiler will probably be unable to find the undefined behavior, but it's technically illegal. (I would assume this is one of the many reasons most libcs implement their string routines in assembly.)
> send a patch with rationale for you to decide whether you wanted it or not
I would really appreciate that. And I do understand your intention is good.
The problem I see with geeky forums is that there are just too many people trying to force their ideas on you at every step and expecting you to implement them. So it's kind of a standard reply from my side.
It’s a bit too late for me to be writing string manipulation code in C, but I’ll see if I can take a look at this tomorrow. It’ll probably just be a replacement of the copying part with memcpy, and a benchmark if I can find one.
Yeah, musl just vectorizes mildly when certain GNU C extensions are available. Presumably Rich didn't want to write out another version in assembly. (It really is a shame that strncpy returns dest.)
I try not to include C or C++ projects other than for educational purposes (like the Mandelbrot set), because one of my life's goals is to help the world transition to a C- and C++-free world (other than for kernels...).
I believe that my role is to promote projects which are "building the new world", and thus we need to abandon and port away from all forms of insecure core code.
So in an article about high/extreme performance systems, you're ignoring the vast majority of them because you don't agree with the tool used to achieve said performance? What..?
Unfortunately they surely do, because a large set of developers writes C++ code full of C idioms.
Which is why Google has thrown in the towel: Android 11 will require hardware memory tagging for native code, and now everything is compiled with FORTIFY enabled.
It was always higher performance than e.g. Pascal or Basic on any relevant platform (the cost was lack of error checking, e.g. array bounds).
And it was slower than FORTRAN on most 32-bit platforms such as DEC, Sun, and IBM Unix workstations, VAXen, and mainframes - but it was still the speed king on the most prevalent platform of the time, the 8086/80286 and friends.
Only as an urban myth spread around by the C crowd.
As a user of every Borland product until they changed to Inprise, I can say that was definitely not the case. The Pascal and Basic compilers provided enough customization points.
When one of them wasn't fast enough versus Assembly, none of them were.
I used to have fun showing C dudes in demoscene parties how to optimize code.
Now, if you are speaking about the dying days of MS-DOS, when everyone was jumping into 32-bit extenders with Watcom C++, then we are already in another chapter of 16-bit compiler history.
I used TP from 3.0 to 7.0 and a little bit of Delphi 1, plus contemporary Turbo C; I dropped to assembly often, disabled TP bounds checking often, and was well aware of all these controls.
Parsing with a *ptr++ in TC was not matched by TP until, IIRC, v7; 16-bit Watcom often produced way better code than either TP or TC.
And, as you say, when speed was really needed you dropped to assembly; no compiler at the time would properly generate "lodsb" inside a loop, although Watcom did in its late Win3-target days, IIRC.
I cannot say I ever bothered to benchmark parsing algorithms across languages in MS-DOS, so maybe that was a case where Turbo C might have won a couple of micro-benchmarks.
That was just an example. In general, properly written TP code (properly configured) was on par with properly written TC code, and both were slower than properly written Watcom code in my experience - I did them all and switched frequently.
Parsing was one example where C shone above Pascal, and there were others. My experience was Watcom was consistently better, but in general C was sometimes easier/faster, Pascal was rarely easier/faster, and if speed mattered ASM was the only way.
Well, as I mentioned in several comments, in my part of the world, in an age when a BBS was the best we could get for going online, Watcom wasn't even on my radar until MS-DOS 32-bit extenders became relevant.
So we are forgetting the complete 8-bit generation, and three quarters of the MS-DOS lifetime.
During the 8-bit days, all games that mattered were written in Assembly.
During the 16-bit days, Pascal, Basic, C, Modula-2, and AMOS were the "Unity" of early '90s game developers, with serious games still being written in Assembly.
The switch to C occurred much later, at the end of the MS-DOS lifetime, when 386s and 486s were widespread enough, thanks to successes like Doom and Abrash's books.
Easy to check from the pouet, hugi, breakout, Assembly, or GDC postmortem archives.
The person you replied to said that C was the language of choice for speed and not rivaled by Pascal or Basic. What games were written in Pascal or Basic and known to be competitive with other high-end games of the time?
"Allen: Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization...The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue....
Seibel: Do you think C is a reasonable language if they had restricted its use to operating-system kernels?
Allen: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve. By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities."
-- Excerpted from: Peter Seibel. Coders at Work: Reflections on the Craft of Programming
Back to your games list,
Most strategy games from SSI used compiled Basic- and Pascal-based engines. Only at the very end did they switch to C/C++.
Apogee wrote several games in Turbo Pascal.
The games released by the Oliver Twins on the BBC Micro used a mix of Basic and Assembly; they eventually went on to found Blitz Games Studios.
If one considers OSes to have the same performance requirements as games: Apple's Lisa and Mac OSes were written in a mix of Object Pascal and Assembly.
Also related to games, Adobe Photoshop was initially written in Pascal before going cross platform.
EDIT: Forgot to add some demos as well,
Demos from Denthor, tpolm.
Anything from Triton, and the first games from their Starbreeze studio.
> C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities.
How is C to blame for universities not teaching compilers?
You didn't list any actual game titles, just game makers.
Also quoting someone saying that C destroyed the ability to make compiler optimizations is a little strange when that has been at the core of most software for decades. It's bizarre how much you try to argue about things with mountains of evidence to the contrary.
While I don't necessarily agree with his claims, it is true that there's a huge gap of about 10-15 years between when FORTRAN compilers did some optimizations and when C compilers were able to do them (and only if you properly annotated things with __restrict, etc). I used FORTRAN77 compilers in the early 1990s that did vectorization / pipelining of the kind C compilers started doing in the last decade.
The main reason, though, is that in FORTRAN, the aliasing rules allow the compiler to assume basically anything, whereas C has sequence points and a (super weak) memory model which don't. But I wouldn't say it is C's fault.
Apparently going into the history of the games produced by those game makers is asking too much.
That someone has done more for improving the computing world than either of us ever will.
See, that is the thing with online forums: I give my point of view and personal lifetime experience, someone like you dismisses it, then I reply, you dismiss it again as not fitting your view of the world and ask for yet another set of whatever, and I would rather just watch Netflix, as I have better things to do with my life than win online discussions.
In my opinion, we should instead focus on hardware and experiment more with different kinds of CPUs, memory, co-processors, etc. The key to newer software systems is newer kinds of hardware, for which you can write new experimental systems in a language of your own design.
The sky is the limit, and there is so much to do! Transactional memory, massively multicore computers, hardware built on predicate logic, neuromorphic computers, and whatnot.
We are still mostly stuck with the CPU and memory designs of old.
I have no doubt that secure software can be written in C, but it's not the norm, and it's too easy for mere mortals to introduce vulnerabilities in C.
A cool feature of matplotlib I recently learned about is that it supports LaTeX for text rendering [1]. You can go as far as rendering LaTeX math formatting for titles/labels, or just have the plot fonts match your text and/or figure captions so the figure fits nicely into your paper.
I've recently started using this option in gnuplot using the epslatex terminal [1]. Makes for very attractive plots and is relatively simple to use. For those looking for a Matplotlib alternative, I highly recommend it.
This is really cool! Sorry if this is kind of a dumb question but I know nothing about web development: can I use this without running node on the server? I'd like to use this for a WebGL distance field renderer I'm working on but host things on my Github page and thus can only do client side stuff. Getting a server is a possibility but for just one tiny project it doesn't seem worth it.
Yes, you can absolutely run React components without the use of a server. You can edit an html file locally and put a react component in there. That's usually the first thing most React tutorials have you do.
If you're using a backend server, you're just rendering the React stuff to HTML and sending it to the client.
Yeah, you totally can! It's all front-end with React and Webpack. You can use Webpack in development and just have it build the static HTML and JS files for gh-pages. That's how the documentation site is built :)
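As a rough sketch of what that entry point looks like (the component name and element id are made up, and I'm assuming the classic ReactDOM.render API), the file Webpack bundles into a static bundle.js is just:

  import * as React from "react";
  import * as ReactDOM from "react-dom";

  // Hypothetical component wrapping the WebGL canvas for the renderer.
  function DistanceFieldViewer() {
    return <canvas id="webgl-canvas" width={800} height={600} />;
  }

  // index.html only needs a <div id="root"></div> and a <script> tag for the
  // bundle; gh-pages serves both as static files, so no Node server at runtime.
  ReactDOM.render(<DistanceFieldViewer />, document.getElementById("root"));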
You're right, these images are rendered with path tracing (I also have a Whitted integrator). However, tray_rust is a ray tracer in the general sense, where "ray tracing" refers to a family of methods that all involve tracing rays to render an image, e.g. Whitted recursive ray tracing, path tracing, bidirectional path tracing, photon mapping, Metropolis light transport, vertex connection and merging, etc. (I hope to implement some of these methods as well!)