
Funny to see a comment on HN raising this exact point, when just ~2 hours ago I was writing inline asm that used `lea` precisely to preserve the carry flag before a jump table! :)
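Roughly the shape of it, for anyone curious (a simplified sketch in GCC/Clang extended asm for x86-64, with made-up names, and a setc standing in for the actual jump table):

    #include <stdint.h>

    /* `add` sets CF; `lea` advances the pointer without touching EFLAGS,
     * so the carry is still live afterwards and can be read (or dispatched
     * on) later. */
    static inline uint64_t add_and_advance(uint64_t acc, uint64_t step,
                                           const uint64_t **p,
                                           unsigned char *carry)
    {
        const uint64_t *ptr = *p;
        __asm__("add  %[step], %[acc]\n\t"   /* acc += step; sets CF        */
                "lea  8(%[ptr]), %[ptr]\n\t" /* ptr += 1; CF left untouched */
                "setc %[cf]"                 /* capture the surviving carry */
                : [acc] "+r"(acc), [ptr] "+r"(ptr), [cf] "=qm"(*carry)
                : [step] "r"(step)
                : "cc");
        *p = ptr;
        return acc;
    }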




I'm curious, what are you working on that requires writing inline assembly?

I'm not them, but whenever I've used it, it's been for arch-specific features like adding a debug breakpoint, synchronization, using system registers, etc.

Never for performance. If I wanted to hand-optimise code I'd be more likely to use SIMD intrinsics, play with C until the compiler does the right thing, or write the entire function in a separate asm file for better highlighting and easier handling of state at the ABI boundary, rather than mid-function like the carry flag mentioned above.
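To make that concrete, this is the sort of thing I mean (a rough GCC/Clang sketch with x86-64 and AArch64 variants; the helper names are made up):

    #include <stdint.h>

    #if defined(__x86_64__)
    #  define DEBUG_TRAP() __asm__ volatile("int3")   /* software breakpoint */
    /* A "system register" example: the timestamp counter. */
    static inline uint64_t cycle_counter(void) {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }
    #elif defined(__aarch64__)
    #  define DEBUG_TRAP() __asm__ volatile("brk #0")
    /* The generic timer counter, read via mrs. */
    static inline uint64_t cycle_counter(void) {
        uint64_t v;
        __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
        return v;
    }
    #endif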


Generally, inline assembly is much easier these days, as a) the compiler can see into it and make optimizations, and b) you don’t have to worry about calling conventions.
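By "see into it" I mean the extended-asm constraint interface: the optimizer knows exactly what an asm statement reads, writes, and clobbers, so it can register-allocate and schedule around it, and even delete it when the result is unused. A minimal example (GCC/Clang, x86-64 with POPCNT available):

    /* No `volatile`, and the constraints describe everything the statement
     * touches, so the compiler is free to hoist it, duplicate it, or drop
     * it entirely if `r` is never used. */
    static inline unsigned long long popcnt64(unsigned long long x)
    {
        unsigned long long r;
        __asm__("popcnt %1, %0" : "=r"(r) : "r"(x) : "cc");
        return r;
    }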

> the compiler can see into it and make optimizations

Those writing assembler typically think, or know, they can do better than the compiler, so that isn’t necessarily a good thing.

(Similarly, veltas' comment above about “play with C until the compiler does the right thing” describes something brittle: you don’t even need to change compiler flags to make it suddenly not do the right thing anymore. On the other hand, when compiling for a different version of the CPU architecture, the compiler can fix things, too.)


It's rare that I see compiler-generated assembly without obvious drawbacks in it. You don't have to be an expert to spot them. But frequently the compiler also finds improvements I wouldn't have thought of. We're in the centaur-chess moment of compilers.

Generally playing with the C until the compiler does the right thing is slightly brittle in terms of performance but not in terms of functionality. Different compiler flags or a different architecture may give you worse performance, but the code will still work.


Centaur-chess?

https://en.wikipedia.org/wiki/Advanced_chess:

“Advanced chess is a form of chess in which each human player uses a computer chess engine to explore the possible results of candidate moves. With this computer assistance, the human player controls and decides the game.

Also called cyborg chess or centaur chess, advanced chess was introduced for the first time by grandmaster Garry Kasparov, with the aim of bringing together human and computer skills to achieve the following results:

- increasing the level of play to heights never before seen in chess;

- producing blunder-free games with the qualities and the beauty of both perfect tactical play and highly meaningful strategic plans;

- offering the public an overview of the mental processes of strong human chess players and powerful chess computers, and the combination of their forces.”


Ah thank you!

Well, I have benchmarks where my hand-written asm (for a fundamental inner function) beat the compiler-generated code by 3× :) Without SIMD, which wasn't applicable to what I was trying to solve.

And that was already after copious `assert_unchecked`s to have the compiler assume as many invariants as it could!


> “play with C until the compiler does the right thing” is brittle

It's brittle depending on your methods. If you understand a little about optimizers and give the compiler the hints it needs to do the right things, then that should work with any modern compiler, and is more portable (and easier) than hand-optimizing in assembly straight away.
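For example, the kinds of hints I mean, in their GCC/Clang spellings (a sketch; the function and its parameters are invented for illustration):

    #include <stddef.h>

    void scale(float *restrict dst, const float *restrict src,
               size_t n, float k)
    {
        /* restrict above: promise that dst and src don't alias. */

        /* Promise alignment so the vectorizer can drop the scalar prologue. */
        float *d = __builtin_assume_aligned(dst, 64);
        const float *s = __builtin_assume_aligned(src, 64);

        /* State an invariant the compiler can't prove (roughly the C-side
         * analogue of Rust's assert_unchecked): n is a multiple of 16. */
        if (n % 16 != 0)
            __builtin_unreachable();

        for (size_t i = 0; i < n; i++)
            d[i] = s[i] * k;
    }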


Well, in my case I had to file an issue with the compiler (LLVM) to fix the bad codegen. Credit to them: they were lightning fast and merged a fix within days.

gcc optimised it correctly though.


Of course you can often beat the compiler; humans still vectorize code better. And there's that interpreter/emulator switch-statement issue I mentioned in the other comment. There are probably a lot of other small niches.

In the general case you're right. Modern compilers are beasts.


Might be an interpreter or an emulator. That’s where you often want to preserve registers or flags and have jump tables.

This is one of the remaining cases where current compilers optimize rather poorly: when you have a tight loop around a huge switch statement, with each case performing a very small operation on common data.

In that case, a human writing assembler can often beat a compiler with a huge margin.


I'm curious if that's still generally the case after things like musttail attributes to help the compiler emit good assembly for well-structured interpreter loops:

https://blog.reverberate.org/2025/02/10/tail-call-updates.ht...
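For reference, a rough sketch of the dispatch style that post describes, using Clang's musttail attribute on the return statement (the opcodes and handlers here are invented):

    #include <stdint.h>

    typedef struct VM VM;
    typedef void (*Handler)(VM *vm, const uint8_t *pc);

    struct VM {
        uint64_t acc;
        Handler  dispatch[256];   /* one handler per opcode */
    };

    /* Each handler ends by tail-calling the next opcode's handler, so the
     * "loop" becomes a chain of guaranteed tail calls instead of one big
     * switch inside a while loop. */
    #define DISPATCH(vm, pc) \
        __attribute__((musttail)) return (vm)->dispatch[*(pc)]((vm), (pc))

    static void op_halt(VM *vm, const uint8_t *pc) { (void)vm; (void)pc; }

    static void op_inc(VM *vm, const uint8_t *pc)
    {
        vm->acc++;
        DISPATCH(vm, pc + 1);
    }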


https://github.com/andrepd/posit-rust

LLVM codegen has almost always been sufficient, but for a routine that essentially amounts to adding two fixed-size bigints (e.g. 1024-bit ints represented as `[u64; 16]`), the codegen was very, very bad.

Writing a jump table by hand literally made the code 3× faster :)
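For a sense of what that routine looks like (an illustrative C sketch, not the actual posit-rust code): the hot path is a fixed-size limb-by-limb addition with a carry chain, which on x86-64 should ideally lower to one add followed by a chain of adc.

    #include <stdint.h>

    typedef struct { uint64_t limb[16]; } u1024;   /* 1024-bit ~ [u64; 16] */

    static inline uint64_t u1024_add(u1024 *r, const u1024 *a, const u1024 *b)
    {
        unsigned long long carry = 0;
        for (int i = 0; i < 16; i++) {
            /* Clang's __builtin_addcll (recent GCC has the __builtin_addc
             * family too) expresses the carry explicitly; with a good
             * backend this becomes add/adc. */
            r->limb[i] = __builtin_addcll(a->limb[i], b->limb[i],
                                          carry, &carry);
        }
        return carry;   /* final carry out */
    }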


I worked on a C codebase once, integrating an I2C sensor. The vendor only had example code in asm, so I had to learn to write inline asm.

It still happens in 2025



