Depending on how low-level your code is, this... may not work out in those terms.
In other words, I’d say that if you actually want good software—and that includes making sure its speed falls within a reasonable factor of the napkin-math theoretical maximum achievable on the platform—your three steps can easily constitute three entire rewrites or at least substantial refactors. You might well need to rearchitect if the “working well” version has multiple small loops split by domain-level concern when the hardware really wants a single large one, or if you’re doing a lot of pointer-chasing and need to flatten the whole thing into a single buffer in preorder, or if your interface assumes per-byte ops where SIMD can be applied.
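To make the pointer-chasing point concrete, here's a minimal C sketch (the names and the data shape are hypothetical, purely for illustration): the “works well” version walks a heap-allocated tree node by node, while the rearchitected version sums the same values out of one flat preorder buffer. Getting from the first shape to the second changes how the data is built and owned, not just how it is read, which is why it tends to be a rewrite rather than an optimization pass.

    #include <stddef.h>

    /* "Works well" version: every node is its own heap allocation,
       so the traversal is a chain of dependent cache misses. */
    struct node {
        int value;
        struct node *left, *right;
    };

    int sum_tree(const struct node *n) {
        if (!n) return 0;
        return n->value + sum_tree(n->left) + sum_tree(n->right);
    }

    /* Rearchitected version: the same values flattened into a single
       buffer in preorder; one linear loop that the cache (and an
       auto-vectorizer) can actually do something with. */
    int sum_flat(const int *values, size_t count) {
        int total = 0;
        for (size_t i = 0; i < count; i++)
            total += values[i];
        return total;
    }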
This is not a condemnation of the strategy, mind you. Crap code is valuable and I wish I were better at it. I just disagree that the transition from step 2 to step 3 can be described as an optimization pass. If that’s what you limit yourself to, you’ll quite likely be forced to leave at least an order of magnitude’s worth of performance on the table.
And yes, most consumer software is very much not good by that definition.
(For instance, I’m expecting that the Ladybird devs will be able to get their browser to work well for daily tasks—which I would count a tremendous achievement—but I’m not optimistic about it then becoming any faster than the state of the art even ten or fifteen years ago.)
Some optimization problems require an entire PhD dissertation and a research budget to actually solve, so some algorithms demand far more effort on this step than is reasonable for most products. As mentioned, sometimes you can combine all of these into one step -- when you know the domains well.
Sometimes there are even completely separate people working on each step... separated by time and space.
In any case, most software generally stops at (2) simply because the effort toward (3) isn't worth it -- for example, there's very little point in spending two weeks optimizing a report-generation job that runs in the middle of the night, once a month. At some point it might become worth it, but usually not anytime soon.