I think the author downplays the significance of his work because it only applies to "small neural networks". There are a lot of use cases that can benefit from this type of optimization. Discovering how to use an undocumented fast accelerator available on millions of devices is very valuable.
I get the value of the common APIs, but as a developer how do you deal with the wide range of performance in different form factors and product generations? Is there some way to gracefully adapt the same models to a specific device’s capabilities?
There are a bunch of easy ways to scale neural nets, quantization and distillation being the main approaches (or some combination of the two). Both typically require more training time, but not much more human effort.
You can normally expect to get way more than half the 'outcome' from a neural net with half the RAM/compute/time/power budget, so neural nets scale 'down' pretty well.
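For a concrete example of how little human effort the quantization side takes, here's a rough PyTorch sketch (the toy model and layer choices are just placeholders, not anything from the article):

    import torch
    import torch.nn as nn

    # toy "full-size" model standing in for whatever you actually trained
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # post-training dynamic quantization: weights stored as int8,
    # activations quantized on the fly -- roughly 4x smaller weights
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # the inference API is unchanged
    out = quantized(torch.randn(1, 512))

Distillation takes more work (you need a training loop with the big model as teacher), but it's the same idea: trade extra training time for a model that fits the smaller budget.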
I think you're right and you're wrong; it's a bit more complicated.
ML is one of the few applications that benefit from platform-specific optimizations, so if you need every ounce of performance, you have your choice of which walled garden you want to tether your application to. The "lock-in" comes from the specific capabilities of your special-purpose hardware, and for serious applications, you're already thinking hard about whether to design your entire implementation around Apple, NVidia, Google/TPU, or even Android devices. For big models, platform-specific needs influence every aspect of model design, including data/model sharding, quantization, training loops...
For non-scientific applications, it's usual practice to train your model in platform-agnostic ways using PyTorch or Tensorflow or whatever and then deploy it to devices in platform-specific ways, whether that's XLA, CoreML, Edge TPU, Android NNAPI, TensorflowJS, or hell, custom-written GLSL shaders or whatever.
We're just starting to see cross-platform frameworks that abstract model inference: TFLite, PyTorch Mobile, ONNX. To their credit, CoreML can act as a backend for any of these, so you don't even need to worry about your platform.
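As a sketch of that train-agnostic / deploy-specific flow for the Apple case (assuming coremltools and a traceable PyTorch model; exact arguments vary a bit between versions):

    import torch
    import torchvision
    import coremltools as ct

    # train (or load) a platform-agnostic PyTorch model...
    # (weights=None needs a recent torchvision; older ones use pretrained=False)
    model = torchvision.models.mobilenet_v2(weights=None).eval()

    # ...then trace it and hand it to the platform-specific converter
    example = torch.randn(1, 3, 224, 224)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=example.shape)],
        convert_to="mlprogram",
    )
    mlmodel.save("MobileNetV2.mlpackage")

The equivalent exists for every backend listed above; the model itself stays portable, only the last conversion step is platform-specific.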
Every platform is a golden cage in some respect. Ask any business that is stuck on ancient Win32 or even DOS applications, source code long gone. (Looking at you, my local McDonalds, Menards, Tractor Supply)…
The worst thing is, their users don’t even seem to be totally happy with the state of affairs! It’s like they don’t even realize their preferences are wrong. :(
That 285 is listed as (2:1 sparse) which means it's only valid for matrices where 2 out of every 4 numbers are zero. For dense matrices it's half that.
Are 2:1 sparse matrices a common thing? It seems weird, like clearly that’s not sparse enough to want to use, like, sparse matrix “CSR” style storage or something, haha. I would just treat it as dense I guess.
They aren't. As far as I can tell, Nvidia does this to be able to double the number of TFlops they put on their website. (This might be a little unfair; the real reason is that in ML it might be possible to train a NN such that your matrices have this structure, but I haven't seen anyone other than Nvidia use it.)
What you might do is train using dense matrices, then sparsify those (pick the 2 out of each set of 4 weights that are closest to zero, mask them out), then do a few more training iterations with the mask in place.
It turns out that even without the extra training iterations you often lose surprisingly little in terms of quality of output. In reality you can sparsify a lot more, but 2 out of 4 is so simple and easy to implement in hardware that more complex schemes are much harder to justify.
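In case it's not obvious what the masking step looks like, here's a rough sketch (my own code, not NVIDIA's tooling; it just zeroes the 2 smallest-magnitude weights in every group of 4):

    import torch

    def mask_2_of_4(w):
        """Zero the 2 smallest-magnitude weights in every group of 4 along
        the last dimension (assumes that dimension is divisible by 4)."""
        groups = w.reshape(-1, 4)
        # indices of the 2 largest |w| in each group of 4
        keep = groups.abs().topk(2, dim=1).indices
        mask = torch.zeros_like(groups)
        mask.scatter_(1, keep, 1.0)
        return (groups * mask).reshape(w.shape), mask.reshape(w.shape)

    w = torch.randn(8, 16)              # a dense weight matrix from training
    w_sparse, mask = mask_2_of_4(w)
    # then fine-tune for a few iterations, re-applying the mask after each
    # optimizer step, so the surviving weights compensate for the pruned ones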
However, small matmuls (say, <2048 bytes in the K dimension) won't get anywhere near 2x performance.
I’m trying to think of cases where it might accidentally come up, and all I can think of is something like “oops I used complex but my values are actually real.”
The 1.5 here is for a single core, though. So if we assume that a performance core on an M1 is around 7.5 watts (I'm not actually sure; it seems like a reasonable upper bound, though, if a whole M1 mini is around 39 watts), we'd be looking at around 750 watts to match. Which seems like a surprisingly non-crazy amount of power given these are 32-bit flops, unlike the 16-bit in the RTX 3090, and they come from a CPU.
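Rough back-of-the-envelope for where that ~750 W comes from (assuming the comparison point is the 3090's dense fp16 tensor throughput, i.e. half of the 285 sparse figure upthread, and the ~1.5 TFLOPS / ~7.5 W per-core guesses above):

    tflops_per_core = 1.5
    watts_per_core = 7.5
    rtx3090_dense_fp16 = 285 / 2                 # ~142.5 TFLOPS

    cores_needed = rtx3090_dense_fp16 / tflops_per_core   # ~95 cores
    print(cores_needed * watts_per_core)                   # ~710 W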
I tried gemm-benchmark on my M1 Max, and it took 22 W to hit 2.2 TFLOPS with AMX (Accelerate) or 36 W to hit 270 GFLOPS with NEON (OpenBLAS).
So that's actually just about as power-efficient for fp32 as a 3090, which according to Wikipedia is 35 TFLOPS in 350 W. Supposedly AMX can do a 2x rate for fp16, as opposed to the 3090's 4x rate, so maybe 2x less efficient than a 3090 for fp16.
Has it been verified that they actually use these instructions in Accelerate.framework? I just benchmarked this on my 2019 Intel i9 MBP and got the following speeds for 128x128 matrices, 32 repeats:
cblas_sgemm: 36 GFLOP/s
vDSP_mmul: 41 GFLOP/s
That's a pretty big deal if these functions are >30x faster on the M1...!
Edit: that seems to be verified in the tlkh.dev blog post above. Interestingly, I ran the same code on my bargain-basement 2020 iPhone SE and got 259 GFLOP/s! These Apple devices are pretty mind-blowing.
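If anyone wants to reproduce this without writing the C harness, here's roughly what I mean by GFLOP/s, as a Python sketch (numpy on macOS typically goes through Accelerate, so it's not the exact same code path as calling cblas_sgemm directly, but it's a quick sanity check):

    import time
    import numpy as np

    n, repeats = 128, 10000
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    a @ b                                # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start

    # one n x n x n sgemm is ~2*n^3 flops
    gflops = 2 * n**3 * repeats / elapsed / 1e9
    print(f"{gflops:.1f} GFLOP/s")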
Has it been verified that they actually use these instructions in Accelerate.framework?
Yes. Aside from benchmarks, you can easily verify this by profiling an application with Instruments and then inspecting the disassembly.
However, it should be said that AMX does not scale linearly with the number of cores, but with the number of core clusters. So, on the M1 if you use Accelerate in two threads (rather than one), performance will barely improve, because the first thread can keep the AMX unit busy enough.
However, e.g. the M1 Pro and M1 Max have two performance core clusters with AMX units in them, so matrix multiplication performance roughly doubles compared to the M1. Similarly, the M1 Ultra has four performance core clusters, so matrix multiplication performance is roughly twice that of the M1 Pro/Max and four times that of the M1.
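A crude way to see the per-cluster behaviour yourself (assuming your numpy links against Accelerate; numpy releases the GIL during BLAS calls, so Python threads are enough for a rough test):

    import time
    import threading
    import numpy as np

    n = 1024
    mats = [np.random.rand(n, n).astype(np.float32) for _ in range(2)]

    def work(m, iters=200):
        for _ in range(iters):
            m @ m                      # each matmul is a BLAS/AMX call

    def timed(num_threads):
        threads = [threading.Thread(target=work, args=(mats[i % 2],))
                   for i in range(num_threads)]
        start = time.perf_counter()
        for t in threads: t.start()
        for t in threads: t.join()
        return time.perf_counter() - start

    # On an M1 (one P-core cluster), 2 threads take roughly 2x as long as 1,
    # i.e. no extra AMX throughput; on an M1 Pro/Max you'd expect closer to 1x.
    print(timed(1), timed(2))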
Hey, not related, but you mentioned using KVM to run arm64 macOS on Linux aarch64. I'd like to give this a shot but can't find a project for it. Would you mind sharing the deets?
Of course they do; Apple likes to remain as much in control as possible. If it suddenly becomes more efficient/faster to run ML/AI stuff on Asahi Linux on Mac hardware than with macOS, I'm sure they'd be embarrassed enough to take some sort of action. And I'm pretty sure that action will be towards the side of "closing things down" rather than "opening stuff up", as is tradition.