I think the author downplays the significance of his work because it only applies to "small neural networks". There are a lot of use cases that can benefit from this type of optimization. Discovering how to use an undocumented fast accelerator available on millions of devices is very valuable.
I get the value of the common APIs, but as a developer how do you deal with the wide range of performance in different form factors and product generations? Is there some way to gracefully adapt the same models to a specific device’s capabilities?
There are a bunch of easy ways to scale neural nets, quantization and distillation being the main approaches (or some combination of the two). Both typically require more training time, but not much more human effort.
You can normally expect to get way more than half the 'outcome' from a neural net with half the RAM/compute/time/power budget, so neural nets scale 'down' pretty well.
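For a concrete example of how little human effort the quantization side takes, here's a rough PyTorch sketch (the toy model and layer choices are just placeholders, not anything from the article):

    import torch
    import torch.nn as nn

    # toy "full-size" model standing in for whatever you actually trained
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # post-training dynamic quantization: weights stored as int8,
    # activations quantized on the fly -- roughly 4x smaller weights
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # the inference API is unchanged
    out = quantized(torch.randn(1, 512))

Distillation takes more work (you need a training loop with the big model as teacher), but it's the same idea: trade extra training time for a model that fits the smaller budget.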
I think you're right and you're wrong; it's a bit more complicated.
ML is one of the few applications that benefit from platform-specific optimizations, so if you need every ounce of performance, you have your choice of which walled garden you want to tether your application to. The "lock-in" comes from the specific capabilities of your special-purpose hardware, and for serious applications, you're already thinking hard about whether to design your entire implementation around Apple, NVidia, Google/TPU, or even Android devices. For big models, platform-specific needs influence every aspect of model design, including data/model sharding, quantization, training loops...
For non-scientific applications, it's usual practice to train your model in platform-agnostic ways using PyTorch or Tensorflow or whatever and then deploy it to devices in platform-specific ways, whether that's XLA, CoreML, Edge TPU, Android NNAPI, TensorflowJS, or hell, custom-written GLSL shaders or whatever.
We're just starting to see cross-platform frameworks that abstract model inference: TFLite, PyTorch Mobile, ONNX. To their credit, CoreML can act as a backend for any of these, so you don't even need to worry about your platform.
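As a sketch of that train-agnostic / deploy-specific flow for the Apple case (assuming coremltools and a traceable PyTorch model; exact arguments vary a bit between versions):

    import torch
    import torchvision
    import coremltools as ct

    # train (or load) a platform-agnostic PyTorch model...
    # (weights=None needs a recent torchvision; older ones use pretrained=False)
    model = torchvision.models.mobilenet_v2(weights=None).eval()

    # ...then trace it and hand it to the platform-specific converter
    example = torch.randn(1, 3, 224, 224)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=example.shape)],
        convert_to="mlprogram",
    )
    mlmodel.save("MobileNetV2.mlpackage")

The equivalent exists for every backend listed above; the model itself stays portable, only the last conversion step is platform-specific.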
Every platform is a golden cage in some respect. Ask any business that is stuck on ancient Win32 or even DOS applications, source code long gone. (Looking at you, my local McDonalds, Menards, Tractor Supply)…
The worst thing is, their users don’t even seem to be totally happy with the state of affairs! It’s like they don’t even realize their preferences are wrong. :(
That 285 is listed as (2:1 sparse) which means it's only valid for matrices where 2 out of every 4 numbers are zero. For dense matrices it's half that.
Are 2:1 sparse matrices a common thing? It seems weird, like clearly that’s not sparse enough to want to use, like, sparse matrix “CSR” style storage or something, haha. I would just treat it as dense I guess.
They aren't. As far as I can tell, Nvidia does this to be able to double the number of TFlops they put on their website. (This might be a little unfair; the real reason is that in ML it might be possible to train a NN such that your matrices have this structure, but I haven't seen anyone other than Nvidia use it.)
What you might do is train using dense matrices, then sparsify those (pick the 2 out of each set of 4 weights that are closest to zero, mask them out), then do a few more training iterations with the mask in place.
It turns out that even without the extra training iterations you often lose surprisingly little in terms of quality of output. In reality you can sparsify a lot more, but 2 out of 4 is so simple and easy to implement in hardware that more complex schemes are much harder to justify.
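In case it's not obvious what the masking step looks like, here's a rough sketch (my own code, not NVIDIA's tooling; it just zeroes the 2 smallest-magnitude weights in every group of 4):

    import torch

    def mask_2_of_4(w):
        """Zero the 2 smallest-magnitude weights in every group of 4 along
        the last dimension (assumes that dimension is divisible by 4)."""
        groups = w.reshape(-1, 4)
        # indices of the 2 largest |w| in each group of 4
        keep = groups.abs().topk(2, dim=1).indices
        mask = torch.zeros_like(groups)
        mask.scatter_(1, keep, 1.0)
        return (groups * mask).reshape(w.shape), mask.reshape(w.shape)

    w = torch.randn(8, 16)              # a dense weight matrix from training
    w_sparse, mask = mask_2_of_4(w)
    # then fine-tune for a few iterations, re-applying the mask after each
    # optimizer step, so the surviving weights compensate for the pruned ones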
However, small matmuls (say, <2048 bytes in the K dimension) won't get anywhere near 2x performance.
I’m trying to think of cases where it might accidentally come up, and all I can think of is something like “oops I used complex but my values are actually real.”
The 1.5 here is for a single core, though. So if we assume that a performance core on an M1 is around 7.5 watts (I'm not actually sure; it seems like a reasonable upper bound, though, if a whole M1 mini is around 39 watts), we'd be looking at around 750 watts to match. Which seems like a surprisingly non-crazy amount of power given these are 32-bit flops, unlike the 16-bit in the RTX 3090, and they come from a CPU.
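Rough back-of-the-envelope for where that ~750 W comes from (assuming the comparison point is the 3090's dense fp16 tensor throughput, i.e. half of the 285 sparse figure upthread, and the ~1.5 TFLOPS / ~7.5 W per-core guesses above):

    tflops_per_core = 1.5
    watts_per_core = 7.5
    rtx3090_dense_fp16 = 285 / 2                 # ~142.5 TFLOPS

    cores_needed = rtx3090_dense_fp16 / tflops_per_core   # ~95 cores
    print(cores_needed * watts_per_core)                   # ~710 W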
I tried gemm-benchmark on my M1 Max, and it took 22 W to hit 2.2 TFLOPS with AMX (Accelerate) or 36 W to hit 270 GFLOPS with NEON (OpenBLAS).
So that's actually just about as power-efficient for fp32 as a 3090, which according to Wikipedia is 35 TFLOPS in 350 W. Supposedly AMX can do a 2x rate for fp16, as opposed to the 3090's 4x rate, so maybe 2x less efficient than a 3090 for fp16.
Has it been verified that they actually use these instructions in Accelerate.framework? I just benchmarked this on my 2019 Intel i9 MBP and got the following speeds for 128x128 matrices, 32 repeats:
cblas_sgemm: 36 GFLOP/s
vDSP_mmul: 41 GFLOP/s
That's a pretty big deal if these functions are >30x faster on the M1...!
Edit: that seems to be verified in the tlkh.dev blog post above. Interestingly, I ran the same code on my bargain-basement 2020 iPhone SE and got 259 GFLOP/s! These Apple devices are pretty mind-blowing.
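If anyone wants to reproduce this without writing the C harness, here's roughly what I mean by GFLOP/s, as a Python sketch (numpy on macOS typically goes through Accelerate, so it's not the exact same code path as calling cblas_sgemm directly, but it's a quick sanity check):

    import time
    import numpy as np

    n, repeats = 128, 10000
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    a @ b                                # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start

    # one n x n x n sgemm is ~2*n^3 flops
    gflops = 2 * n**3 * repeats / elapsed / 1e9
    print(f"{gflops:.1f} GFLOP/s")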
Has it been verified that they actually use these instructions in Accelerate.framework?
Yes. Aside from benchmarks, you can easily verify this by profiling an application with Instruments and then inspecting the disassembly.
However, it should be said that AMX does not scale linearly with the number of cores, but with the number of core clusters. So, on the M1 if you use Accelerate in two threads (rather than one), performance will barely improve, because the first thread can keep the AMX unit busy enough.
However, e.g. the M1 Pro and M1 Max have two performance core clusters with AMX units in them, so matrix multiplication performance roughly doubles compared to the M1. Similarly, the M1 Ultra has four performance core clusters, so matrix multiplication performance is roughly twice that of the M1 Pro/Max and four times that of the M1.
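A crude way to see the per-cluster behaviour yourself (assuming your numpy links against Accelerate; numpy releases the GIL during BLAS calls, so Python threads are enough for a rough test):

    import time
    import threading
    import numpy as np

    n = 1024
    mats = [np.random.rand(n, n).astype(np.float32) for _ in range(2)]

    def work(m, iters=200):
        for _ in range(iters):
            m @ m                      # each matmul is a BLAS/AMX call

    def timed(num_threads):
        threads = [threading.Thread(target=work, args=(mats[i % 2],))
                   for i in range(num_threads)]
        start = time.perf_counter()
        for t in threads: t.start()
        for t in threads: t.join()
        return time.perf_counter() - start

    # On an M1 (one P-core cluster), 2 threads take roughly 2x as long as 1,
    # i.e. no extra AMX throughput; on an M1 Pro/Max you'd expect closer to 1x.
    print(timed(1), timed(2))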
Hey, not related, but you mentioned using KVM to run arm64 macOS on Linux aarch64. I'd like to give this a shot but can't find a project for it. Would you mind sharing the deets?
Of course they do; Apple likes to remain as much in control as possible. If it suddenly becomes more efficient/faster to run ML/AI stuff on Asahi Linux on Mac hardware than with macOS, I'm sure they'd be embarrassed enough to take some sort of action. And I'm pretty sure that action will be towards the side of "closing things down" rather than "opening stuff up", as is tradition.