
It's amazing to me that there are four separate pieces of hardware in M1 devices that can do matrix multiplies.

In addition to running on the CPU, M1 Max devices have three separate kinds of hardware-accelerated `gemm`: the GPU, the ANE (Apple Neural Engine), and this special matrix coprocessor. Here's a fairly detailed post that benchmarks each:

https://tlkh.dev/benchmarking-the-apple-m1-max

And here's a great post about the justification for having so much special-purpose hardware:

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...

As for the matrix coprocessor, Apple's built-in BLAS implementation (Accelerate.framework) uses it. You can link NumPy against Accelerate to benefit in your Python programs, for example. Here are some old instructions: https://gist.github.com/MarkDana/a9481b8134cf38a556cf23e1e81...
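
For anyone who wants to sanity-check whether their NumPy actually goes through Accelerate, a minimal check (not the full build instructions from the gist) is:

    # If the BLAS section of the output mentions "accelerate" or "vecLib",
    # large matmuls are dispatched to Apple's BLAS, which uses the AMX unit.
    import numpy as np
    np.show_config()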

All this represents yet another cycle on the Wheel of Reincarnation... (http://catb.org/jargon/html/W/wheel-of-reincarnation.html)



Isn't this wheel of reincarnation simply a result of a shifting bottleneck? A computation can be CPU-bound or memory-bound, and this can change over hardware generations.


Makes sense... We're also seeing energy efficiency, model size, and latency become significant constraints these days, and the more unique constraints an application has, perhaps the more beneficial it is to have many different implementations with different tradeoffs.


> energy efficiency (...) many different implementations

Yep, thermal throttling is a thing. Sometimes what you need is either useless silicon padding or some specialized, mostly-dark silicon, both to make the chip feasible to cool and to keep it from melting.


I suspect Apple was more worried about battery use in this case.


It is, but the fact that the bottleneck has shifted multiple times (as opposed to just this one recent time) is nonobvious (to someone unfamiliar with computing history) and worthy of pointing out.


Since there is no summary, these are the benchmark findings:

    AMX co-processor 2 TFLOPS FP32
    GPU 8 TFLOPS FP32
    Neural Engine 5.5 TFLOPS FP16


Note that AMX can achieve roughly double the FLOPS with FP16, and 8 TFLOPS for the GPU is only about 77% of peak. You can do better than that; with FP16 in particular, 90+% of peak is achievable (which is more than 9.4 TFLOPS).
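
For context on how figures like these are usually derived: an n x n GEMM does roughly 2*n^3 floating-point operations, so TFLOPS ~= 2*n^3 / (seconds * 1e12). A quick sketch that measures whatever BLAS NumPy is linked against (i.e. the AMX path if it's Accelerate), not the GPU or ANE:

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    # Take the best of a few runs to reduce warm-up noise.
    best = float("inf")
    for _ in range(5):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)

    print(f"~{2 * n**3 / best / 1e12:.2f} TFLOPS (FP32)")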


Is there any easy way to use all of these at the same time? I.e., is there some library you can ask to do a big matrix multiply, and it will load-balance between the different bits of hardware?

Or do you have to manually split the computation between them?


I’m by no means an expert in any of this. I mainly work on video processing using the GPU. That said, I would think if any library would do load balancing between them, it would likely be the Accelerate.framework that ships with the system.

However, I do have some experience with having the same code run on the GPU and the CPU. In my work, we have tried breaking images (usually frames of video) into various sized chunks and processing on both the CPU and GPU at the same time. Our conclusion is that the overhead of using both outweighs any benefit you’d get. The GPU is so much faster than the CPU, there’s no point in involving the CPU at all. These experiments were done several years ago, so perhaps the landscape has changed since then, but that was what we found.
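
To make the "manually split it yourself" option concrete, here's a rough sketch of that kind of chunking for a matmul: some rows of A go to the GPU (here via PyTorch's MPS backend, which is my assumption, not something Accelerate does for you) and the rest go to the CPU BLAS, then the results are stitched back together. As noted above, the copy/synchronization overhead often eats the benefit.

    import threading
    import numpy as np
    import torch

    def split_matmul(a, b, gpu_fraction=0.8):
        """Multiply a @ b, giving the top rows of `a` to the GPU and the rest to the CPU."""
        assert torch.backends.mps.is_available()
        split = int(a.shape[0] * gpu_fraction)
        out = np.empty((a.shape[0], b.shape[1]), dtype=np.float32)

        def gpu_part():
            dev = torch.device("mps")
            ga = torch.from_numpy(a[:split]).to(dev)
            gb = torch.from_numpy(b).to(dev)
            out[:split] = (ga @ gb).cpu().numpy()

        t = threading.Thread(target=gpu_part)
        t.start()
        out[split:] = a[split:] @ b   # CPU BLAS handles the remainder concurrently
        t.join()
        return out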


You might find David Wright's presentations about Unreal 5 interesting:

https://highperformancegraphics.org/slides22/Journey_to_Nani...

https://advances.realtimerendering.com/s2022/index.html#Lume...

They're great presentations with a lot of depth in the notes. I think videos are around somewhere if you prefer that.

Two specifics I'd mention:

It seems a lot of games now use feedback between frames as a way to tolerate the latency of moving data between the CPU and GPU. E.g., the CPU will use GPU-crunched data from the previous frame as a source for CPU crunching that optimizes what data gets passed to the GPU next.

The other is that fixed functionality is moving into shaders. Unreal 5 uses a mix of hardware rasterization and software rasterization in a shader (and path tracing now as well). There the tradeoff between the two is triangle size in pixels.


Oh wow! Thanks! That looks really cool.


They're great. I dunno if you find 3d interesting vs video, but the section in that nanite presentation where he goes through how he arrived at the LoD clustering design is some of the smartest stuff I've ever seen any developer say, ever. Like John Carmack probably saw this and went "dang, wish I'd thought of that" levels of smart.


So why would you choose to use the Neural Engine rather than the GPU?

Just power efficiency?


That and if you want to use the GPU at the same time.
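
For what it's worth, you don't target the ANE directly; with Core ML you can only hint at it via compute units. A minimal sketch using coremltools (the model path is a placeholder, and CPU_AND_NE assumes a reasonably recent coremltools/macOS):

    import coremltools as ct

    # Prefer the Neural Engine (with CPU fallback), leaving the GPU free
    # for your own Metal work:
    model = ct.models.MLModel("MyModel.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)

    # Or let the system schedule across CPU, GPU, and ANE:
    # model = ct.models.MLModel("MyModel.mlpackage",
    #                           compute_units=ct.ComputeUnit.ALL)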


> And here's a great post about the justification for having so much special-purpose hardware:

OK, but it doesn't actually justify why both AMX and the ANE exist. It makes a kind of vague, handwavy claim that "AMX latency is better [1] and that's useful [2]".

1: But the latency isn't measured, so it's not actually known, and there's a note that the post's understanding of AMX has been called out as incorrect. So is this point even still accurate?

2: And this isn't elaborated on in the slightest, nor is there any comparison of test workloads.

So why do both AMX and the ANE exist? Did the CPU team build AMX before the ANE team showed up with something bigger and better? Are they actually used for different workloads, or simultaneously?


> All this represents yet another cycle on the Wheel of Reincarnation...

Isn't this adding new cores directly onto the main chip? That doesn't sound like it fits to me.

And at this point GPUs have been straddling both sides of the divide for decades, depending on the particular device form factor and the necessary power.

The only thing I would actually say has gone through a cycle lately is the crypto accelerator for mac SSDs.


> Isn't this adding new cores directly onto the main chip? That doesn't sound like it fits to me.

These are coprocessors, which are a very different thing from just another CPU core. For one, they use a different architecture (instruction set, registers/memory, etc.).

The "wheel of reincarnation" refers to features/capabilities on coprocessors eventually being folded into the main CPU. While CPUs have adopted insights from GPU implementations, GPU functionality has never been fully folded into CPUs (software rasterizers don't count).


> These are coprocessors, which are a very different thing from just another CPU core.

Well that's why I didn't say "just another CPU core". But fine, I don't want to argue semantics.

> The "wheel of reincarnation" refers to features/capabilities on coprocessors eventually being folded into the main CPU.

Then that's definitely not happening here, and hasn't happened to x86/arm since they gained floating point, right?


There's also the media encoder hardware accelerator, which isn't quite `gemm`, but certainly contains hardware that performs `mm`s.



