
I think it is important to note that even though it is double-pumped, using 512-bit registers puts less pressure on the decoder and makes it easier to keep the pipelines full. So use 512-bit if you can.
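Roughly what I mean, as a minimal sketch (C intrinsics; function names are illustrative, tail handling and alignment are omitted, and n is assumed a multiple of the vector width). The 512-bit version pushes half as many instructions through the front end for the same work:

    #include <immintrin.h>
    #include <stddef.h>

    /* Same elementwise add at both widths; on a double-pumped core the
     * 512-bit loop still retires the same work with half the instructions
     * going through decode. */
    static void add256(float *d, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 8)       /* 8 floats per instruction */
            _mm256_storeu_ps(d + i,
                _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    }

    static void add512(float *d, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 16)      /* 16 floats per instruction */
            _mm512_storeu_ps(d + i,
                _mm512_add_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i)));
    }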


It should also be noted that the belief that Zen 4 is "double-pumped" while the Intel CPUs are not is completely misleading.

On most Intel CPUs with AVX-512 support, there are 2 classes of 512-bit instructions: the first class is executed by combining a pair of 256-bit units, so the data throughput of 512-bit instructions equals that of 256-bit instructions; the second class is executed by combining a pair of 256-bit execution units and also by extending another 256-bit execution unit to 512 bits.

For the second class of instructions the Intel CPUs have a throughput of two 512-bit instructions per cycle vs. three 256-bit instructions per cycle.
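You can see this ratio in a rough microbenchmark along these lines (a sketch, not a rigorous tool: __rdtsc counts reference cycles rather than core cycles, the constants are arbitrary, and you would build with something like gcc -O2 -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void) {
        /* 8 independent dependency chains, so the loop measures add
         * throughput rather than latency. Swapping __m512i for __m256i
         * (and the intrinsics for their _mm256 versions) shows the
         * 2-vs-3 per-cycle ratio described above. */
        __m512i acc[8], inc = _mm512_set1_epi32(1);
        for (int j = 0; j < 8; j++) acc[j] = _mm512_set1_epi32(j);

        const int64_t iters = 100000000;
        uint64_t t0 = __rdtsc();
        for (int64_t i = 0; i < iters; i++)
            for (int j = 0; j < 8; j++)
                acc[j] = _mm512_add_epi32(acc[j], inc);
        uint64_t t1 = __rdtsc();

        /* Fold the accumulators into a sink so the compiler keeps the loop. */
        __m512i all = acc[0];
        for (int j = 1; j < 8; j++) all = _mm512_xor_si512(all, acc[j]);
        int32_t sink[16];
        _mm512_storeu_si512(sink, all);

        printf("~%.2f adds/ref-cycle (sink=%d)\n",
               8.0 * (double)iters / (double)(t1 - t0), sink[0]);
        return 0;
    }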

Zen 4, while having the same data throughput as Zen 3 (two 512-bit instructions per cycle vs. four 256-bit instructions per cycle in Zen 3), either matches or exceeds the throughput of the cheaper Intel CPUs with AVX-512. Moreover, Zen 4 allows 1 FMA + 1 FADD per cycle, while on those Intel CPUs only 1 FMA per cycle can be executed.
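As a sketch of what the extra FADD pipe buys (function name illustrative; tails omitted, n assumed a multiple of 8 doubles, and a real benchmark would use more accumulator chains to hide latency): each iteration issues one FMA plus one independent FP add, which Zen 4 can send to different pipes in the same cycle.

    #include <immintrin.h>
    #include <stddef.h>

    static double fma_plus_add(const double *x, const double *y, size_t n) {
        __m512d acc_fma = _mm512_setzero_pd();
        __m512d acc_add = _mm512_setzero_pd();
        for (size_t i = 0; i < n; i += 8) {
            __m512d a = _mm512_loadu_pd(x + i);
            __m512d b = _mm512_loadu_pd(y + i);
            acc_fma = _mm512_fmadd_pd(a, b, acc_fma);  /* FMA pipe  */
            acc_add = _mm512_add_pd(acc_add, a);       /* FADD pipe */
        }
        /* Horizontal reduction via a store, for portability. */
        double tmp[8], sum = 0.0;
        _mm512_storeu_pd(tmp, _mm512_add_pd(acc_fma, acc_add));
        for (int j = 0; j < 8; j++) sum += tmp[j];
        return sum;
    }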

The only important advantage of Intel appears in the most expensive models of the server and workstation CPUs, i.e. in most Xeon Gold, all Xeon Platinum and all of the Xeon W models that have AVX-512 support.

In these more expensive models, there is a second 512-bit FMA unit, which enables double the FMA throughput compared to Zen 4. These models are also helped by double the throughput for loads from the L1 cache, which is matched to the FMA throughput.
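For scale, a dot-product-style kernel sketch (function name illustrative; n assumed a multiple of 64 doubles). With ~4-cycle FMA latency and two FMA pipes, about 8 independent accumulators are needed to keep both units busy. Note also that a dot product needs two loads per FMA, which is exactly where the doubled L1 load throughput comes in; kernels with more register reuse (matrix-multiply micro-kernels) are the ones that really saturate both FMA units.

    #include <immintrin.h>
    #include <stddef.h>

    static double dot512(const double *x, const double *y, size_t n) {
        /* 8 independent accumulator chains hide FMA latency. */
        __m512d acc[8];
        for (int j = 0; j < 8; j++) acc[j] = _mm512_setzero_pd();

        for (size_t i = 0; i < n; i += 64)
            for (int j = 0; j < 8; j++)
                acc[j] = _mm512_fmadd_pd(_mm512_loadu_pd(x + i + 8 * j),
                                         _mm512_loadu_pd(y + i + 8 * j),
                                         acc[j]);

        for (int j = 1; j < 8; j++) acc[0] = _mm512_add_pd(acc[0], acc[j]);
        double tmp[8], sum = 0.0;
        _mm512_storeu_pd(tmp, acc[0]);
        for (int j = 0; j < 8; j++) sum += tmp[j];
        return sum;
    }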

So the AVX-512 implementation in Zen 4 is superior to that in the cheaper CPUs like Tiger Lake, even without taking into account the few new execution units added in Zen 4, like the 512-bit shuffle unit.

Only the Xeon Platinum and similar models of the future Sapphire Rapids will have a definitely greater throughput for floating-point operations than Zen 4, but they will also have a significantly lower all-core frequency (due to the inferior manufacturing process), so the higher throughput per clock cycle is not certain to overcome the deficit in clock frequency.


Yes, Intel also takes a less than "full" approach to moving from 256b to 512b.

Though I think it is fair to say the Intel implementation represents an intermediate state between the AMD approach (essentially no increase in execution or datapath resources outside of the shuffle unit) and a full doubling of every resource by simply extending everything 2x.

Essentially, on SKX the chip behaves as if it had 2 full-width 512-bit execution ports: p01 (via fusion) and p5. For 256b there are three ports. Not all ports can do everything, so the comparison is sometimes 3 vs 2 or 2 vs 1, but also sometimes 2 vs 2 (FMA operations on 2-FMA chips come to mind).

Critically, however, the load and store units were also extended to 512 bits: SKX can do 2x loads (1024 bits) and 1x store (512 bits) per cycle. Load/store throughput puts a hard cap on the performance of load- and store-heavy AVX methods, which includes some fairly trivial but important integer-operation loops like memcpy, memset and memchr type stuff, which is fast enough to hit the load or store limits.
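For concreteness, a bare-bones 512-bit copy loop (a sketch, not a production memcpy: no alignment handling, no non-temporal stores, no overlap support; the name is illustrative). Each iteration is one load plus one store, so on SKX the single 512-bit store port caps it at 64 bytes stored per cycle no matter how wide the compute side is.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static void copy512(void *dst, const void *src, size_t n) {
        uint8_t *d = dst;
        const uint8_t *s = src;
        size_t i = 0;
        for (; i + 64 <= n; i += 64)            /* 64 bytes per iteration */
            _mm512_storeu_si512(d + i, _mm512_loadu_si512(s + i));
        for (; i < n; i++)                      /* byte tail */
            d[i] = s[i];
    }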


Neat! So does that mean a single thread can hit the membw limit using AVX512 for memcpy operations?


Maxing out memBW requires multiple threads because Intel cores are relatively limited in line fill buffers. I've seen around 12 GB/s per SKX core with AVX-512.


You usually don't even need AVX512 to sustain enough loads/stores at the core to max out memory bandwidth "in theory": even with 256-bit loads and assuming 2/1 loads/stores per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s of write bandwidth (or both, at once!) at 4 GHz.
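The arithmetic behind those numbers, spelled out (constants from this comment, not universal):

    #include <stdio.h>

    /* Peak L1 bandwidth = ports x access width x clock.
     * Here: 32-byte (256-bit) accesses, 2 loads + 1 store per cycle, 4 GHz. */
    static double peak_gbs(int ports, int bytes_per_access, double ghz) {
        return ports * bytes_per_access * ghz;   /* bytes/ns == GB/s */
    }

    int main(void) {
        printf("read : %.0f GB/s\n", peak_gbs(2, 32, 4.0));  /* 256 */
        printf("write: %.0f GB/s\n", peak_gbs(1, 32, 4.0));  /* 128 */
        return 0;
    }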

Indeed, you can reach these numbers if you always hit in L1 and come close if you always hit in L2. The load number especially is higher than almost any single socket bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.

However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses and 100 ns latency per cache line you can only demand-fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers from the L2 than from the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.
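That bound is just Little's law applied to memory; plugging in the numbers from above:

    #include <stdio.h>

    /* Sustained demand bandwidth <= outstanding misses x line size / latency.
     * 16 misses x 64 B / 100 ns = ~10.2 GB/s. */
    int main(void) {
        double misses = 16, line_bytes = 64, latency_ns = 100;
        printf("~%.1f GB/s\n", misses * line_bytes / latency_ns);
        return 0;
    }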

This isn't a fundamental limitation which applies to every CPU however: Apple chips can extract the entire bandwidth from a single core, despite having much smaller 128-bit (perhaps 256-bit if you consider load pair) load and store instructions.

---

* Not really sure about this one: are there 16-channel DDR5 setups out there yet? (16 DDR5 channels correspond to 8 independent DIMMs, so that is similar to an 8-channel DDR4 setup, as DDR5 has 2x channels per DIMM.)


Yeah, the claim was that this is why it hits higher clock speeds. The front end will be hard-pressed to hit/maintain 4 IPC, while 2 IPC is much easier.



