> Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a tim...

celrod · on May 12, 2021

For these instructions, p5 can do 512-bit operations but not 256-bit ones. E.g., look at the Skylake-AVX512 and Cascadelake ports for vaddpd (the Xeon benched in the blog post was Cascadelake):

https://uops.info/html-instr/VADDPD_YMM_YMM_YMM.html

Here is 256 bit VPMULUDQ: https://uops.info/html-instr/VPMULUDQ_YMM_YMM_YMM.html

Here is 512 bit VPMULUDQ: https://uops.info/html-instr/VPMULUDQ_ZMM_ZMM_ZMM.html

The 256-bit and 512-bit versions both have a reciprocal throughput of 0.5 cycles per instruction, using p01 for 256-bit and p05 for 512-bit (where, as you note, p0 for 512-bit really means both p0 and p1 together).

So, given the same clock speed, this multiplication should have twice the throughput with 512-bit vectors as with 256-bit vectors. That isn't true for CPUs without the 512-bit unit on p5, like icelake-client, tigerlake, and rocketlake, but it should be true for the Xeon ridiculousfish benchmarked on.
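
To spell out the arithmetic behind "twice the throughput", here is a rough sketch in Python. The helper name `products_per_cycle` is made up for illustration. It assumes the 0.5 cycles per instruction figure quoted above holds for both widths, and it ignores clock-speed differences and memory bottlenecks.

```python
def products_per_cycle(vector_bits, recip_throughput_cycles):
    """Peak 64-bit products per cycle for VPMULUDQ at a given vector width."""
    # VPMULUDQ multiplies the low 32 bits of each 64-bit element,
    # so one instruction yields one 64-bit product per 64-bit element.
    products_per_instruction = vector_bits // 64
    # A reciprocal throughput of 0.5 cycles means 2 instructions per cycle.
    instructions_per_cycle = 1 / recip_throughput_cycles
    return products_per_instruction * instructions_per_cycle


ymm = products_per_cycle(256, 0.5)  # 256-bit version on p01
zmm = products_per_cycle(512, 0.5)  # 512-bit version on p05
print(ymm, zmm, zmm / ymm)  # 8.0 16.0 2.0
```

This is a peak figure only: both widths issue two instructions per cycle, and the 512-bit one just does twice the work per instruction.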