> Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a time, losing half its parallelism.
Sort of. In Skylake, AVX512 fuses the 256-bit p0 and p1 together to execute one 512-bit µop, and p5 becomes 512 bits wide. So theoretically you get 2x 512-bit pipelines versus AVX2's 3x 256-bit pipelines (two of which can do multiplies).
Unfortunately, p5 doesn't support integer multiplies, even in SKUs where p5 does support 512-bit floating-point multiplies. So AVX512 has no additional throughput for integer multiplies on current implementations.
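To make the arithmetic behind that claim concrete, here is a rough back-of-the-envelope sketch. The port counts are taken straight from the description above, not from measurements, and `peak_bits_per_cycle` is just a hypothetical helper name:

```python
# Peak integer-multiply bandwidth under the port model described above.
# Assumption: one µop per port per cycle, no other bottlenecks.
def peak_bits_per_cycle(vector_bits, multiply_ports):
    return vector_bits * multiply_ports

# AVX2: p0 and p1 each handle a 256-bit multiply.
avx2_mul = peak_bits_per_cycle(256, 2)

# AVX512: p0 and p1 fuse into one 512-bit port; p5 is excluded
# because, per the claim above, it cannot do integer multiplies.
avx512_mul = peak_bits_per_cycle(512, 1)

print(avx2_mul, avx512_mul)
```

Under those assumptions both come out to the same number of bits per cycle, which is the "no additional throughput" point.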
p5 can do 512-bit operations, but not 256-bit ones. For example, look at the Skylake-AVX512 and Cascade Lake ports for vaddpd (the Xeon benchmarked in the blog post was Cascade Lake):
The 256-bit and 512-bit versions both have a reciprocal throughput of 0.5 cycles/op, using p01 for 256-bit and p05 for 512-bit (where, as you note, p0 for 512-bit really means both p0 and p1).
So, given the same clock speed, this multiplication should have twice the throughput with 512-bit vectors as with 256-bit vectors.
This isn't true for CPUs without p5, like Ice Lake client, Tiger Lake, and Rocket Lake, but it should be true for the Xeon ridiculousfish benchmarked on.
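As a quick sanity check of the "twice the throughput" reasoning, here is a small sketch that converts reciprocal throughput into bits processed per cycle. The 0.5 cycles/op figures are the ones quoted above; `bits_per_cycle` is a hypothetical helper, and equal clock speed for both widths is assumed:

```python
# Convert reciprocal throughput (cycles per instruction) into
# vector bits processed per cycle.
def bits_per_cycle(vector_bits, reciprocal_throughput):
    return vector_bits / reciprocal_throughput

ymm = bits_per_cycle(256, 0.5)  # 256-bit version, issued on p01
zmm = bits_per_cycle(512, 0.5)  # 512-bit version, issued on p05

# Ratio of 512-bit to 256-bit bandwidth at the same clock speed.
speedup = zmm / ymm
print(ymm, zmm, speedup)
```

In practice 512-bit code often runs at a lower clock, so the real ratio would be smaller than this idealized figure.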