Hacker News

Something changed with Haswell that allows this code to run faster than the purpose-designed instruction.

The change is primarily that AVX2 (which Haswell is the first generation to support) extended binary and integer vector operations to 32B registers, while AVX only supported 32B floating point operations. Instructions that previously operated on 16B doubled their throughput by handling 32B with (usually) the same latency: https://software.intel.com/sites/landingpage/IntrinsicsGuide...



I thought Agner had shown that Haswell's AVX2 hardware didn't necessarily bother powering up the entire 256-bit execution unit unless it really seemed warranted, preferring instead to issue the 128-bit operation twice and combine the halves. For example, see the later comments on http://www.agner.org/optimize/blog/read.php?i=142


There's a real effect here, but in practice I haven't found it to be an issue. As long as your instruction mix includes at least one 256-bit operation in the last few million instructions, the slowdown doesn't happen. I'm sure you could construct a case where it would be a problem, but throwing in an occasional otherwise-unused VPXOR solves it easily enough.

One thing that can be an issue is alignment. Unless your reads are 32B aligned, you will be limited to reading 40B per cycle. A single unaligned vector read per cycle isn't a problem, but full utilization of the increased throughput that 'stephencanon' mentions in the sibling comment is only possible if both loads per cycle are 32B aligned: http://www.agner.org/optimize/blog/read.php?i=415#423


The other critical piece was that Haswell doubled load/store throughput to L1 cache, not just register widths. (I know that you know this, just want to make it explicit).



