Hacker News

Something changed with Haswell that allows this code to run faster than the purpose-designed instruction.

The change is primarily that AVX2 (which Haswell is the first generation to support) extended binary and integer vector operations to 32B registers, while AVX only supported 32B floating point operations. Instructions that previously operated on 16B doubled their throughput by handling 32B with (usually) the same latency: https://software.intel.com/sites/landingpage/IntrinsicsGuide...



I thought Agner had shown that Haswell's AVX2 hardware didn't necessarily bother powering up the entire 256-bit execution unit unless it really seemed warranted, preferring instead to issue the 128-bit operation twice and combine the halves. For example, see the later comments on http://www.agner.org/optimize/blog/read.php?i=142


There's a real effect here, but in practice I haven't found it to be an issue. As long as your instruction mix includes at least one 256-bit operation in the last few million instructions, the slowdown doesn't happen. I'm sure you could construct a case where it would be a problem, but throwing in an occasional otherwise-unused VPXOR solves it easily enough.

One thing that can be an issue is alignment. Unless your reads are 32B aligned, you will be limited to reading 40B per cycle. A single unaligned vector read per cycle isn't a problem, but full utilization of the increased throughput that 'stephencanon' mentions in the sibling comment is only possible if both loads per cycle are 32B aligned: http://www.agner.org/optimize/blog/read.php?i=415#423


The other critical piece was that Haswell doubled load/store throughput to L1 cache, not just register widths. (I know that you know this, just want to make it explicit).



