
There's no need to speculate; this stuff is all in Intel's manuals. On Haswell, all shuffles go to port 5, with single-cycle latency and a throughput of one per cycle. Nehalem through Ivy Bridge could actually execute two PSHUFBs per cycle; Haswell halved that throughput while adding a 256b-wide version, so the total work per cycle (2 x 128b before, 1 x 256b now) stays constant if you adopt the new instruction.

The SSSE3 popcount implementation was never bottlenecked on PSHUFB[1]. The speedup is because Haswell is a physically wider machine (it has more execution ports) and can execute more uops each cycle.

[1] Except on Merom, where PSHUFB was cracked into 4 or 5 uops IIRC, but that's a ten-year-old part now.
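
For anyone who hasn't seen the trick, here is a rough scalar sketch in C of the nibble-lookup idea that PSHUFB-based popcounts are usually built on. It is not the code being benchmarked, and it uses no intrinsics: pshufb_model, popcount16_model and NIBBLE_POPCOUNT are names made up for illustration, and pshufb_model only imitates what a 128b PSHUFB does per byte lane. The point is to show that each 16-byte block needs two table lookups (low and high nibbles) plus some masking, shifting and adding around them.

```c
#include <stdint.h>

/* Bit counts for every 4-bit value. In a SIMD version this table
   would sit in a single 16-byte register. */
static const uint8_t NIBBLE_POPCOUNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Hypothetical scalar model of a 128b PSHUFB: each output byte is
   table[index & 0x0F], or zero when the index byte has its top bit set. */
static void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                         uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Count the set bits in one 16-byte block using two table lookups. */
static unsigned popcount16_model(const uint8_t block[16])
{
    uint8_t lo[16], hi[16], cnt_lo[16], cnt_hi[16];

    /* Split every byte into its low and high nibble. */
    for (int i = 0; i < 16; i++) {
        lo[i] = block[i] & 0x0F;
        hi[i] = block[i] >> 4;
    }

    /* One "shuffle" per nibble half looks up 16 counts at once. */
    pshufb_model(NIBBLE_POPCOUNT, lo, cnt_lo);
    pshufb_model(NIBBLE_POPCOUNT, hi, cnt_hi);

    /* Horizontal sum of the per-byte counts. */
    unsigned total = 0;
    for (int i = 0; i < 16; i++)
        total += (unsigned)cnt_lo[i] + cnt_hi[i];
    return total;
}
```

In a real vector version the masking, shifting and adding are separate uops that can issue on other ports, which is why the shuffle port alone need not be the limiting factor.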



Too late to edit, but I mangled the last sentence of this comment; it should instead be something like "The speedup is because Haswell has wider vector instructions and more execution ports (it can execute more uops each cycle)."



