
stephencanon on March 13, 2016 | on: AVX2 faster than native popcnt instruction on Hasw...

There's no need to speculate; this stuff is all in Intel's manuals. All shuffles are port 5 on Haswell, and they're single-cycle latency / single-cycle throughput. Actually, Nehalem through Ivy Bridge could execute two PSHUFBs per cycle; Haswell reduced the throughput to one per cycle while adding a 256-bit-wide version, so the total work per cycle remains constant if you adopt the new instruction.

The SSSE3 popcount implementation was never bottlenecked on PSHUFB[1]. The speedup is because Haswell is a physically wider machine (it has more execution ports) and can execute more uops each cycle.

[1] Except on Merom, where PSHUFB was cracked to 4 or 5 uops IIRC, but that's a ten-year-old part now.

stephencanon on March 13, 2016

Too late to edit, but I mangled the last sentence of this comment; it should instead be something like "The speedup is because Haswell has wider vector instructions and more execution ports (it can execute more uops each cycle)."
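
For readers who haven't seen the PSHUFB popcount being discussed, here is a rough scalar model of the idea in portable C. It is only a sketch: `model_pshufb`, `popcount_block16`, and `popcount_bytes` are hypothetical names, and the loops stand in for what a real SSSE3 version would do with single instructions (typically `_mm_shuffle_epi8` for the lookups plus a horizontal add). Each byte is split into two nibbles, and each nibble indexes a 16-entry table of bit counts, which is exactly the lookup PSHUFB performs on 16 lanes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit counts for every 4-bit value; this is the table PSHUFB would index. */
static const uint8_t NIBBLE_COUNTS[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model of a 128-bit PSHUFB: each output lane is the table entry
   selected by the low 4 bits of the index, or 0 if the index's high bit
   is set. */
static void model_pshufb(const uint8_t table[16], const uint8_t idx[16],
                         uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
    }
}

/* Count the set bits in one 16-byte block using two table lookups,
   one for the low nibbles and one for the high nibbles. */
static unsigned popcount_block16(const uint8_t block[16])
{
    uint8_t lo[16], hi[16], lo_counts[16], hi_counts[16];

    for (int i = 0; i < 16; i++) {
        lo[i] = block[i] & 0x0F;
        hi[i] = block[i] >> 4;
    }

    model_pshufb(NIBBLE_COUNTS, lo, lo_counts);
    model_pshufb(NIBBLE_COUNTS, hi, hi_counts);

    unsigned total = 0;
    for (int i = 0; i < 16; i++) {
        total += lo_counts[i] + hi_counts[i];
    }
    return total;
}

/* Count set bits in a buffer; this sketch only handles whole 16-byte blocks. */
unsigned long popcount_bytes(const uint8_t *data, size_t len)
{
    assert(len % 16 == 0);

    unsigned long total = 0;
    for (size_t off = 0; off < len; off += 16) {
        total += popcount_block16(data + off);
    }
    return total;
}
```

Per 16-byte block this is two shuffles surrounded by masks, shifts, and adds, so the shuffle port is only one of several resources in play, consistent with the point above that the SSSE3 version was not bottlenecked on PSHUFB.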
