There's no need to speculate; this stuff is all in Intel's manuals. All shuffles are port 5 on Haswell, and they're single-cycle latency / single-cycle throughput. Actually, Nehalem through Ivy Bridge could execute two PSHUFBs per cycle; Haswell halved that throughput while adding a 256b-wide version, so the total work per cycle remains constant if you adopt the new instruction.
The SSSE3 popcount implementation was never bottlenecked on PSHUFB[1]. The speedup is because Haswell is a physically wider machine (it has more execution ports) and can execute more uops each cycle.
[1] Except on Merom, where PSHUFB was cracked to 4 or 5 uops IIRC, but that's a ten year old part now.
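For anyone who hasn't seen the trick, here is a rough scalar model in C of what the PSHUFB popcount does to one 16-byte block. It is only a sketch: the names (shuffle16, popcount_block16, NIBBLE_POPCNT) are made up for illustration, and real code would use the SSSE3 intrinsics rather than loops. The point is that the two shuffles sit alongside masks, a shift, and adds, so they are only part of the uop count per block.

```c
#include <stdint.h>

/* Bit counts for every 4-bit value. In the SIMD version this is the
 * 16-byte register that PSHUFB uses as its lookup table. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model of a 128-bit PSHUFB: each output byte is table[idx & 15],
 * or zero when the index byte has its top bit set. */
static void shuffle16(const uint8_t table[16], const uint8_t idx[16],
                      uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Population count of one 16-byte block using two nibble lookups. */
static unsigned popcount_block16(const uint8_t block[16])
{
    uint8_t lo[16], hi[16], cnt_lo[16], cnt_hi[16];
    unsigned total = 0;

    /* Split every byte into its low and high nibble (mask and shift). */
    for (int i = 0; i < 16; i++) {
        lo[i] = block[i] & 0x0F;
        hi[i] = block[i] >> 4;
    }

    /* Two shuffles: one table lookup per nibble vector. */
    shuffle16(NIBBLE_POPCNT, lo, cnt_lo);
    shuffle16(NIBBLE_POPCNT, hi, cnt_hi);

    /* Add the per-byte counts and sum them horizontally. */
    for (int i = 0; i < 16; i++)
        total += (unsigned)cnt_lo[i] + cnt_hi[i];

    return total;
}
```

Calling popcount_block16 on each 16-byte chunk of a buffer and summing the results gives the total bit count. A vectorized version would typically keep per-byte accumulators across several blocks and only do the horizontal sum occasionally.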
Too late to edit, but I mangled the last sentence of this comment; it should instead be something like "The speedup is because Haswell has wider vector instructions and more execution ports (it can execute more uops each cycle)."