Some information from Anandtech's deep dive into Apple's "big" Firestorm core.
>On the Integer side, whose in-flight instruction and renaming physical register file capacity we estimate at around 354 entries, we find at least 7 execution ports for actual arithmetic operations. These include 4 simple ALUs capable of ADD instructions, 2 complex units which also feature MUL (multiply) capabilities, and what appears to be a dedicated integer division unit. The core is able to handle 2 branches per cycle, which I think is enabled also by one or two dedicated branch forwarding ports, but I wasn’t able to 100% confirm the layout of the design here.
On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they feature a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operation throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle, with 3 and 4 cycle latencies respectively. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, albeit still running at a lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).
Vector abilities of the 4 pipelines seem to be identical, with the only instructions that see lower throughput being FP divisions, reciprocals, and square-root operations, which have a throughput of only 1, on one of the four pipes.
> This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).
Reminder that browsers try to avoid using doubles for the Number type, preferring integers with overflow checks. Much of layout uses fixed point for subpixels, too. Using doubles all the time would be a notable perf regression.
Where are you getting that? I thought Intel was at 180 physical integer registers for the same core microarchitecture shared by both desktops and servers.
If you have a source I'm happy to read it but otherwise I think you're confused. Especially about Intel client and server cores having different numbers of registers. The lowest level difference between them I've heard of that wasn't features being fused off is different L3 cache sizes.
One of the reasons Apple does so well in browser tests is that ARM now has instructions that increase the performance and decrease the power draw of JavaScript operations.
It’s simply matching the x86 float-to-int conversion, because JS specifies that behavior in the spec. All this instruction does is level the playing field; it isn’t some magic instruction that does more than x86 does.
At a logic level there are no changes to the expensive part of rounding, only changes to the overflow values in the result.
WebKit measured this to be an improvement of less than 2%. So it is certainly "one of the reasons", but certainly not a driving one. (Plus, it's ARMv8.3+.)
JSC didn't even use that instruction when most of the benchmarking was done. It has absolutely nothing to do with it. The idea was floated or amplified by Gruber / Daring Fireball, and I believe he never went back to correct it even after the facts were shown.
Much like his claim that the $149 AirPods were sold close to BOM cost. And that is how the whole world went on to believe all the wrong information.
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...