
Some information from Anandtech's deep dive into Apple's "big" Firestorm core.

>On the Integer side, whose in-flight instructions and renaming physical register file capacity we estimate at around 354 entries, we find at least 7 execution ports for actual arithmetic operations. These include 4 simple ALUs capable of ADD instructions, 2 complex units which feature also MUL (multiply) capabilities, and what appears to be a dedicated integer division unit. The core is able to handle 2 branches per cycle, which I think is enabled by also one or two dedicated branch forwarding ports, but I wasn’t able to 100% confirm the layout of the design here.

>On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they feature a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operation throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, while, of course, still running at a lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).

>Vector abilities of the 4 pipelines seem to be identical, with the only instructions seeing lower throughput being FP divisions, reciprocals and square-root operations, which have a throughput of 1, on one of the four pipes.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
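The latency vs. throughput distinction in the quoted numbers matters for ordinary code: a chain of dependent FADDs is bound by the 3-cycle latency, while independent operations can fill all four pipelines. A minimal C sketch of the idea (illustrative only; actually measuring this requires cycle counters and careful control of compiler optimizations):

```c
/* A dependent FADD chain: each add waits on the previous result, so
   it is limited by FADD latency no matter how many pipelines exist. */
double dependent_sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];                      /* serial dependency chain */
    return s;
}

/* Four independent accumulators: the adds have no dependency on each
   other, so an out-of-order core can issue them to separate pipes. */
double independent_sum(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                  /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

On a core with four FP pipes, the second loop can in principle retire up to four adds per cycle (if the compiler keeps the accumulators in registers), while the first is stuck at one add per three cycles regardless of pipeline count.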



> This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).

Reminder that browsers try to avoid using doubles for the Number type, preferring integers with overflow checks. Much of layout uses fixed point for subpixels, too. Using doubles all the time would be a notable perf regression.
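For anyone curious what "preferring integers" looks like under the hood: engines commonly tag small integers directly in the value word, so Number stays on an integer fast path until an operation overflows. A hedged C sketch of that tagging trick (the names and the 31-bit layout here are illustrative, not any particular engine's actual representation):

```c
#include <stdint.h>

/* "Small integer" (Smi) tagging sketch: a value whose low bit is 0
   stores a 31-bit integer inline; a low bit of 1 would mark a heap
   pointer (e.g. to a boxed double). */
typedef intptr_t js_value;

static int is_smi(js_value v)           { return (v & 1) == 0; }
static js_value smi_from_int(int32_t i) { return (js_value)i * 2; }
static int32_t smi_to_int(js_value v)   { return (int32_t)(v / 2); }

/* Addition stays on the integer fast path until the result leaves the
   31-bit tagged range; a real engine would then fall back to doubles.
   Returns 1 on success, 0 on overflow. */
static int smi_add(js_value a, js_value b, js_value *out) {
    intptr_t r = (intptr_t)smi_to_int(a) + smi_to_int(b);
    if (r > INT32_MAX / 2 || r < INT32_MIN / 2)
        return 0;                       /* overflow: deoptimize */
    *out = r * 2;
    return 1;
}
```

The overflow check is the cost the parent comment alludes to: the integer path is cheap, but every arithmetic op carries a branch guarding the fallback.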


> physical register file capacity we estimate at around 354 entries

That's actually less than most desktop CPUs these days, and much less than Xeons.


Where are you getting that? I thought Intel was at 180 physical integer registers for the same core microarchitecture shared by both desktops and servers.


The number of "hidden" registers for register renaming is a few times that number.


The last I heard about the number of physical integer registers changing at Intel was the increase from 168 to 180 with Skylake.

https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

If you have a source I'm happy to read it but otherwise I think you're confused. Especially about Intel client and server cores having different numbers of registers. The lowest level difference between them I've heard of that wasn't features being fused off is different L3 cache sizes.


Yeah, all the public documents I've seen say the Sunny Cove PRF has no change from Skylake, so 180 INT registers and 168 FP.


I think I am confused.

I do remember hearing that the physical register file was around 500 registers, but I believe my memory fails me now.


One of the reasons Apple does so well in browser tests is that ARM now has instructions to increase the performance and decrease the power draw of JavaScript operations.


Well, it has one: FJCVTZS (Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero).


It’s simply matching the x86 float-to-int conversion, because JS specifies that behavior in the spec. All this instruction does is level the playing field; it isn’t some magic instruction that does more than x86 does.

At a logic level there are no changes to the expensive part of rounding, only changes to the overflow values in the result.


I thought you were kidding and then I looked it up: https://developer.arm.com/documentation/dui0801/latest/A64-F...

Seems kind of gross to me to have such a language-specific instruction, to be honest.


It’s actually “convert float to int the way x86 does it because js specified that behavior”


WebKit measured this to be an improvement of less than 2%. So it is certainly "one of the reasons", but certainly not a driving one. (Plus, it's ARMv8.3+.)


And to be clear that performance win comes from removing the branches that are otherwise needed to provide x86 semantics


JSC didn't even use that instruction when most of the benchmarking was done. It has absolutely nothing to do with the results; the idea was floated or amplified by Gruber / DaringFireball, and I believe he never went back to correct it even after the facts were shown.

Much like his claim that the $149 AirPods were made at close to BOM cost. And that is how the whole world went on to believe all the wrong information.


That is one instruction, and a more accurate definition of it would be “match the x86 float-to-integer conversion”.

It’s not a complex instruction; it essentially uses an explicit set of non-default rounding flags, all in order to match what x86 does.

So if that instruction does help arm, it is only in getting rid of an advantage x86 had in being the dominant arch 25 years ago.



