Could somebody explain how the M1 can reach 900 GFlops or more? My knowledge is ...

littlestymaar · on March 4, 2021

It runs on the GPU (as written in the title).

Also, this assumption is wrong too:

> and each instruction needs at least 1 clocktick

Modern[1] CPUs use a superscalar architecture[2], which allows them to execute more than one instruction per clock cycle (usually 4 or more[3]).

[1]: well, it's been the case for the past decades, actually.
[2]: https://kb.iu.edu/d/aett
[3]: https://stackoverflow.com/questions/37041009/what-is-the-max...

misja111 · on March 4, 2021

Thanks. As you say, it must be running on the GPU then, because even with 5 superscalar instructions per clock tick the M1 CPU wouldn't be anywhere near 900 GFlops. And besides, isn't the thing that makes the M1 so fast the fact that it's a RISC processor that is -not- superscalar?

The headline suggested this had something to do with the specific qualities of the M1, but apparently it has nothing to do with the M1 CPU? Any modern GPU can easily reach 1 TFlop nowadays.
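A back-of-envelope version of this estimate, as a sketch in Python. Every hardware figure here is an assumption for illustration, not an official spec: 4 performance cores at about 3.2 GHz issuing 5 instructions per cycle with each instruction counted as one scalar FLOP, and a GPU with 8 cores of 128 FP32 ALUs at about 1.28 GHz, each ALU doing one fused multiply-add (counted as 2 FLOPs) per cycle.

```python
# Back-of-envelope peak throughput estimates.
# All hardware figures below are illustrative assumptions, not measured or official specs.

def naive_cpu_gflops(cores: int, clock_ghz: float, ipc: float) -> float:
    """Peak GFLOPS if every instruction retires exactly one scalar FLOP."""
    return cores * clock_ghz * ipc

def gpu_gflops(alus: int, clock_ghz: float, flops_per_alu_cycle: int = 2) -> float:
    """Peak GFLOPS for a GPU whose ALUs each retire one fused multiply-add
    (conventionally counted as 2 FLOPs) per cycle."""
    return alus * clock_ghz * flops_per_alu_cycle

# Assumed: 4 performance cores at ~3.2 GHz, 5 instructions per cycle.
cpu = naive_cpu_gflops(cores=4, clock_ghz=3.2, ipc=5)
# Assumed: 8 GPU cores x 128 FP32 ALUs at ~1.28 GHz.
gpu = gpu_gflops(alus=8 * 128, clock_ghz=1.28)

print(f"naive CPU estimate: {cpu:.0f} GFLOPS")  # 64
print(f"GPU estimate:       {gpu:.0f} GFLOPS")  # 2621
```

Under these assumptions the scalar CPU estimate is more than an order of magnitude short of 900 GFlops, while the GPU estimate clears it comfortably. Peak figures like these ignore memory bandwidth and scheduling, so real workloads land below them.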

littlestymaar · on March 4, 2021

> And besides, isn't the thing that makes the M1 so fast the fact that it's a RISC processor that is -not- superscalar?

I don't know where you got the idea that RISC = “not superscalar”, but it's a wrong one. There are a lot of superscalar ARM CPUs out there [1].

> The headline suggested this had anything to do with the specific qualities of the M1 but apparently this has nothing to do with the M1 CPU?

The M1 is a SoC, with a CPU and a GPU (and other things) on the same chip.

> Any modern GPU can easily reach 1 TFlop nowadays.

Yep, but it's still quite a feat to do this on a low-power mobile SoC.

[1]: https://en.wikipedia.org/wiki/List_of_ARM_microarchitectures

avianes · on March 4, 2021

Modern processors execute more than one instruction per cycle by executing several instructions in parallel on each core.

In sequential code, it's common for some instructions to be independent of each other, so they could in theory be executed in parallel; this is measured as ILP (Instruction Level Parallelism). Modern processors exploit ILP by detecting dependencies between instructions and executing independent instructions in parallel. These are superscalar processors.
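A toy model of this, as a hedged sketch: the instruction format and the `schedule` function are hypothetical, and the core is idealized (1-cycle latency, unlimited lookahead, and register renaming, so only true read-after-write dependencies matter). Each instruction issues in the earliest cycle where its inputs are ready and an issue slot is still free.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instruction format: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]

def schedule(program: List[Instr], width: int) -> List[List[int]]:
    """Return, per cycle, the indices of the instructions issued in that cycle.
    Idealized: 1-cycle latency, at most `width` issues per cycle, and only
    read-after-write dependencies are respected (as if registers were renamed)."""
    ready_at: Dict[str, int] = {}  # register -> first cycle its value can be read
    cycles: List[List[int]] = []
    for idx, (dest, srcs) in enumerate(program):
        # Earliest cycle in which all source values are available.
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        # Move forward until a cycle with a free issue slot is found.
        while cycle < len(cycles) and len(cycles[cycle]) >= width:
            cycle += 1
        while len(cycles) <= cycle:
            cycles.append([])
        cycles[cycle].append(idx)
        ready_at[dest] = cycle + 1  # result is readable one cycle later
    return cycles

# Each instruction needs the previous result: no ILP.
chain = [("a", ("x",)), ("b", ("a",)), ("c", ("b",)), ("d", ("c",))]
# Four independent instructions: ILP of 4.
indep = [("a", ("x",)), ("b", ("y",)), ("c", ("z",)), ("d", ("w",))]

print(schedule(chain, width=4))  # [[0], [1], [2], [3]]
print(schedule(indep, width=4))  # [[0, 1, 2, 3]]
print(schedule(indep, width=2))  # [[0, 1], [2, 3]]
```

The dependent chain takes 4 cycles no matter how wide the core is, while the independent instructions fit in a single cycle on a 4-wide core: the ILP available in the code, not just the issue width, bounds the speedup.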

In addition, some instruction set extensions add instructions that operate on several data elements at the same time. These are called SIMD extensions (Single Instruction, Multiple Data).
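A minimal model of the SIMD idea in Python. Python itself won't emit SIMD instructions here; the functions only count how many "instructions" each approach would need, assuming one add per element for scalar code and one add per group of lanes for SIMD code. The default of 4 lanes is an assumption matching a 128-bit register holding 32-bit floats, which is the width of ARM NEON registers.

```python
from typing import List, Tuple

def scalar_add(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Scalar model: one add instruction per element."""
    out: List[float] = []
    instructions = 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions

def simd_add(a: List[float], b: List[float], lanes: int = 4) -> Tuple[List[float], int]:
    """SIMD model: each instruction adds up to `lanes` element pairs at once
    (a partial final chunk still costs one instruction)."""
    out: List[float] = []
    instructions = 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions

a = [float(i) for i in range(10)]
b = [1.0] * 10
s, n_scalar = scalar_add(a, b)
v, n_simd = simd_add(a, b)
print(s == v, n_scalar, n_simd)  # True 10 3
```

Combined with superscalar issue (several SIMD units per core) and fused multiply-add, this is how per-core FLOP rates grow well beyond one operation per clock cycle.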