This is the thing about Nvidia. Even if some hardware beats them in a benchmark, if it's a popular model, there will be some massively hand-optimized CUDA implementation that blows anything else out of the water.
There are some rare exceptions (like GPT-Fast on AMD thanks to PyTorch's hard work on torch.compile, and only in a narrow use case), but I can't think of a single one for Apple Silicon.
The one being benchmarked here is heavily optimised for Apple Silicon. I think there are a few algorithms that Apple uses (like the one tagging faces on iPhones) that are heavily optimised for Apple's own hardware.
I think Apple's API would be as popular as CUDA if you could rent their chips at scale. They're quite efficient machines that don't need a lot of cooling, so I imagine the OPEX of keeping them running 24/7 in big cloud racks would be pretty low if they were optimised for server usage.
Apple seems to focus their efforts on bringing purpose-built LLMs to Apple machines. I can see why it makes sense (just like Google's attempts to bring Tensor cores to mobile), but there's not much practical use in this technology right now. Whisper is the first usable technology like this, but even my Android phone can live-transcribe speech into text as an accessibility feature, so I don't think Apple can sell Whisper as a product to end users.
> The one being benchmarked here is heavily optimised for Apple Silicon.
I don't think so, not in the sense of a hand-optimized CUDA implementation. This is just using the MLX API in the same way that you'd use CUDA via PyTorch or something.
Apple would need to make rackmount versions of the machines with replaceable storage and maybe RAM, and would really need to beef up their headless management systems for the machines, before they start becoming competitive.
Otherwise you need a whole bunch of custom Mac mini style racks and management software, which really increases costs and lead times. If you don't believe me, look at how expensive AWS macOS machines are compared to Linux ones with equivalent performance.
Those are not at all efficient for a data center in size or cost compared to equivalent Mac minis or Studios on a tray. The rackmount format was made for music production and is very quiet for that reason.
> but I can't think of a single one for Apple Silicon.
The post here is exactly one for Apple Silicon. It compared a naive PyTorch implementation, which may not even keep a 4090 busy (for smaller, not-that-compute-intensive models, having the entire computation driven by Python is... limiting, which is partly why torch.compile gives amazing improvements), to one purpose-optimized for Apple Silicon (for both CPU and GPU efficiency).
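To put rough numbers on the point about not keeping the GPU busy, here is a back-of-the-envelope sketch. It is a toy cost model, not a measurement: it assumes kernel launches are asynchronous, so whichever is slower (the host issuing a kernel or the GPU running it) sets the pace. The function names and the microsecond figures are hypothetical, picked only for illustration.

```python
def utilization(kernel_us: float, dispatch_us: float) -> float:
    """Steady-state fraction of time the GPU is busy in this toy model.

    Assumption: the host needs dispatch_us to issue each kernel and the GPU
    needs kernel_us to run it, so each kernel costs max(kernel_us, dispatch_us)
    of wall time, of which kernel_us is useful GPU work.
    """
    return kernel_us / max(kernel_us, dispatch_us)


def fused_utilization(kernel_us: float, dispatch_us: float, fuse_factor: int) -> float:
    """Same model after merging fuse_factor small kernels into one launch.

    Assumption: the fused kernel does the same total work, and any savings
    from reduced memory traffic are ignored.
    """
    return utilization(kernel_us * fuse_factor, dispatch_us)


# Hypothetical numbers: 2 us of GPU work per kernel, 10 us of host overhead per launch.
eager = utilization(kernel_us=2.0, dispatch_us=10.0)
fused = fused_utilization(kernel_us=2.0, dispatch_us=10.0, fuse_factor=4)

print(f"op-by-op from Python: {eager:.0%} busy")  # 20% busy
print(f"4 kernels fused:      {fused:.0%} busy")  # 80% busy
```

In this model a small network leaves the GPU idle most of the time no matter how fast the GPU is, and cutting the number of launches (roughly what a compiler that fuses operations is after) closes the gap. Once each kernel is big enough that `kernel_us` exceeds `dispatch_us`, the overhead stops mattering.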
I wouldn’t be surprised if a $2k top-of-the-line GPU matches or beats the built-in accelerator on a Mac. Even if the Mac was slightly faster, you could just stick multiple GPUs in a PC.
To me the news here is how well the Mac runs without needing that additional hardware/large power draw on this benchmark.