Also, this is not comparing against an optimized Nvidia implementation. There are faster implementations of Whisper.
Edit: OK I took the bait. I downloaded the 10 minute file he used and ran it on my 4090 with insanely-fast-whisper, which took two commands to install. Using whisper-large-v3 the file is transcribed in less than eight seconds. Fifteen seconds if you include the model loading time before transcription starts (obviously this extra time does not depend on the length of the audio file).
That makes the 4090 somewhere between 6 and 12 times faster than Apple's best. It's also much cheaper than M2 Ultra if you already have a gaming PC to put it in, and still cheaper even if you buy a whole prebuilt PC with it.
This should not be surprising to people, but I see a lot of wishful thinking here from people who own high end Macs and want to believe they are good at everything. Yes, Apple's M-series chips are very impressive and the large RAM is great, but they are not competitive with Nvidia at the high end for ML.
You mean 'https://github.com/Vaibhavs10/insanely-fast-whisper'? I did not know about that until now. I've been running all of that for over 10 months and just had it running. Happy to try that out. The GPU is fully utilized using Whisper and PyTorch with CUDA. Thanks for the link.
Surely the fact that IFW uses batching makes it apples to oranges? The MLX-enabled version didn’t batch, did it? That fundamentally changes the nature of the operation. Wouldn’t the better comparison be faster-whisper?
I don't know exactly what the MLX version did, but you're probably right. I'd love to see the MLX side optimized to the max as well. I'm confident that it would not reach the performance of the 4090, but it might do better.
That said, for practical purposes, the ready availability of Nvidia-optimized versions of every ML system is a big advantage in itself.
Yeah, I think everyone knows that Nvidia is doing a cracker job. But it is good to just be specific about these benchmarks because numbers get thrown around and it turns out people are testing different things. The other thing is that Apple is extracting this performance on a laptop, at ~1/8 the power draw of the desktop Nvidia card.
In any event, it’s super cool to see such huge leaps just in the past year on how easy it is to run this stuff locally. Certainly looking very promising.
The M2 Ultra that got the best numbers that I was comparing to is not in a laptop. Regardless, you're probably right that the power consumption is significantly lower per unit time. However, is it lower per unit work done? It would be interesting to see a benchmark optimized for power. Nvidia's power consumption can usually be significantly reduced without much performance cost.
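To make the "per unit work" question concrete: energy per job is average power times wall-clock time, so the ratio between two machines is the power ratio times the time ratio. The sketch below plugs in only the rough figures quoted in this thread (~1/8 the power draw, 6 to 12 times slower). Those numbers come from different machines and were not measured on the same workload, so treat this as illustrative arithmetic and not a benchmark.

```python
def relative_energy_per_job(power_ratio: float, time_ratio: float) -> float:
    """Energy used by machine A per job, relative to machine B.

    power_ratio: A's average power draw divided by B's.
    time_ratio:  A's time per job divided by B's.
    Energy = power * time, so the two ratios simply multiply.
    """
    return power_ratio * time_ratio


# Rough figures quoted in the thread, not measurements:
# the Mac draws ~1/8 the power and takes 6x to 12x as long per file.
best_case = relative_energy_per_job(1 / 8, 6)
worst_case = relative_energy_per_job(1 / 8, 12)
print(f"Mac energy per job relative to the 4090: {best_case:.2f}x to {worst_case:.2f}x")
```

With those inputs the two land in the same ballpark per transcription, which is why a benchmark tuned for power would be needed to settle it either way.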
Also, the price difference between a prebuilt 4090 PC and a M2 Ultra Mac Studio can buy a lot of kilowatt hours.
Would you be so kind as to link to a guide for your method or share it in a comment yourself?
I installed it following the official docs and found it much, much slower, although I sadly don't have a 4090, only a 3080 Ti 12GB (just big enough to load the large Whisper model into GPU memory).
I just ran it again and happened to get an even better time, under 7 seconds without loading and 13.08 seconds including loading. In case anyone is curious about the use of Flash Attention, I tried without it and transcription took under 10 seconds, 15.3 including loading.
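For anyone who wants to see the general technique being discussed, here is a minimal sketch of chunked, batched Whisper decoding through the Hugging Face transformers pipeline, with Flash Attention as an optional switch. This is not the exact command line used for the timings above. It assumes torch, transformers, and a CUDA GPU are available, that the flash-attn package is installed when the flag is on, and that "sdpa" is an acceptable fallback attention implementation in your transformers version. The batch size of 24 and the helper names are illustrative choices.

```python
def attention_kwargs(use_flash_attention: bool) -> dict:
    """Model kwargs that select the attention implementation."""
    impl = "flash_attention_2" if use_flash_attention else "sdpa"
    return {"attn_implementation": impl}


def transcribe_batched(audio_path: str,
                       use_flash_attention: bool = True,
                       batch_size: int = 24) -> str:
    """Transcribe one file with chunked, batched decoding on the GPU."""
    # Imported lazily so the helper above works without a GPU stack installed.
    import torch
    from transformers import pipeline

    # Model loading happens here; this cost does not depend on audio length.
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
        model_kwargs=attention_kwargs(use_flash_attention),
    )
    # The audio is cut into 30-second chunks and several chunks are
    # decoded per forward pass, which is where the speedup comes from.
    result = pipe(audio_path, chunk_length_s=30, batch_size=batch_size,
                  return_timestamps=True)
    return result["text"]
```

Calling transcribe_batched("audio.mp3", use_flash_attention=False) is the rough equivalent of the "without Flash Attention" run. Lower batch_size if a 12GB card runs out of memory.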
Another question that's only slightly related, but while we're here...
Using OAI's paid Whisper API, you can give a text prompt to a) set the tone/style of the transcription and b) teach it technical terms, names etc that it might not be familiar with and should expect in the audio to transcribe.
Am I correct that this isn't possible with any released versions of Whisper, or is there a way to do it on my machine that I'm not aware of?
You can definitely do this with the open source version. Many transcription implementations use it to maintain context between the max-30-second chunks Whisper natively supports.
I'll try to understand some of how stuff like faster-whisper works when I've got time over the weekend, but I fear it may be too complex for me...
I was rather hoping for a guide of just how to either adapt classic whisper usage or adapt one of the optimised ones like faster-whisper (which I've just set up in a docker container but that's used up all the time I've got for playing around right now) to take a text prompt with the audio file.
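A minimal sketch of the prompt idea with the reference openai-whisper package, whose transcribe method takes an initial_prompt argument. The build_initial_prompt helper and its "Glossary:" wording are hypothetical conveniences, not part of any library; the prompt is just text the decoder is conditioned on, so phrasing is free-form. It assumes openai-whisper is installed in a version that ships the large-v3 weights.

```python
def build_initial_prompt(terms: list[str], style_hint: str = "") -> str:
    """Join an optional style hint and a glossary into one prompt string."""
    glossary = ", ".join(terms)
    glossary_part = f"Glossary: {glossary}." if glossary else ""
    parts = [p for p in (style_hint.strip(), glossary_part) if p]
    return " ".join(parts)


def transcribe_with_prompt(audio_path: str, prompt: str,
                           model_name: str = "large-v3") -> str:
    """Transcribe a file while biasing the decoder with a text prompt."""
    # Imported lazily so build_initial_prompt works without the package.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)
    # initial_prompt is fed to the decoder as preceding context for the first
    # window; it nudges spelling and style but is not a hard constraint.
    result = model.transcribe(audio_path, initial_prompt=prompt)
    return result["text"]
```

For example, transcribe_with_prompt("meeting.mp3", build_initial_prompt(["MLX", "CTranslate2"], "A technical discussion.")). As far as I know, faster-whisper's WhisperModel.transcribe accepts an initial_prompt argument in the same way, so the helper should carry over to that setup.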
Cheers, I've been wanting to get into doing something else with my 4090 other than multi-monitor simulator gaming and quad-screen workstation work, and this will get me kicked off!
The 4090 is an absolute beast, runs extremely quiet and simply powers through everything. DCS pushes it to the limit, but the resulting experience is simply stunning. Mine's coupled to a 7800x3d which uses hardly any power at all, absolutely love it.
If you're looking for something easy to try out, try my early demo that hooks Whisper to an LLM and TTS so you can have a real time speech conversation with your local GPU that feels like talking to a person! It's way faster than ChatGPT: https://apps.microsoft.com/detail/9NC624PBFGB7
I just can't get it to work, it errors out with 'NotImplementedError: The model type whisper is not yet supported to be used with BetterTransformer.' Did you happen to run into this problem?
Sorry, I didn't encounter that error. It worked on the first try for me. I have wished many times that the ML community didn't settle on Python for this reason...
Yes, H100 would be faster still, and Grace Hopper perhaps even somewhat faster. But Apple doesn't have comparable datacenter-only products, so it's also interesting to see the comparison of consumer hardware. Also, the 4090 is cheaper than Apple's best, but the H100 costs more than both (if you are even allowed to buy it).
I’m afraid the article as well as your benchmarks can be misleading because there are a lot of different whisper implementations out there.
For example, the CTranslate2-optimized Whisper “implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering…”
Intuitively, I would agree with your conclusion about Apple’s M-Series being impressive for what they do but not generally competitive with Nvidia in ML.
Objectively however, I don’t see concluding much with what’s on offer here. Once you start changing libraries, kernels, transformer code, etc you end up with an apples to oranges comparison.
I think it's fair to compare the fastest available implementation on both platforms. I suspect that the MLX version can be optimized further. However, it will not close a 10x gap.
Maybe I’m not seeing it right, but comparing the source of Apple’s Whisper to the Python Whisper, there seem to be only minimal changes, which redirect certain operations to MLX.
There is also whisper.cpp (https://github.com/ggerganov/whisper.cpp), which seems to have its own kind of optimizations for Apple Silicon. I don’t think this was the one used with Nvidia during the test.
I don't think whisper was optimized for apple silicon. Doesn't it just use MLX? I mean if using an API for a platform counts as specifically optimized then the Nvidia version is "optimized" as well since it's probably using CUDA.