Also, this is not comparing against an optimized Nvidia implementation. There ar...

owehrens · on Dec 13, 2023

You mean 'https://github.com/Vaibhavs10/insanely-fast-whisper'? I didn't know about that until now. I've been running all of that for over ~10 months and just kept it running. Happy to try that out. The GPU is fully utilized when running Whisper with PyTorch and CUDA. Thanks for the link.

m463 · on Dec 13, 2023

lol nice:

https://owehrens.com/content/images/size/w1600/2023/12/PerfC...

darkteflon · on Dec 13, 2023

Surely the fact that IFW uses batching makes it apples to oranges? The MLX-enabled version didn’t batch, did it? That fundamentally changes the nature of the operation. Wouldn’t the better comparison be faster-whisper?
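
To make the batching point concrete, here is a minimal sketch (not taken from either project) of how a long recording could be cut into Whisper's 30-second windows and then grouped into batches. `chunk_spans` and `batches` are hypothetical helper names, and the 10-minute duration and batch size of 24 are illustrative values only.

```python
import math

CHUNK_SECONDS = 30  # Whisper natively handles windows of at most 30 seconds


def chunk_spans(duration_s, chunk_s=CHUNK_SECONDS):
    """Return (start, end) second offsets covering the whole recording."""
    n = math.ceil(duration_s / chunk_s)
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s)) for i in range(n)]


def batches(items, batch_size):
    """Group items into lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


spans = chunk_spans(600)                  # hypothetical 10-minute file
sequential_steps = len(spans)             # one chunk at a time, in order
batched_steps = len(batches(spans, 24))   # many chunks decoded side by side
print(sequential_steps, batched_steps)
```

A chunk-by-chunk run can carry the previous chunk's text forward as context, while a batched run has to treat the chunks as independent, so the two are arguably different workloads rather than the same one at different speeds.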

modeless · on Dec 13, 2023

I don't know exactly what the MLX version did, but you're probably right. I'd love to see the MLX side optimized to the max as well. I'm confident that it would not reach the performance of the 4090, but it might do better.

That said, for practical purposes, the ready availability of Nvidia-optimized versions of every ML system is a big advantage in itself.

darkteflon · on Dec 13, 2023

Yeah, I think everyone knows that Nvidia is doing a cracker job. But it is good to just be specific about these benchmarks because numbers get thrown around and it turns out people are testing different things. The other thing is that Apple is extracting this performance on a laptop, at ~1/8 the power draw of the desktop Nvidia card.

In any event, it’s super cool to see such huge leaps just in the past year on how easy it is to run this stuff locally. Certainly looking very promising.

modeless · on Dec 13, 2023

The M2 Ultra that got the best numbers that I was comparing to is not in a laptop. Regardless, you're probably right that the power consumption is significantly lower per unit time. However, is it lower per unit work done? It would be interesting to see a benchmark optimized for power. Nvidia's power consumption can usually be significantly reduced without much performance cost.

Also, the price difference between a prebuilt 4090 PC and a M2 Ultra Mac Studio can buy a lot of kilowatt hours.
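
The per-unit-work question is simple arithmetic: energy is average power multiplied by run time. A rough sketch follows. Every wattage, run time, and price in it is a made-up placeholder rather than a measurement, and the function names are hypothetical.

```python
def energy_wh(avg_power_w, seconds):
    """Energy in watt-hours for one job: power (W) times time (h)."""
    return avg_power_w * seconds / 3600


def kwh_for_budget(price_gap, price_per_kwh):
    """How many kilowatt-hours a given price difference would pay for."""
    return price_gap / price_per_kwh


# Placeholder inputs, NOT measured values.
fast_high_power = energy_wh(avg_power_w=400, seconds=8)
slow_low_power = energy_wh(avg_power_w=50, seconds=80)
print(fast_high_power, slow_low_power)

# Placeholder price gap and electricity price, in the same currency.
print(kwh_for_budget(price_gap=1000, price_per_kwh=0.25))
```

With these placeholders, the device drawing 8x the power still uses less energy per job because it finishes 10x sooner. Real measurements could go either way, which is why a power-optimized benchmark would be interesting.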

swores · on Dec 13, 2023

Would you be so kind as to link to a guide for your method or share it in a comment yourself?

I installed following the official docs and found it much, much slower, although I sadly don't have a 4090, instead a 3080 Ti 12GB (just big enough to load the large whisper model into GPU memory).

modeless · on Dec 13, 2023

I'm running on Linux with a 13900k, 64 GB RAM, and I already have CUDA installed. Install commands directly from the README:

    pipx install insanely-fast-whisper

    pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation

To transcribe the file:

    insanely-fast-whisper --flash True --file-name ~/Downloads/podcast_1652_was_jetzt_episode_1289963_update_warum_streiken_sie_schon_wieder_herr_zugchef.mp3 --language german --model-name openai/whisper-large-v3

The file can be downloaded at: https://adswizz.podigee-cdn.net/version/1702050198/media/pod...

I just ran it again and happened to get an even better time, under 7 seconds without loading and 13.08 seconds including loading. In case anyone is curious about the use of Flash Attention, I tried without it and transcription took under 10 seconds, 15.3 including loading.
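
For anyone who wants to report "with loading" and "without loading" numbers the same way, here is a minimal timing sketch. It is not what insanely-fast-whisper does internally. `load_model` and `transcribe` are hypothetical callables you would supply yourself for whichever implementation you are testing.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(load_model, transcribe, audio_path):
    """Measure model loading and transcription as two separate timings."""
    model, load_s = timed(load_model)
    text, run_s = timed(transcribe, model, audio_path)
    return {
        "load_s": load_s,
        "transcribe_s": run_s,
        "total_s": load_s + run_s,
        "text": text,
    }
```

Keeping the two timings separate matters for comparisons like this one, since loading can easily take about as long as the transcription itself on a short file.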

owehrens · on Dec 13, 2023

Thanks so much again. Got it working. 8 seconds. Nvidia is the king. Updated the blog post.

darkteflon · on Dec 13, 2023

I think insanely-fast-whisper uses batching, so faster-whisper (which doesn’t) might be a fairer comparison for the purposes of your post.

swores · on Dec 13, 2023

Thanks!

Another question that's only slightly related, but while we're here...

Using OAI's paid Whisper API, you can give a text prompt to a) set the tone/style of the transcription and b) teach it technical terms, names etc that it might not be familiar with and should expect in the audio to transcribe.

Am I correct that this isn't possible with any released versions of Whisper, or is there a way to do it on my machine that I'm not aware of?

modeless · on Dec 13, 2023

You can definitely do this with the open source version. Many transcription implementations use it to maintain context between the max-30-second chunks Whisper natively supports.
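
A minimal sketch of how that could look with the reference `openai-whisper` Python package, which exposes an `initial_prompt` argument on `transcribe`. `build_prompt` and `transcribe_with_prompt` are hypothetical helper names, the glossary format is just one way to phrase a prompt, and the code assumes the package is installed (`pip install openai-whisper`).

```python
def build_prompt(terms, style_hint=""):
    """Join an optional style hint and a glossary into one short prompt."""
    glossary = ", ".join(terms)
    return f"{style_hint} Glossary: {glossary}.".strip()


def transcribe_with_prompt(audio_path, terms, model_name="large-v3", language=None):
    """Transcribe a file, seeding the decoder with a text prompt."""
    import whisper  # imported lazily; assumes openai-whisper is installed

    model = whisper.load_model(model_name)
    result = model.transcribe(
        audio_path,
        language=language,                   # None lets Whisper detect it
        initial_prompt=build_prompt(terms),  # names and jargon to expect
    )
    return result["text"]
```

Usage would be something like `transcribe_with_prompt("talk.mp3", ["MLX", "CUDA"])`, where the file name is a placeholder. Keep the prompt short, since Whisper only keeps a limited number of prompt tokens. As far as I know, faster-whisper accepts a similarly named `initial_prompt` argument.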

swores · on Dec 13, 2023

I'll try to understand some of how stuff like faster-whisper works when I've got time over the weekend, but I fear it may be too complex for me...

I was rather hoping for a guide of just how to either adapt classic whisper usage or adapt one of the optimised ones like faster-whisper (which I've just set up in a docker container but that's used up all the time I've got for playing around right now) to take a text prompt with the audio file.

sundvor · on Dec 13, 2023

Cheers, I've been wanting to get into doing something else with my 4090 other than multi-monitor simulator gaming and quad-screen workstation work, and this will get me kicked off!

The 4090 is an absolute beast, runs extremely quiet and simply powers through everything. DCS pushes it to the limit, but the resulting experience is simply stunning. Mine's coupled to a 7800x3d which uses hardly any power at all, absolutely love it.

modeless · on Dec 13, 2023

If you're looking for something easy to try out, try my early demo that hooks Whisper to an LLM and TTS so you can have a real time speech conversation with your local GPU that feels like talking to a person! It's way faster than ChatGPT: https://apps.microsoft.com/detail/9NC624PBFGB7

pixelpoet · on Dec 14, 2023

This sounds awesome! Will check it out soon

owehrens · on Dec 13, 2023

I just can't get it to work, it errors out with 'NotImplementedError: The model type whisper is not yet supported to be used with BetterTransformer.' Did you happen to run into this problem?

modeless · on Dec 13, 2023

Sorry, I didn't encounter that error. It worked on the first try for me. I have wished many times that the ML community didn't settle on Python for this reason...

justinclift · on Dec 13, 2023

> Nvidia at the high end for ML.

Wouldn't the high end for Nvidia be their dedicated gear rather than a 4090?

modeless · on Dec 13, 2023

Yes, H100 would be faster still, and Grace Hopper perhaps even somewhat faster. But Apple doesn't have comparable datacenter-only products, so it's also interesting to see the comparison of consumer hardware. Also, 4090 is cheaper than Apple's best, but H100 is more than both (if you are even allowed to buy it).

WhitneyLand · on Dec 13, 2023

I’m afraid the article as well as your benchmarks can be misleading because there are a lot of different whisper implementations out there.

For example, the CTranslate2-optimized Whisper “implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering…”

Intuitively, I would agree with your conclusion about Apple’s M-Series being impressive for what they do but not generally competitive with Nvidia in ML.

Objectively however, I don’t see concluding much with what’s on offer here. Once you start changing libraries, kernels, transformer code, etc you end up with an apples to oranges comparison.

modeless · on Dec 13, 2023

I think it's fair to compare the fastest available implementation on both platforms. I suspect that the MLX version can be optimized further. However, it will not close a 10x gap.

isodev · on Dec 13, 2023

It also wasn’t optimised for Apple Silicon. Given how the different platforms performed in this test, the conclusions seem pretty solid.

jbellis · on Dec 13, 2023

He is literally comparing whisper.cpp on the 4090 with an optimized-for-apple-silicon-by-apple-engineers version on the M1.

ETA: actually it's unclear from the article if the whisper optimizations were done by apple engineers, but it's definitely an optimized version.

isodev · on Dec 13, 2023

Maybe I’m not seeing it right, but comparing the source of Apple’s Whisper to Python Whisper seems there are minimal changes to redirect certain operations to using MLX.

There is also whisper.cpp (https://github.com/ggerganov/whisper.cpp), which seems to have its own kind of optimizations for Apple Silicon. I don’t think this was the one used with Nvidia during the test.

rowanG077 · on Dec 13, 2023

I don't think whisper was optimized for apple silicon. Doesn't it just use MLX? I mean if using an API for a platform counts as specifically optimized then the Nvidia version is "optimized" as well since it's probably using CUDA.