
Does insanely-fast-whisper use beam size of 5 or 1? And what is the speed comparison when set to 5?

Ideally it also exposes that parameter to the user.

Speed comparisons seem moot to me when quality is sacrificed. I'm working with very poor audio, so transcription quality matters.



It's beam size 1. From my quick tests on a Colab T4, CTranslate2 (faster-whisper's backend) is about 30% faster with like-for-like settings. I decoded the audio, got mel features, split into 30s segments, and ran it batched (beam size 1, batch size 24, no temperature fallback passes). It takes a bit more effort than a CLI utility but isn't too hard.
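For anyone wanting to try the same thing, here is a rough sketch of just the "split into 30s segments and batch" step. It is not the code used for the test above: the helper names `split_into_segments` and `batched` are made up for illustration, it works on raw samples rather than mel features, and the feature extraction and the CTranslate2 call are left out entirely. It assumes 16 kHz mono audio, which is what Whisper expects.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
SEGMENT_SECONDS = 30   # Whisper's fixed input window


def split_into_segments(samples: Sequence[float],
                        sample_rate: int = SAMPLE_RATE,
                        segment_seconds: int = SEGMENT_SECONDS) -> List[List[float]]:
    """Cut audio into fixed-length windows, zero-padding the last one."""
    window = sample_rate * segment_seconds
    segments = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        chunk.extend([0.0] * (window - len(chunk)))  # pad the tail with silence
        segments.append(chunk)
    return segments


def batched(items: Sequence, batch_size: int = 24) -> Iterator[list]:
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each batch would then go through feature extraction and the model in one call. Fixed 30s cuts can split a word in half at a boundary, so real pipelines usually overlap chunks or cut on silence.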

Side note: the insanely-fast-whisper README presents its benchmarks as A100 numbers, but only the FA2 (Flash Attention 2) lines were run on one. Looking at the notebooks/history, the rest were run on a T4. Turing (the T4's architecture) doesn't support FA2, so the gap should be smaller on hardware where FA2 is available, but based on the distil-whisper paper CTranslate2 is probably still faster.

TensorRT-LLM might be faster but I haven't looked into it yet.


Hugging Face Whisper (the backend to insanely-fast-whisper) now supports PyTorch SDPA attention with PyTorch>=2.1.1

It's enabled by default with the latest Transformers version, so just make sure you have:

* torch>=2.1.1

* transformers>=4.36.0
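A quick way to check those two minimums in a given environment, as a sketch using only the standard library. The helper names (`release_tuple`, `meets_minimum`, `check_installed`) are made up here, and the comparison deliberately ignores pre-release and local suffixes such as `.dev0` or `+cu118`, so treat it as a rough check rather than a full version parser.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions quoted in the comment above.
MINIMUMS = {"torch": "2.1.1", "transformers": "4.36.0"}


def release_tuple(version: str) -> Tuple[int, ...]:
    """Leading numeric components only, e.g. '2.1.1+cu118' -> (2, 1, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if installed >= minimum, comparing numeric components."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # so '2.1' compares as '2.1.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None
        else:
            report[package] = meets_minimum(installed, minimum)
    return report
```

Calling `check_installed()` returns a dict keyed by package name; a `False` or `None` entry means that package needs upgrading or installing.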


Nice, thanks for your work on everything Whisper related. I tested it a couple of weeks ago, and the results largely matched those in the insanely-fast-whisper notebook. The comparison was with BetterTransformer.

I just reran the notebook with 4.36.1 (minus the to_bettertransformer line) but it was slower (the batch size 24 section took 8 vs 5 min). Is there something I need to change? Going back to 4.35.2 gives the old numbers so the T4 instance seems fine.


Our comparisons were a little while ago, so I apologize, I can’t remember whether we used beam size 1 or 5. Whichever we picked, we were consistent across models.

Insanely fast whisper (god I hate the name) is really a CLI around Transformers’ whisper pipeline, so you can just use that and use any of the settings Transformers exposes, which includes beam size.

We also deal with very poor audio, which is one of the reasons we went with faster-whisper. However, we have identified failure modes in faster-whisper that are only present because of the conditioning on the previous segment, so everything is really a trade-off.


Indeed, insanely-fast-whisper supports beam-search with a small code modification to this code snippet: https://huggingface.co/openai/whisper-large-v3

Just call the pipeline with:

result = pipe(sample, generate_kwargs={"num_beams": 5})
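For completeness, a fuller sketch of what that call might look like end to end, since `pipe` and `sample` are not defined in the one-liner. The model name comes from the linked model card; the fp16 dtype, the `cuda:0` device, the batch size of 24, and the helper names `beam_search_kwargs` and `transcribe` are assumptions for illustration, not the snippet from the model card. The heavy imports sit inside the function so the file can be imported on a machine without torch installed.

```python
from typing import Any, Dict


def beam_search_kwargs(num_beams: int = 5) -> Dict[str, Any]:
    """Generation settings forwarded to model.generate() by the pipeline."""
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    return {"num_beams": num_beams}


def transcribe(audio_path: str, num_beams: int = 5, batch_size: int = 24) -> str:
    """Transcribe one audio file with chunked, batched beam-search decoding."""
    # Imported lazily: both packages are large and need to be installed separately.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,   # assumption: a GPU with fp16 support
        device="cuda:0",             # assumption: a single CUDA device
    )
    result = pipe(
        audio_path,
        chunk_length_s=30,           # Whisper's 30s window
        batch_size=batch_size,
        generate_kwargs=beam_search_kwargs(num_beams),
    )
    return result["text"]
```

Beam search multiplies the decoder work per chunk, so the batch size may need to come down to fit in memory on a smaller GPU like the T4.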



