Hello Learners…
Welcome to the blog…
Table Of Contents
- Introduction
- Why Distil-Whisper?
- Distil-Whisper: A New Speech To Text Model By HuggingFace
- What’s the Difference Between Distil-Whisper and OpenAI’s Whisper Model?
- How Does Distil-Whisper Perform So Well?
- Why use Distil-Whisper?
- Summary
- References
Introduction
In this post, we discuss Distil-Whisper, a new speech-to-text model by Hugging Face.
Why Distil-Whisper?
OpenAI’s Whisper yields astonishing accuracy for most audio, but it’s too slow and expensive for most production use cases. In addition, it has a tendency to hallucinate.
Distil-Whisper: A New Speech To Text Model By HuggingFace
Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% WER of Whisper on out-of-distribution evaluation sets.
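To get a feel for it, here is a minimal usage sketch in Python, assuming the distil-whisper/distil-large-v2 checkpoint on the Hugging Face Hub and the standard transformers ASR pipeline; the audio file name is a placeholder:

```python
# Minimal sketch: transcribe an English audio file with Distil-Whisper.
# The checkpoint name and the sample file are assumptions for illustration.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
)

# Replace "sample.wav" with the path to your own English recording.
result = asr("sample.wav")
print(result["text"])
```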
What’s the Difference Between Distil-Whisper and OpenAI’s Whisper Model?
- 6x faster than the original Whisper from OpenAI
- Equivalent accuracy to Whisper Large
- Distilled on 20k hours of open-source audio data
- New architecture with a frozen encoder and a reduced number of decoder layers
- Outperforms Whisper on long-form audio
- Within 1% WER of Whisper on held-out evaluation sets
- To be released under an MIT license
How Does Distil-Whisper Perform So Well?
Encoding takes O(1) forward passes, while autoregressive decoding takes O(N) passes, one per generated token, so reducing the number of decoder layers is roughly N times more effective than reducing encoder layers.
They keep the whole encoder but retain only two decoder layers.
The encoder is frozen during distillation to ensure Whisper’s robustness to noise is kept.
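To make that concrete, here is a hedged sketch in Python of how such a student could be assembled from a Whisper teacher: the full encoder is kept and frozen, and only two decoder layers are retained. The checkpoint name and the choice of the first and last decoder layers are illustrative assumptions, not the exact Distil-Whisper recipe.

```python
# Sketch: build a 2-decoder-layer student from a Whisper teacher.
# Checkpoint name and the "first and last layer" choice are illustrative assumptions.
import copy

import torch
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
student = copy.deepcopy(teacher)

# Keep only two of the teacher's decoder layers (here: the first and the last).
layers = student.model.decoder.layers
student.model.decoder.layers = torch.nn.ModuleList([layers[0], layers[-1]])
student.config.decoder_layers = 2

# Freeze the encoder so Whisper's robustness to noise is preserved during distillation.
for param in student.model.encoder.parameters():
    param.requires_grad = False
```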
To make sure Distil-Whisper does not inherit Whisper’s hallucinations, they filter out all training samples whose pseudo-labels exceed a certain WER threshold when compared against the ground-truth transcriptions. By doing so, they are able to reduce hallucinations and actually beat the teacher on long-form audio evaluation.
Distil-Whisper is trained on a knowledge distillation objective. Specifically, it is trained to minimize the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on pseudo-labeled audio data.
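A minimal sketch of such an objective is shown below: a weighted sum of a KL term against the teacher and a cross-entropy term on the pseudo-labels. The weights, temperature, and tensor shapes are illustrative assumptions, not the exact values used to train Distil-Whisper.

```python
# Sketch of a knowledge-distillation objective: KL divergence against the
# teacher's token distribution plus cross-entropy on Whisper's pseudo-labels.
# The weights and temperature below are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_label_ids,
                      kl_weight=1.0, ce_weight=1.0, temperature=2.0):
    # KL divergence between the student's and teacher's softened distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Cross-entropy of the student against the pseudo-labels
    # (padded positions are assumed to be set to -100).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        pseudo_label_ids.view(-1),
        ignore_index=-100,
    )
    return kl_weight * kl + ce_weight * ce

# Toy example with a batch of 2 sequences, 5 tokens, and a vocabulary of 100.
student_logits = torch.randn(2, 5, 100)
teacher_logits = torch.randn(2, 5, 100)
pseudo_label_ids = torch.randint(0, 100, (2, 5))
print(distillation_loss(student_logits, teacher_logits, pseudo_label_ids))
```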
They train Distil-Whisper on a total of 22k hours of pseudo-labeled audio data, spanning 10 domains with over 18k speakers.
This diverse audio dataset is paramount to ensuring the robustness of Distil-Whisper to different datasets and domains.
In addition, they use a WER filter to discard pseudo-labels where Whisper mis-transcribes or hallucinates. This greatly improves the WER performance of the downstream distilled model.
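Here is a hedged sketch of what such a WER filter could look like, using the jiwer library to compute WER; the 10% threshold and the field names are illustrative assumptions.

```python
# Sketch of a WER-based filter: pseudo-labels whose WER against the
# ground-truth transcription exceeds a threshold are discarded.
# The 10% threshold and the field names are illustrative assumptions.
import jiwer

def filter_pseudo_labels(samples, max_wer=0.10):
    """Keep only samples whose Whisper pseudo-label closely matches the ground truth."""
    kept = []
    for sample in samples:
        wer = jiwer.wer(sample["ground_truth"], sample["pseudo_label"])
        if wer <= max_wer:
            kept.append(sample)
    return kept

# Example: the hallucinated (repeated) pseudo-label in the second sample is dropped.
data = [
    {"ground_truth": "the cat sat on the mat",
     "pseudo_label": "the cat sat on the mat"},
    {"ground_truth": "thanks for watching",
     "pseudo_label": "thanks for watching thanks for watching thanks for watching"},
]
print(len(filter_pseudo_labels(data)))  # -> 1
```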
GitHub URL: https://github.com/huggingface/distil-whisper
Why use Distil-Whisper?
Distil-Whisper is designed to be a drop-in replacement for Whisper on English ASR. Here are 4 reasons for making the switch to Distil-Whisper:
1. Faster inference:
6 times faster inference speed, while performing to within 1% WER of Whisper on out-of-distribution audio
2. Robustness to noise
demonstrated by strong WER performance at low signal-to-noise ratios
3. Robustness to hallucinations
quantified by 1.3 times fewer repeated 5-gram word duplicates (5-Dup.) and a 2.1% lower insertion error rate (IER) than Whisper
4. Designed for speculative decoding
Distil-Whisper can be used as an assistant model to Whisper, giving 2 times faster inference speed while mathematically ensuring the same outputs as the Whisper model
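Here is a hedged sketch of that setup using the assisted-generation API in transformers, with Whisper large-v2 as the main model and Distil-Whisper as the assistant; the checkpoint names and the silent placeholder audio are assumptions for illustration.

```python
# Sketch: speculative decoding with Distil-Whisper drafting tokens for Whisper.
# Checkpoint names and the dummy audio below are illustrative assumptions.
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to(device)
assistant = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v2").to(device)

# Replace this silent placeholder with real 16 kHz mono audio samples.
audio = np.zeros(16_000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to(device)

# The assistant drafts tokens cheaply and the main model verifies them,
# so the output matches running Whisper alone, only faster.
generated_ids = model.generate(**inputs, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```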
Summary
In conclusion, Distil-Whisper, the latest addition to Hugging Face’s remarkable lineup of open models, brings a new level of efficiency and accuracy to speech-to-text conversion.
Also, you can learn more about these types of models from the below URL:
Happy Learning And Keep Learning…
Thank You…