In this post, we evaluate OpenAI's Whisper models for Urdu Automatic Speech Recognition (ASR) on two different datasets, and then finetune the best-performing model to push its accuracy further.
Whisper Models Evaluated
We evaluated four Whisper model sizes to understand their performance characteristics on Urdu speech recognition:
- Whisper Tiny
- Whisper Small
- Whisper Medium
- Whisper Large v2
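For context, the sketch below shows one minimal way to run these checkpoints on Urdu audio with the Hugging Face transformers pipeline. The openai/whisper-* names are the public releases; the transcribe_urdu helper and the device handling are illustrative, not our exact evaluation harness.

```python
# Minimal sketch: transcribing Urdu audio with the public Whisper checkpoints
# via the Hugging Face transformers pipeline (assumes a recent transformers
# release that accepts language/task through generate_kwargs).
import torch
from transformers import pipeline

MODEL_SIZES = ["tiny", "small", "medium", "large-v2"]

def transcribe_urdu(audio_paths, model_size="large-v2"):
    """Transcribe a list of audio files with one Whisper checkpoint."""
    asr = pipeline(
        "automatic-speech-recognition",
        model=f"openai/whisper-{model_size}",
        device=0 if torch.cuda.is_available() else -1,
    )
    # Pin the language and task so Whisper transcribes Urdu instead of
    # auto-detecting the language or translating to English.
    return asr(audio_paths, generate_kwargs={"language": "urdu", "task": "transcribe"})
```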
Model Finetuning
We finetuned the Whisper Large v2 model to improve its performance on Urdu speech recognition:
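For readers who want to reproduce something similar, the sketch below follows the widely used Hugging Face Seq2SeqTrainer recipe for Whisper. The hyperparameters, output path, and the prepared train_ds / eval_ds datasets (log-mel features plus tokenized Urdu labels) are placeholders, not our exact configuration.

```python
# Condensed sketch of a Whisper Large v2 finetuning setup for Urdu, based on
# the standard Hugging Face Seq2SeqTrainer recipe. All values are illustrative.
from dataclasses import dataclass

from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


@dataclass
class DataCollatorSpeechSeq2Seq:
    """Pads log-mel input features and tokenized labels into a training batch."""

    processor: WhisperProcessor

    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss. Full recipes
        # also strip a leading BOS token here if the tokenizer prepended one.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch


def finetune_whisper_urdu(train_ds, eval_ds):
    """Finetune Whisper Large v2 on prepared Urdu data (features + label ids)."""
    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-large-v2", language="urdu", task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
    # Recent transformers versions; older ones set forced_decoder_ids instead.
    model.generation_config.language = "urdu"
    model.generation_config.task = "transcribe"

    args = Seq2SeqTrainingArguments(
        output_dir="whisper-large-v2-urdu",  # illustrative path and values below
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=1e-5,
        warmup_steps=500,
        max_steps=4000,
        fp16=True,
        predict_with_generate=True,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=DataCollatorSpeechSeq2Seq(processor),
        tokenizer=processor.feature_extractor,
    )
    trainer.train()
    return trainer
```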
Datasets Used for Evaluation
We evaluated the models on two distinct Urdu speech datasets to understand their performance across different speech characteristics and recording conditions.
Common Voice Urdu
Description: Mozilla's Common Voice dataset contains crowd-sourced speech recordings in many languages, including Urdu. The Urdu portion contains about 20 hours of validated speech from diverse speakers.
Characteristics: Clean recordings with varied accents and speaking styles. Mostly read speech from news articles and other written content.
Size: ~20 hours, ~12,000 utterances
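Loading this split is straightforward with the Hugging Face datasets library. The snippet below is a small sketch; the exact dataset version string is an assumption, gated Common Voice releases require accepting the terms on the Hub, and audio is cast to the 16 kHz rate Whisper expects.

```python
# Sketch: loading the Urdu test split of Common Voice with the datasets library.
# The dataset id/version is an assumption; gated Common Voice releases need an
# authenticated Hugging Face account that has accepted the dataset terms.
from datasets import Audio, load_dataset

cv_urdu = load_dataset("mozilla-foundation/common_voice_13_0", "ur", split="test")
# Whisper's feature extractor expects 16 kHz audio.
cv_urdu = cv_urdu.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_urdu[0]
print(sample["sentence"])               # reference Urdu transcription
print(sample["audio"]["array"].shape)   # decoded 16 kHz waveform
```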
NS-ASR Dataset
Description: Our proprietary NS-ASR dataset, collected specifically for Urdu speech recognition, contains a mix of read speech and spontaneous conversation.
Characteristics: Includes both formal and informal speech patterns, with some recordings featuring background noise and varying audio quality.
Size: ~25 hours, ~15,000 utterances
Unique Features: Contains domain-specific vocabulary from healthcare, finance, and customer service domains.
Evaluation Metrics
We used the following metrics to evaluate model performance:
- Word Error Rate (WER): The standard metric for ASR performance, calculated as (Substitutions + Insertions + Deletions) / total words in the reference
- Error Type Analysis: Breakdown of total substitutions, deletions, and insertions (a short sketch of how both are computed follows this list)
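Both numbers fall out of a single alignment between reference and hypothesis. The sketch below uses the jiwer library (3.x API) to compute them; the Urdu strings are made-up single-sentence examples, not data from either test set.

```python
# Sketch: computing WER and the error-type breakdown with jiwer (3.x API).
import jiwer

# Illustrative reference and hypothesis ("this is an example"); the hypothesis
# misspells one word, giving one substitution out of four reference words.
references = ["یہ ایک مثال ہے"]
hypotheses = ["یہ ایک مسال ہے"]

out = jiwer.process_words(references, hypotheses)
print(f"WER: {out.wer:.2f}")  # (S + I + D) / reference words = 1/4 = 0.25
print(f"substitutions: {out.substitutions}")
print(f"deletions:     {out.deletions}")
print(f"insertions:    {out.insertions}")
```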
Results and Analysis
1. Whisper Large v2 Results
The Large model provided the best overall performance among the standard Whisper models, though with diminishing returns compared to Medium for Urdu.
Common Voice Urdu


NS-ASR Dataset


2. Finetuned Whisper Large v2 Results
Finetuning produced a substantial improvement over the base Large v2 model on both datasets.
Common Voice Urdu


NS-ASR Dataset


Visual Inspection of Transcription Results
To complement our quantitative analysis, we performed a visual inspection of transcription results from different Whisper models. Below are sample audio clips along with their reference Urdu text and model transcriptions.
Sample 1: Common Voice Urdu
Sample 2: Common Voice Urdu
Sample 3: NS-ASR Dataset
Sample 4: NS-ASR Dataset
Sample 5: NS-ASR Dataset
Key Findings

Performance Across Model Sizes
The results show a large improvement from Tiny to Small, with more gradual gains from Small to Medium and Large. The finetuned model achieves a further marked improvement over all of the base models.
Dataset Comparison
Performance was generally better on the NS-ASR dataset than on Common Voice, likely because:
- The finetuned model was trained on data drawn from NS-ASR
- Common Voice contains a different mix of accents
- Noise and low audio quality affect some recordings
Error Type Analysis
Across all models, substitutions were the most common error type, followed by deletions and then insertions. The ratio of substitutions to other error types increased with model size, suggesting larger models are better at avoiding complete misses (deletions) but still struggle with similar-sounding words.
Conclusion
Our evaluation of Whisper models for Urdu ASR shows that:
- The Tiny model shows poor performance on both datasets.
- The Medium model offers the best accuracy-to-compute trade-off when high accuracy is required
- The finetuned model outperforms all base Whisper models on Urdu ASR.
- Our NS-ASR dataset reveals important performance characteristics not visible in Common Voice evaluations
At Nucleosight, we're using these insights to optimize our Urdu speech recognition pipeline and develop specialized models for domain-specific applications.