Whisper Large-v3-Turbo Hindi LoRA

A LoRA fine-tuned adapter for openai/whisper-large-v3-turbo optimized for Hindi (Devanagari) speech recognition.

Results

| Model | WER (%) | Eval Set |
|---|---|---|
| openai/whisper-large-v3-turbo (baseline) | 35.56 | FLEURS hi_in test (n=418) |
| + LoRA fine-tune (this model) | 22.25 | FLEURS hi_in test (n=418) |
| + CTranslate2 INT8 deployment | 22.70 | FLEURS hi_in test (n=418) |

37.4% relative WER reduction over the baseline. INT8 deployment via faster-whisper costs only 0.45 percentage points of WER (22.25 → 22.70).
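
The headline figures follow directly from the results table. A minimal sketch of the arithmetic, using only the three WER values reported above (the variable names are illustrative):

```python
# WER values (%) copied from the results table.
baseline_wer = 35.56
lora_wer = 22.25
int8_wer = 22.70

# Relative reduction: the share of the baseline error that the adapter removes.
relative_reduction = (baseline_wer - lora_wer) / baseline_wer * 100

# Cost of INT8 quantization, in absolute percentage points of WER.
int8_penalty_points = int8_wer - lora_wer

# Relative reduction that remains after INT8 quantization.
int8_relative_reduction = (baseline_wer - int8_wer) / baseline_wer * 100

print(f"LoRA vs baseline: {relative_reduction:.1f}% relative reduction")
print(f"INT8 penalty:     {int8_penalty_points:.2f} percentage points")
print(f"INT8 vs baseline: {int8_relative_reduction:.1f}% relative reduction")
```

Keeping the relative figure (a ratio of error rates) separate from the INT8 cost (a difference of error rates) avoids mixing the two kinds of percentage.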

Evaluation uses Whisper-default text normalization. See Normalization Notes below.

Comparison with Other Hindi ASR Models

| Model | WER (%) | Method | Training Data |
|---|---|---|---|
| collabora/whisper-large-v2-hindi | 5.33 | Full fine-tune | Multi-corpus (100h+) |
| vasista22/whisper-hindi-large-v2 | 6.80 | Full fine-tune | Multi-corpus (100h+) |
| openai/whisper-large-v3-turbo | 35.56 | Zero-shot | - |
| This model (LoRA) | 22.25 | LoRA (3.33% params) | FLEURS only (~3.5h) |

Note: The collabora and vasista22 models are full fine-tunes trained on 100+ hours of multi-corpus Hindi data. This model uses only ~3.5 hours of FLEURS data with a lightweight LoRA adapter. That is a fundamentally different trade-off: minimal data and compute in exchange for a significant WER improvement over the zero-shot baseline.

Training Curve

| Step | Train Loss | Eval Loss | Eval WER (%) |
|---|---|---|---|
| 50 | 0.263 | 0.259 | 29.40 |
| 100 | 0.210 | 0.234 | 25.74 |
| 150 | 0.145 | 0.223 | 24.49 |
| 200 | 0.148 | 0.217 | 23.43 |
| 250 | 0.146 | 0.213 | 23.82 |
| 300 | 0.096 | 0.215 | 22.42 |
| 350 | 0.109 | 0.215 | 22.50 |

Best checkpoint: step 300 (lowest validation WER, 22.42%). Test WER: 22.25%.
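
Checkpoint selection by validation WER can be reproduced from the logged curve. The sketch below copies the rows from the training curve and compares two selection criteria; the tuple layout and variable names are illustrative, not taken from the training code.

```python
# (step, train_loss, eval_loss, eval_wer) rows copied from the training curve.
curve = [
    (50, 0.263, 0.259, 29.40),
    (100, 0.210, 0.234, 25.74),
    (150, 0.145, 0.223, 24.49),
    (200, 0.148, 0.217, 23.43),
    (250, 0.146, 0.213, 23.82),
    (300, 0.096, 0.215, 22.42),
    (350, 0.109, 0.215, 22.50),
]

# Criterion used for this model: lowest validation WER.
best_by_wer = min(curve, key=lambda row: row[3])

# Alternative criterion, for comparison only: lowest validation loss.
best_by_loss = min(curve, key=lambda row: row[2])

print(f"lowest eval WER:  step {best_by_wer[0]} ({best_by_wer[3]:.2f}%)")
print(f"lowest eval loss: step {best_by_loss[0]} ({best_by_loss[2]:.3f})")
```

The two criteria disagree here: eval loss bottoms out at step 250, while eval WER is lowest at step 300. Selecting on WER targets the metric that is actually reported.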

How to Use

With PEFT (LoRA adapter)

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

BASE_MODEL = "openai/whisper-large-v3-turbo"
ADAPTER = "Tachyeon/whisper-large-v3-turbo-hindi-lora"

processor = WhisperProcessor.from_pretrained(BASE_MODEL)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model = model.to("cuda").eval()

# Transcribe (audio_array: 16kHz float32 numpy array)
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)

with torch.inference_mode():
    predicted_ids = model.generate(
        input_features, language="hi", task="transcribe"
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

With faster-whisper (merged + CTranslate2)

For production deployment, merge the adapter and convert to CTranslate2:

# Shell: merge LoRA → convert → evaluate
python convert_and_eval.py --lora-dir outputs/whisper-large-v3-turbo-hindi-lora --quant int8 --gpu 0

Then load the converted model with faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("path/to/ct2-model", device="cuda", compute_type="int8")
# transcribe() returns a lazy generator; decoding runs as the segments are consumed
segments, info = model.transcribe("audio.wav", language="hi", beam_size=1)
print(" ".join(seg.text.strip() for seg in segments))

Full pipeline code (data prep → training → deployment): github.com/ipritamdash/whisper-hindi-lora
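
For readers who want to see what the merge-and-convert step involves without the repository script, here is a minimal sketch. It is not the contents of convert_and_eval.py. It assumes PEFT's `merge_and_unload()` and the `ct2-transformers-converter` CLI that ships with CTranslate2, and it assumes the merged directory ends up containing `tokenizer.json` and `preprocessor_config.json`. The function names and directory arguments are hypothetical.

```python
import subprocess


def merge_lora(base_model_id, adapter_id, merged_dir):
    """Fold the LoRA weights into the base model and save a plain checkpoint."""
    # Heavy imports are deferred so the command builder below can be used
    # on a machine without transformers/peft installed.
    from peft import PeftModel
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    base = WhisperForConditionalGeneration.from_pretrained(base_model_id)
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(merged_dir)
    # Save tokenizer and feature-extractor files next to the merged weights.
    WhisperProcessor.from_pretrained(base_model_id).save_pretrained(merged_dir)


def ct2_convert_command(merged_dir, output_dir, quantization="int8"):
    """Build the CTranslate2 converter command for a merged checkpoint."""
    return [
        "ct2-transformers-converter",
        "--model", merged_dir,
        "--output_dir", output_dir,
        "--quantization", quantization,
        # Assumption: these files exist in merged_dir after merge_lora().
        "--copy_files", "tokenizer.json", "preprocessor_config.json",
    ]


def convert(merged_dir, output_dir, quantization="int8"):
    """Run the converter; raises CalledProcessError if it fails."""
    subprocess.run(ct2_convert_command(merged_dir, output_dir, quantization), check=True)
```

A typical call sequence would be `merge_lora(BASE_MODEL, ADAPTER, "merged")` followed by `convert("merged", "ct2-model")`, after which `"ct2-model"` can be passed to `WhisperModel`.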

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 (2x rank) |
| Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable Parameters | 27,852,800 / 836,730,880 (3.33%) |
| Bias | none |

Architecture choice follows LoRA-Whisper (arXiv:2406.06619): encoder+decoder targeting on all linear layers outperforms decoder-only or q/v-only configurations.
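
The trainable-parameter count can be checked by hand. A LoRA adapter on a linear layer of shape d_in × d_out adds r · (d_in + d_out) weights. The sketch below applies that to every targeted module. The architecture dimensions (hidden size 1280, feed-forward size 5120, 32 encoder layers, 4 decoder layers) are taken from the public whisper-large-v3-turbo config rather than from this card, and `LORA_KWARGS` is an illustrative dictionary whose keys match the arguments of `peft.LoraConfig`.

```python
# LoRA settings from the configuration table; with PEFT installed these
# could be passed as peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
}

# Assumption: dimensions from the public whisper-large-v3-turbo config.
D_MODEL, FFN_DIM = 1280, 5120
ENCODER_LAYERS, DECODER_LAYERS = 32, 4


def lora_params(d_in, d_out, r):
    """Weights added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r = LORA_KWARGS["r"]
attn_block = 4 * lora_params(D_MODEL, D_MODEL, r)  # q_proj, k_proj, v_proj, out_proj
mlp_block = lora_params(D_MODEL, FFN_DIM, r) + lora_params(FFN_DIM, D_MODEL, r)  # fc1, fc2

# Encoder layers have self-attention only; decoder layers add cross-attention.
encoder_total = ENCODER_LAYERS * (attn_block + mlp_block)
decoder_total = DECODER_LAYERS * (2 * attn_block + mlp_block)
trainable = encoder_total + decoder_total

total_with_adapter = 836_730_880  # from the configuration table
print(f"trainable: {trainable:,}")
print(f"share:     {100 * trainable / total_with_adapter:.2f}%")
```

The result matches the 27,852,800 trainable parameters reported in the table. Most of them sit in the encoder, because it has 32 layers against the decoder's 4.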

Training Details

| Parameter | Value |
|---|---|
| Base Model | openai/whisper-large-v3-turbo (809M params) |
| Dataset | google/fleurs hi_in |
| Train / Val / Test | 2,120 / 239 / 418 samples |
| Epochs | 3 |
| Learning Rate | 1e-4 (linear decay) |
| Warmup Steps | 50 |
| Batch Size | 4 (x4 gradient accumulation = effective 16) |
| Optimizer | AdamW (weight_decay=0.01) |
| Precision | BFloat16 |
| Gradient Checkpointing | Enabled |
| Hardware | NVIDIA A10G (23GB VRAM) |
| Training Time | 45 minutes |
| Seed | 42 |
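
The step budget and learning-rate schedule implied by these settings can be sketched as below. This is an approximation under stated assumptions: the exact number of optimizer steps depends on how the trainer handles the final partial accumulation batch of each epoch, and `lr_at` is a hypothetical helper that mirrors a standard linear warmup followed by linear decay to zero. It is not taken from the training code.

```python
import math

# Values from the training details table.
TRAIN_SAMPLES = 2120
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
EPOCHS = 3
PEAK_LR = 1e-4
WARMUP_STEPS = 50

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Assumption: a trailing partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - WARMUP_STEPS))


print(f"effective batch: {effective_batch}")
print(f"optimizer steps: ~{steps_per_epoch} per epoch, ~{total_steps} total")
for step in (25, 50, 300):
    print(f"lr at step {step}: {lr_at(step):.2e}")
```

Under these assumptions the run has roughly 400 optimizer steps, which is consistent with the training curve being logged every 50 steps up to step 350.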

Framework Versions

  • Transformers: 4.57.3
  • PEFT: 0.18.1
  • PyTorch: 2.6.0+cu124
  • Datasets: 3.6.0

Dataset

Google FLEURS Hindi (hi_in):

  • Domain: Read speech from Wikipedia sentences
  • Audio: 16 kHz mono; transcripts in Devanagari script
  • License: CC BY 4.0
  • Size: ~3.5 hours across train/val/test

Normalization Notes

Hindi ASR evaluation is sensitive to text normalization. Whisper's default normalizer strips diacritics and simplifies conjunct consonants, which can inflate apparent accuracy at the cost of semantic precision.

The WER numbers above use Whisper-default normalization for comparability with other Hugging Face models. For production Hindi ASR, consider also evaluating with the IndicNLP normalizer.
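
To make the effect concrete, the sketch below contrasts two simplified normalizers on a Hindi word. It assumes that the Whisper-style basic normalizer replaces every character in the Unicode mark, symbol, and punctuation categories with a space. Both functions are stand-ins written for illustration, not the library implementation and not the IndicNLP normalizer.

```python
import re
import unicodedata


def _replace_categories(text, categories):
    """Replace characters whose Unicode major category is listed with spaces."""
    text = unicodedata.normalize("NFKC", text)
    replaced = "".join(
        " " if unicodedata.category(ch)[0] in categories else ch for ch in text
    )
    return re.sub(r"\s+", " ", replaced).strip()


def whisper_style_normalize(text):
    # Assumption: marks (M), symbols (S), and punctuation (P) are all dropped.
    # Devanagari vowel signs and the anusvara are category M, so they vanish.
    return _replace_categories(text.lower(), "MSP")


def mark_preserving_normalize(text):
    # Illustrative alternative: drop only symbols and punctuation.
    return _replace_categories(text, "SP")


sample = "हिंदी।"  # "Hindi" followed by a danda (sentence-final punctuation)
print(repr(whisper_style_normalize(sample)))
print(repr(mark_preserving_normalize(sample)))
```

With mark stripping, the word loses its vowel signs and is split at each removed mark, so reference and hypothesis are compared on bare consonants. The mark-preserving variant keeps the word intact and removes only the danda.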

Limitations

  • Training data scope: Trained on FLEURS read speech (~3.5h). Performance on conversational, noisy, or accented Hindi may vary.
  • Language detection: Fine-tuning on a single language can degrade Whisper's multilingual detection. Set language="hi" explicitly.
  • Code-mixing: Performance on Hindi-English (Hinglish) is not evaluated.
  • Base model biases: Any biases in whisper-large-v3-turbo carry through.

Citation

@misc{dash2026whisper_hindi_lora,
  author = {Pritam Dash},
  title = {Whisper Large-v3-Turbo Hindi LoRA Fine-tune},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Tachyeon/whisper-large-v3-turbo-hindi-lora}
}
