Whisper Large-v3-Turbo Hindi LoRA

A LoRA adapter for openai/whisper-large-v3-turbo, fine-tuned for Hindi (Devanagari) speech recognition.

Results

Model	WER (%)	Eval Set
`openai/whisper-large-v3-turbo` (baseline)	35.56	FLEURS hi_in test (n=418)
+ LoRA fine-tune (this model)	22.25	FLEURS hi_in test (n=418)
+ CTranslate2 INT8 deployment	22.70	FLEURS hi_in test (n=418)

This is a 37.4% relative WER reduction over the baseline. INT8 deployment via faster-whisper costs only 0.45 percentage points of WER (22.25% to 22.70%).
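
Both headline figures follow directly from the table. A quick check of the arithmetic, using only the WER values reported above (the variable names are illustrative):

```python
# WER values (%) copied from the results table above.
baseline_wer = 35.56
lora_wer = 22.25
int8_wer = 22.70

# Relative reduction: share of the baseline error removed by fine-tuning.
relative_reduction = (baseline_wer - lora_wer) / baseline_wer * 100

# Absolute cost of INT8 quantization, in percentage points of WER.
int8_delta = int8_wer - lora_wer

print(f"Relative WER reduction: {relative_reduction:.1f}%")
print(f"INT8 degradation: {int8_delta:.2f} percentage points")
```

The relative figure is measured against the baseline, while the INT8 figure is an absolute difference between two WER values.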

Evaluation uses Whisper-default text normalization. See Normalization Notes below.

Comparison with Other Hindi ASR Models

Model	WER (%)	Method	Training Data
collabora/whisper-large-v2-hindi	5.33	Full fine-tune	Multi-corpus (100h+)
vasista22/whisper-hindi-large-v2	6.80	Full fine-tune	Multi-corpus (100h+)
`openai/whisper-large-v3-turbo`	35.56	Zero-shot	—
This model (LoRA)	22.25	LoRA (3.33% params)	FLEURS only (~3.5h)

Note: The collabora and vasista22 models are full fine-tunes trained on 100+ hours of multi-corpus Hindi data. This model uses only ~3.5 hours of FLEURS data with a lightweight LoRA adapter, which is a fundamentally different trade-off: minimal data and compute in exchange for a significant WER improvement over the zero-shot baseline.

Training Curve

Step	Train Loss	Eval Loss	Eval WER (%)
50	0.263	0.259	29.40
100	0.210	0.234	25.74
150	0.145	0.223	24.49
200	0.148	0.217	23.43
250	0.146	0.213	23.82
300	0.096	0.215	22.42
350	0.109	0.215	22.50

Best checkpoint: step 300 (lowest validation WER, 22.42%). Test WER at this checkpoint: 22.25%.
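
The selection rule can be sketched in a few lines, using the evaluation WER values from the table above. This is only an illustration of "lowest validation WER"; the list name is hypothetical, and whether the training pipeline selected the checkpoint automatically or by inspection is not stated here.

```python
# (step, eval WER %) pairs copied from the training curve above.
eval_wer_by_step = [
    (50, 29.40), (100, 25.74), (150, 24.49), (200, 23.43),
    (250, 23.82), (300, 22.42), (350, 22.50),
]

# Pick the checkpoint with the lowest validation WER.
best_step, best_wer = min(eval_wer_by_step, key=lambda pair: pair[1])
print(f"Best checkpoint: step {best_step} (eval WER {best_wer:.2f}%)")
```

With the Hugging Face Trainer, the same rule is usually expressed through load_best_model_at_end=True, metric_for_best_model="wer", and greater_is_better=False.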

How to Use

With PEFT (LoRA adapter)

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

BASE_MODEL = "openai/whisper-large-v3-turbo"
ADAPTER = "Tachyeon/whisper-large-v3-turbo-hindi-lora"

processor = WhisperProcessor.from_pretrained(BASE_MODEL)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model = model.to("cuda").eval()

# Transcribe. audio_array must be supplied by the caller: a 16 kHz mono float32 NumPy array.
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)

with torch.inference_mode():
    predicted_ids = model.generate(
        input_features, language="hi", task="transcribe"
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

With faster-whisper (merged + CTranslate2)

For production deployment, merge the adapter and convert to CTranslate2:

# Merge LoRA → convert → evaluate
python convert_and_eval.py --lora-dir outputs/whisper-large-v3-turbo-hindi-lora --quant int8 --gpu 0

from faster_whisper import WhisperModel

model = WhisperModel("path/to/ct2-model", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="hi", beam_size=1)
print(" ".join(seg.text.strip() for seg in segments))

Full pipeline code (data prep → training → deployment): github.com/ipritamdash/whisper-hindi-lora
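
Merging is possible because a LoRA adapter is just a low-rank update that can be folded into the base weight: W_merged = W + (alpha / r) * B @ A. The toy example below, with made-up 2x2 matrices and pure Python, shows that the merged weight gives the same output as running the base layer and the adapter side by side. In PEFT this step is typically done with merge_and_unload(); that convert_and_eval.py does it this way is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(base, delta, scale):
    """Return base + scale * delta, element-wise."""
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


# Toy values (hypothetical). alpha / r = 2, the same ratio as 64 / 32 in this model.
r, alpha = 1, 2
scale = alpha / r
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, 2 x 2
B = [[0.5], [0.25]]           # LoRA B, 2 x r
A = [[1.0, 2.0]]              # LoRA A, r x 2
x = [[3.0], [4.0]]            # input column vector

# Unmerged: base output plus the scaled adapter output.
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into the weight once, then apply a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged == merged)
```

After merging, inference needs no PEFT dependency and no extra matmuls, which is what makes conversion to CTranslate2 straightforward.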

LoRA Configuration

Parameter	Value
Rank (r)	32
Alpha	64 (2x rank)
Dropout	0.05
Target Modules	`q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2`
Trainable Parameters	27,852,800 / 836,730,880 (3.33%)
Bias	none

Architecture choice follows LoRA-Whisper (arXiv:2406.06619): encoder+decoder targeting on all linear layers outperforms decoder-only or q/v-only configurations.
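
The trainable parameter count in the table can be reproduced by hand. The sketch below assumes the architecture values from the published whisper-large-v3-turbo config (hidden size 1280, feed-forward size 5120, 32 encoder layers, 4 decoder layers); these are not stated in this card. It counts one rank-32 LoRA pair per targeted linear layer.

```python
# Architecture values assumed from the published whisper-large-v3-turbo config.
D_MODEL = 1280        # hidden size
FFN_DIM = 5120        # feed-forward inner size
ENCODER_LAYERS = 32
DECODER_LAYERS = 4
RANK = 32             # LoRA rank from the table above


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# q_proj, k_proj, v_proj, out_proj in one attention block.
attn_block = 4 * lora_params(D_MODEL, D_MODEL)
# fc1 and fc2 in one feed-forward block.
mlp_block = lora_params(D_MODEL, FFN_DIM) + lora_params(FFN_DIM, D_MODEL)

encoder_layer = attn_block + mlp_block        # self-attention + MLP
decoder_layer = 2 * attn_block + mlp_block    # self-attention + cross-attention + MLP

trainable = ENCODER_LAYERS * encoder_layer + DECODER_LAYERS * decoder_layer
total = 836_730_880  # total parameter count reported in the table

print(f"{trainable:,} trainable ({trainable / total:.2%})")
```

Under these assumptions the count matches the table exactly, and most of the adapter's capacity sits in the 32-layer encoder rather than the 4-layer decoder.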

Training Details

Parameter	Value
Base Model	`openai/whisper-large-v3-turbo` (809M params)
Dataset	google/fleurs `hi_in`
Train / Val / Test	2,120 / 239 / 418 samples
Epochs	3
Learning Rate	1e-4 (linear decay)
Warmup Steps	50
Batch Size	4 (x4 gradient accumulation = effective 16)
Optimizer	AdamW (weight_decay=0.01)
Precision	BFloat16
Gradient Checkpointing	Enabled
Hardware	NVIDIA A10G (23GB VRAM)
Training Time	45 minutes
Seed	42
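
The learning-rate settings above describe a linear warmup followed by a linear decay. The sketch below uses the standard formula for that schedule. The total step count is an assumption (one optimizer step per effective batch, rounded up per epoch); the exact count depends on how the trainer handles the last partial batch.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 50
TRAIN_SAMPLES = 2120
EFFECTIVE_BATCH = 16
EPOCHS = 3

# Assumption: one optimizer step per effective batch, rounding up each epoch.
TOTAL_STEPS = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH) * EPOCHS


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to BASE_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - WARMUP_STEPS)


for step in (0, 25, 50, 300, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Under this assumption training runs for just under 400 steps, which is consistent with the last logged evaluation at step 350.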

Framework Versions

Transformers: 4.57.3
PEFT: 0.18.1
PyTorch: 2.6.0+cu124
Datasets: 3.6.0

Dataset

Google FLEURS Hindi (hi_in):

Domain: Read speech from Wikipedia sentences
Audio: 16kHz mono, Devanagari script
License: CC BY 4.0
Size: ~3.5 hours across train/val/test

Normalization Notes

Hindi ASR evaluation is sensitive to text normalization. Whisper's default normalizer strips diacritics and simplifies conjunct consonants, which can inflate apparent accuracy at the cost of semantic precision: words that differ only in those marks may be scored as identical.

WER numbers above use Whisper-default normalization for comparability with other Hugging Face models. For production Hindi ASR, consider also evaluating with the IndicNLP normalizer.
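
To see why mark removal matters for Devanagari, the sketch below uses a simplified stand-in normalizer that deletes every Unicode combining mark. It is not Whisper's exact normalizer, and the function name is hypothetical; it only illustrates how two different Hindi words can collapse into the same string.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Delete every Unicode combining mark (categories Mn, Mc, Me).

    Simplified stand-in for mark-removing normalization; not Whisper's
    exact normalizer.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(c for c in normalized if not unicodedata.category(c).startswith("M"))


# Two distinct Hindi words that differ only by a vowel sign (matra).
kaam = "\u0915\u093e\u092e"  # काम, "work"
kam = "\u0915\u092e"         # कम, "less"

print(kaam == kam)                            # False: different words
print(strip_marks(kaam) == strip_marks(kam))  # True: identical once marks are removed
```

A hypothesis that confuses such a pair would count as an error under a script-aware normalizer but as a match under mark-stripping normalization.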

Limitations

Training data scope: Trained on FLEURS read speech (~3.5h). Performance on conversational, noisy, or accented Hindi may vary.
Language detection: Fine-tuning on a single language can degrade Whisper's multilingual detection. Set language="hi" explicitly.
Code-mixing: Performance on Hindi-English (Hinglish) is not evaluated.
Base model biases: Any biases in whisper-large-v3-turbo carry through.

Citation

@misc{dash2026whisper_hindi_lora,
  author = {Pritam Dash},
  title = {Whisper Large-v3-Turbo Hindi LoRA Fine-tune},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Tachyeon/whisper-large-v3-turbo-hindi-lora}
}

References

Whisper paper (Radford et al., 2023)
LoRA paper (Hu et al., 2021)
LoRA-Whisper (Song et al., 2024; arXiv:2406.06619)
FLEURS (Conneau et al., 2023)
