wav2vec2-base-960h Dhivehi
Dhivehi (Thaana script) automatic speech recognition model, fine-tuned from facebook/wav2vec2-base-960h using CTC.
Test-set results
Held-out test split (26,875 utterances, never seen during training):
| Subset | n | WER | CER |
|---|---|---|---|
| Overall | 26,875 | 19.96 % | 3.62 % |
| Synthetic (TTS-generated) | 24,394 | 16.73 % | 2.36 % |
| Studio voice (male) | 351 | 16.04 % | 2.46 % |
| Unknown dataset | 2,130 | 64.71 % | 24.80 % |
All figures use greedy CTC decoding with no language model.
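For readers who want to reproduce numbers like those in the table, the following is a minimal sketch of how WER and CER are conventionally computed: total edit distance over all utterances divided by total reference length, at word or character level. The model card does not say which tool produced its figures, so the function names, the corpus-level aggregation, and the choice to count spaces in CER are all assumptions. The strings are Latin placeholders, not Dhivehi.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits / total reference units.

    unit="word" gives WER, unit="char" gives CER (spaces counted; an assumption).
    """
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return edits / total


# Placeholder transcripts, for illustration only.
refs = ["a b c", "d e"]
hyps = ["a x c", "d"]
print(error_rate(refs, hyps, unit="word"))  # 0.4   (2 word edits / 5 reference words)
print(error_rate(refs, hyps, unit="char"))  # 0.375 (3 char edits / 8 reference chars)
```

Because the rate is aggregated over words rather than utterances, an overall WER is not the utterance-weighted mean of the per-subset WERs.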
Usage
```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

REPO = "alakxender/wav2vec2-base-960h-dhivehi"
processor = Wav2Vec2Processor.from_pretrained(REPO)
model = Wav2Vec2ForCTC.from_pretrained(REPO).eval()

# Load the clip as 16 kHz mono, matching the training audio format.
audio, _ = librosa.load("sample.wav", sr=16000, mono=True)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

# Forward pass without gradients, then greedy CTC decoding (per-frame argmax).
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
```
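To make the final decoding step less opaque, here is a small pure-Python sketch of what greedy CTC decoding does with the per-frame argmax ids: merge consecutive repeats, drop the blank token, and turn the word delimiter into a space. It is an illustration of the technique, not the tokenizer's actual code. The four-token vocabulary and its ids are hypothetical (the real model has 52 tokens), and it assumes the usual wav2vec2 convention that `[PAD]` doubles as the CTC blank and `|` is the word delimiter.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the greedy CTC rule."""
    # 1. Merge runs of identical consecutive ids into a single id.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; a blank between two equal ids is what keeps a doubled letter.
    tokens = [id_to_token[i] for i in merged if i != blank_id]
    # 3. Map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {0: "[PAD]", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0, 0]
print(ctc_greedy_collapse(frames, toy_vocab, blank_id=0))  # aab b
```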
Training
- Base model: facebook/wav2vec2-base-960h (95 M params)
- CTC head: re-initialised for a 52-token Dhivehi character vocabulary built from training transcripts (Thaana block U+0780–U+07BF + word delimiter + [UNK], [PAD])
- Audio: 16 kHz mono, clips 1–30 s
- Training data: ~484 K utterances combining synthetic and recorded sources, 90/5/5 train/val/test split, deduped by transcript
- Epochs: 10 (full run)
- Effective batch: 32 (per-device 8 × gradient accumulation 4)
- Optimiser: AdamW, peak LR 3e-4, linear schedule, 10 % warmup, bf16
- Feature encoder: frozen (standard for wav2vec2 fine-tuning)
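As a rough worked example of the batch and schedule settings listed above, the sketch below computes the effective batch size, an approximate optimiser step count, and the learning rate at a given step under linear warmup followed by linear decay. Several things here are assumptions rather than facts from the training run: a single device, ~484 K being the size of the training split (which would be consistent with 26,875 test utterances making up 5 % of the total), warmup measured as a fraction of total steps, and decay to zero. The function name is hypothetical.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-4, warmup_pct=10):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = total_steps * warmup_pct // 100
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


per_device_batch, grad_accum = 8, 4
effective_batch = per_device_batch * grad_accum  # 32, assuming one device

train_utterances = 484_000   # approximate; assumed to be the train split
epochs = 10
steps_per_epoch = train_utterances // effective_batch
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
for step in (0, total_steps // 20, total_steps // 10, total_steps // 2, total_steps):
    print(step, linear_warmup_decay(step, total_steps))
```

Under these assumptions the run would take roughly 151 K optimiser steps, with the peak rate reached after about 15 K of them.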
Limitations
- WER on out-of-distribution audio is much higher: about 65 % on the "Unknown dataset" subset of the test set, against about 17 % on synthetic audio. The model has not seen enough variety of acoustic conditions to generalise beyond its training distribution.
- Synthetic-heavy training distribution means the model leans toward TTS-like acoustics. Expect higher WER on novel speakers and recording conditions.
- No external language model is used; output is character-level greedy CTC. Rescoring with a Dhivehi KenLM is expected to reduce WER by several absolute percentage points.
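The language-model point can be illustrated with a toy n-best rescoring sketch: each candidate transcript gets a combined score of acoustic log-probability, a weighted language-model log-probability, and a word-count bonus, and the best-scoring candidate wins. Everything here is hypothetical: the `rescore` function, the `alpha` and `beta` weights, the unigram scores, and the English placeholder hypotheses. In practice a decoder library such as pyctcdecode paired with a KenLM n-gram model is a common way to do this during beam search rather than after it.

```python
def rescore(nbest, lm_logprob, alpha=0.5, beta=1.0):
    """Return the hypothesis maximising acoustic + alpha * LM + beta * word count.

    nbest: list of (text, acoustic_logprob) pairs from a beam search.
    lm_logprob: callable mapping text to a language-model log-probability.
    """
    def score(item):
        text, acoustic_logprob = item
        return acoustic_logprob + alpha * lm_logprob(text) + beta * len(text.split())
    return max(nbest, key=score)[0]


# Toy unigram "language model" with made-up log-probabilities.
UNIGRAM = {"the": -1.0, "cat": -2.0, "sat": -2.0, "sad": -6.0}

def toy_lm(text):
    return sum(UNIGRAM.get(word, -10.0) for word in text.split())


# The acoustically preferred hypothesis is the less plausible sentence.
nbest = [("the cat sat", -4.0), ("the cat sad", -3.8)]
print(rescore(nbest, toy_lm, alpha=0.0))  # the cat sad  (acoustic score only)
print(rescore(nbest, toy_lm, alpha=0.5))  # the cat sat  (LM overrides the acoustic preference)
```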
Intended use
Research and experimentation on Dhivehi ASR. Not a production-ready model for general-purpose transcription without further fine-tuning on the target domain.