wav2vec2-base-960h Dhivehi

Dhivehi (Thaana script) automatic speech recognition model, fine-tuned from facebook/wav2vec2-base-960h using CTC.

Test-set results

Held-out test split (26,875 utterances, never seen during training):

Subset	n	WER	CER
Overall	26,875	19.96 %	3.62 %
Synthetic (TTS-generated)	24,394	16.73 %	2.36 %
Studio voice — male	351	16.04 %	2.46 %
Unknown dataset	2,130	64.71 %	24.80 %

Greedy CTC decoding, no language model.
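
The card does not describe the scoring script, so the following is only a minimal sketch of how WER and CER are commonly computed: edit distance pooled over the whole test set and divided by the total reference length. The function names `edit_distance` and `corpus_error_rate` are illustrative and not taken from the author's code, and the example strings are toy data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs, hyps, unit="word"):
    """Pooled error rate: total edits / total reference units.

    unit="word" gives WER, unit="char" gives CER (spaces count as characters).
    Assumption: errors are pooled over the corpus, not averaged per utterance.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return total_edits / total_len


# Toy data, not from the test set.
refs = ["a b c d", "e f"]
hyps = ["a x c d", "e f"]
wer = corpus_error_rate(refs, hyps, unit="word")
cer = corpus_error_rate(refs, hyps, unit="char")
```

With pooled scoring, long utterances weigh more than short ones, so an overall figure need not equal the utterance-count-weighted mean of the subset rows.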

Usage

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

REPO = "alakxender/wav2vec2-base-960h-dhivehi"
processor = Wav2Vec2Processor.from_pretrained(REPO)
model = Wav2Vec2ForCTC.from_pretrained(REPO).eval()

# Load as 16 kHz mono, the format the model was trained on.
audio, _ = librosa.load("sample.wav", sr=16000, mono=True)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

# Greedy CTC decoding: take the most likely token per frame, then let the
# tokenizer collapse repeats and drop blanks.
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
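
To make the last two lines of the snippet concrete, here is a rough pure-Python sketch of what greedy CTC decoding does to the per-frame argmax ids: merge consecutive repeats, drop the blank token, and turn the word delimiter into a space. The toy vocabulary, the `greedy_ctc_decode` name, and the use of `|` as delimiter with `[PAD]` as the blank are assumptions for illustration; the real mapping lives in the repository's tokenizer files.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse a per-frame id sequence into text (greedy CTC decoding)."""
    collapsed = []
    prev = None
    for i in frame_ids:
        # Step 1: merge runs of the same id into one.
        if i != prev:
            collapsed.append(i)
        prev = i
    # Step 2: remove blanks. A blank between two equal ids keeps them distinct.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (Latin letters stand in for Thaana characters).
toy_vocab = {0: "[PAD]", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0]
text = greedy_ctc_decode(frames, toy_vocab, blank_id=0)
```

In this toy run the first two `a` frames merge into one, the blank keeps the next `a` separate, and the result is `aab b`.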

Training

Base model: facebook/wav2vec2-base-960h (95 M params)
CTC head: re-initialised for a 52-token Dhivehi character vocabulary built from training transcripts (Thaana block U+0780–U+07BF + word delimiter + [UNK], [PAD])
Audio: 16 kHz mono, clips 1–30 s
Training data: ~484 K utterances combining synthetic and recorded sources, deduplicated by transcript and split 90/5/5 into train/val/test
Epochs: 10 (full run)
Effective batch: 32 (per-device 8 × gradient accumulation 4)
Optimiser: AdamW, peak LR 3e-4, linear schedule, 10 % warmup, bf16
Feature encoder: frozen (standard for wav2vec2 fine-tuning)
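
The batch and learning-rate settings above can be sketched in plain Python. This is only an illustration of a linear schedule with 10 % warmup and a peak of 3e-4, not the author's training script; the function name `linear_warmup_decay_lr`, the single-device assumption, and the decay to zero at the final step are all assumptions.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=3e-4, warmup_ratio=0.10):
    """Learning rate at an optimiser step: linear ramp up, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: 0 -> peak_lr over the first warmup_steps updates.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: peak_lr -> 0 over the remaining updates.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Effective batch size, assuming a single device.
per_device_batch = 8
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps

# Toy run length, chosen only to make the shape easy to read.
schedule = [linear_warmup_decay_lr(s, total_steps=1000) for s in (0, 50, 100, 550, 1000)]
```

With gradient accumulation, one optimiser step (and therefore one schedule step) happens after every 4 forward/backward passes of 8 clips each.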

Limitations

WER on out-of-distribution audio is far higher: 64.71 % on the "Unknown dataset" subset, against about 16 to 17 % on the synthetic and studio subsets. The model has not seen enough variety of acoustic conditions to generalise beyond its training distribution.
The training data is synthetic-heavy, so the model leans toward TTS-like acoustics. Expect higher WER on novel speakers and recording conditions.
No external language model is used; output is character-level greedy CTC. Rescoring with a Dhivehi KenLM model would likely reduce WER by several absolute percentage points, though this has not been measured here.
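
As a rough illustration of what language-model rescoring means, the sketch below re-ranks an n-best list by adding a weighted LM score and a word-count bonus to each hypothesis's acoustic score. Everything here is hypothetical: the `rescore_nbest` name, the `alpha` and `beta` weights, the English toy hypotheses, and the lookup-table "LM" that stands in for a real n-gram model. Note also that greedy decoding yields a single hypothesis, so an n-best list would first have to come from a beam search.

```python
def rescore_nbest(hypotheses, lm_score, alpha=0.5, beta=1.0):
    """Pick the best hypothesis from (text, acoustic_log_prob) pairs.

    combined = acoustic + alpha * lm_score(text) + beta * number_of_words
    alpha weights the language model; beta offsets its bias toward short outputs.
    """
    def combined(item):
        text, acoustic = item
        return acoustic + alpha * lm_score(text) + beta * len(text.split())

    return max(hypotheses, key=combined)[0]


# Toy n-best list: the acoustically best hypothesis contains a spelling error.
nbest = [("the cat sad", -3.8), ("the cat sat", -4.0)]

# Stand-in for a real LM: a lookup table of made-up sentence log-probabilities.
toy_lm = {"the cat sat": -2.0, "the cat sad": -6.0}

best_without_lm = rescore_nbest(nbest, lambda text: 0.0)
best_with_lm = rescore_nbest(nbest, lambda text: toy_lm[text])
```

In the toy example the LM term flips the ranking toward the more plausible sentence; how much this helps on Dhivehi would depend on the LM and on tuning the two weights on the validation split.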

Intended use

Research and experimentation on Dhivehi ASR. The model is not production-ready for general-purpose transcription without further fine-tuning on the target domain.
