wav2vec2-base-960h Dhivehi

Dhivehi (Thaana script) automatic speech recognition model, fine-tuned from facebook/wav2vec2-base-960h using CTC.

Test-set results

Held-out test split (26,875 utterances, never seen during training):

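| Subset                    | n      | WER     | CER     |
|---------------------------|--------|---------|---------|
| Overall                   | 26,875 | 19.96 % | 3.62 %  |
| Synthetic (TTS-generated) | 24,394 | 16.73 % | 2.36 %  |
| Studio voice, male        | 351    | 16.04 % | 2.46 %  |
| Unknown dataset           | 2,130  | 64.71 % | 24.80 % |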

Greedy CTC decoding, no language model.
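
For readers who want to score their own transcripts in the same terms, here is a minimal sketch of corpus-level WER and CER using plain Levenshtein distance. The card does not say which scoring tool or text normalisation produced the figures above, so treat this as the standard definition rather than the exact evaluation script. The function names and the English placeholder strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Total edits divided by total reference units (words or characters)."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return edits / total


# Placeholder data (not from the test set): one substituted word in five.
refs = ["the cat sat", "hello world"]
hyps = ["the cat sit", "hello world"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER {wer:.2%}  CER {cer:.2%}")
```

Because the rate is pooled over all reference words or characters, long utterances weigh more than short ones. This is why the overall figure need not equal an utterance-weighted average of the subsets.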

Usage

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

REPO = "alakxender/wav2vec2-base-960h-dhivehi"
processor = Wav2Vec2Processor.from_pretrained(REPO)
model = Wav2Vec2ForCTC.from_pretrained(REPO).eval()

# Load the clip as 16 kHz mono, the format the model was trained on.
audio, _ = librosa.load("sample.wav", sr=16000, mono=True)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Greedy CTC: take the most likely token per frame, then let the tokenizer collapse the path.
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
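
The `argmax` followed by `batch_decode` above is greedy CTC decoding. As a rough illustration of what the decode step does with the frame-level ids, the sketch below merges consecutive repeats and drops the blank token. The toy vocabulary and the helper name `ctc_greedy_collapse` are hypothetical, not this model's 52-token vocabulary. The sketch assumes the usual wav2vec2 convention that the padding token doubles as the CTC blank and `|` marks word boundaries.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id):
    """Collapse a frame-level argmax path into a label sequence."""
    # groupby merges runs of identical consecutive ids; blanks are then dropped.
    return [k for k, _ in groupby(ids) if k != blank_id]


# Hypothetical toy vocabulary: 0 is the blank/pad token, "|" is the word delimiter.
vocab = {0: "[PAD]", 1: "a", 2: "b", 3: "|"}
frame_ids = [0, 1, 1, 0, 1, 2, 2, 0, 3, 3, 2]

labels = ctc_greedy_collapse(frame_ids, blank_id=0)
text = "".join(vocab[i] for i in labels).replace("|", " ")
print(text)
```

Note that the blank between the two runs of `1` is what lets the same character appear twice in a row in the output.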

Training

  • Base model: facebook/wav2vec2-base-960h (95 M params)
  • CTC head: re-initialised for a 52-token Dhivehi character vocabulary built from training transcripts (Thaana block U+0780–U+07BF + word delimiter + [UNK], [PAD])
  • Audio: 16 kHz mono, clips 1–30 s
  • Training data: ~484 K utterances combining synthetic and recorded sources, 90/5/5 train/val/test split, deduped by transcript
  • Epochs: 10 (full run)
  • Effective batch: 32 (per-device 8 × gradient accumulation 4)
  • Optimiser: AdamW, peak LR 3e-4, linear schedule, 10 % warmup, bf16
  • Feature encoder: frozen (standard for wav2vec2 fine-tuning)
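
The batch and learning-rate settings above can be sketched as follows. This is an illustration under stated assumptions, not the author's training script: it assumes a single device (so that 8 × 4 gives the effective batch of 32), a linear ramp from zero during warmup, and a linear decay to zero afterwards. The helper name and the `total_steps` value are hypothetical.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-4, warmup_ratio=0.10):
    """Learning rate at a given optimiser step: linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from the peak down to 0 at the final step (assumed end value).
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


per_device_batch = 8
grad_accum = 4
effective_batch = per_device_batch * grad_accum  # assumes one device

total_steps = 1000  # hypothetical; the card does not state the step count
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps))
```

With these placeholder numbers the rate peaks at step 100 and returns to half the peak at step 550, midway through the decay phase.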

Limitations

  • WER on out-of-distribution audio is far higher: 64.71 % on the "Unknown dataset" subset, against 16.73 % on synthetic speech. The model has not seen enough variety of acoustic conditions to generalise beyond its training distribution.
  • Synthetic-heavy training distribution means the model leans toward TTS-like acoustics. Expect higher WER on novel speakers and recording conditions.
  • No external language model; output is character-level greedy CTC. Rescoring with a Dhivehi KenLM language model would likely reduce WER by several absolute percentage points, though this has not been measured here.

Intended use

Research and experimentation on Dhivehi ASR. The model is not production-ready for general-purpose transcription without further fine-tuning on the target domain.
