omni-ctc-300m-tajik
A Tajik fine-tune of facebook/omniASR-CTC-300M, trained on about 1,070 hours of Tajik speech: Scribe-labeled YouTube audio plus FLEURS. On a held-out conversational test set it reaches 37.7% word error rate, down from 49.9% for a baseline fine-tuned on 5.8 hours.
Files
- model.pt: fairseq2 checkpoint (CTC head, 300M parameters)
- omniASR_tokenizer_written_v2.model: the character tokenizer the checkpoint expects
Training data
Peacockery/tajik-asr-corpus-v3: 183,146 clips, 1,070.8 hours. It combines 41 Tajik YouTube channels (1,059 h, machine-labeled) with FLEURS tg_tj (11.8 h, gold). YouTube labels come from an ElevenLabs Scribe ensemble fused by an LLM, then quality-gated at WER <= 0.35 against an independent Scribe pass. FLEURS was weighted to roughly 12% of sampled batches so that read speech stays represented alongside the much larger conversational portion.
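The gate and the source weighting can be sketched in a few lines of Python. This is an illustration only, not the pipeline that built the corpus: the `wer_vs_check` field, the clip records, and the `pick_source` helper are hypothetical, and the real sampler may work differently. For scale, FLEURS is about 1.1% of the corpus by hours (11.8 of 1,070.8), so a 12% batch share is a heavy oversampling.

```python
import random

WER_GATE = 0.35       # threshold quoted above
FLEURS_SHARE = 0.12   # approximate share of sampled batches quoted above


def passes_gate(clip: dict) -> bool:
    """Keep a clip if its fused label is close enough to the check pass.

    `wer_vs_check` is a hypothetical field holding a precomputed WER
    between the fused label and the independent transcription.
    """
    return clip["wer_vs_check"] <= WER_GATE


def pick_source(rng: random.Random) -> str:
    """Choose the source of the next batch (assumed per-batch sampling)."""
    return "fleurs" if rng.random() < FLEURS_SHARE else "youtube"


# Made-up records, for illustration only.
clips = [
    {"id": "a", "wer_vs_check": 0.10},
    {"id": "b", "wer_vs_check": 0.35},
    {"id": "c", "wer_vs_check": 0.52},
]
kept = [clip["id"] for clip in clips if passes_gate(clip)]

rng = random.Random(0)
draws = [pick_source(rng) for _ in range(10_000)]
fleurs_fraction = draws.count("fleurs") / len(draws)
```

Here `kept` holds the clips at or under the gate, and `fleurs_fraction` lands near 0.12 over many draws.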
Recipe
Fresh fine-tune from the base model: 20,000 steps (about one epoch), peak LR 5e-6, tri-stage schedule, bf16, best-WER checkpoint selection on FLEURS dev (best at step 20,000, dev WER 17.1).
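For readers unfamiliar with tri-stage schedules, here is a minimal sketch of the usual shape: linear warmup, a hold at the peak, then exponential decay. Only the 20,000 steps and the 5e-6 peak come from the recipe above. The stage lengths and the start and end scales are assumptions chosen for illustration, and the exact formula used by the trainer may differ.

```python
def tri_stage_lr(
    step: int,
    peak_lr: float = 5e-6,      # from the recipe above
    total_steps: int = 20_000,  # from the recipe above
    warmup_steps: int = 2_000,  # assumption
    hold_steps: int = 8_000,    # assumption
    init_scale: float = 0.01,   # assumption: start at 1% of the peak
    final_scale: float = 0.05,  # assumption: end at 5% of the peak
) -> float:
    """Learning rate at `step` under a tri-stage schedule."""
    decay_steps = total_steps - warmup_steps - hold_steps
    init_lr = peak_lr * init_scale

    if step < warmup_steps:
        # Stage 1: linear warmup from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Stage 2: hold at the peak.
        return peak_lr
    # Stage 3: exponential decay from peak_lr down to peak_lr * final_scale.
    progress = min(step - warmup_steps - hold_steps, decay_steps) / decay_steps
    return peak_lr * final_scale ** progress


schedule = [tri_stage_lr(s) for s in (0, 1_000, 2_000, 10_000, 20_000)]
```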
Benchmarks
FLEURS tg_tj test, 599 clips, corpus-level jiwer after shared normalization:
| model | WER | CER |
|---|---|---|
| omniASR-CTC-300M (no fine-tune) | 19.74 | 5.62 |
| 5.8 h fine-tune (FLEURS + Common Voice) | 17.34 | 4.88 |
| this model | 17.17 | 4.90 |
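"Corpus-level" means word errors are summed over all clips and divided by the total number of reference words, instead of averaging per-clip rates. The sketch below shows that computation in plain Python. The numbers in the table were produced with jiwer, and the normalization here (lowercasing and dropping punctuation) is only an assumed stand-in for the shared normalization, which is not specified in this card.

```python
import re


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase, drop punctuation, split on spaces."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        ref_words, hyp_words = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return errors / words


# Toy example, not taken from the test set.
refs = ["Салом, дунё!", "Ман китоб мехонам."]
hyps = ["салом дунё", "ман китоб"]
wer = corpus_wer(refs, hyps)  # 1 error over 5 reference words
```

In the toy example the corpus-level rate is 1/5, whereas averaging the per-clip rates (0 and 1/3) would give 1/6, which is why the aggregation method is worth stating next to the numbers.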
Held-out conversational YouTube test (1,625 clips from 157 whole videos excluded from training; machine-labeled references):
| model | WER | CER |
|---|---|---|
| 5.8 h fine-tune | 49.89 | 18.88 |
| this model | 37.65 | 14.04 |
The conversational references are themselves Scribe-derived, so that table measures agreement with the labeling pipeline rather than human ground truth. The FLEURS table uses gold references.
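To put the two tables on a common footing, the relative WER reduction can be computed directly from the reported numbers. The helper name below is illustrative; the inputs are the table values.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline error removed by the new model."""
    return (baseline - new) / baseline


# Conversational YouTube test: 5.8 h fine-tune vs. this model.
conversational = relative_reduction(49.89, 37.65)  # about 0.245

# FLEURS test: 5.8 h fine-tune vs. this model.
fleurs = relative_reduction(17.34, 17.17)          # about 0.010
```

On these figures the gain is concentrated on conversational speech (roughly a quarter of the baseline WER), while FLEURS WER is nearly unchanged.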
Usage
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Register model.pt + the tokenizer as a fairseq2 asset card, then:
pipe = ASRInferencePipeline("your_card_name", device="cuda")
# `audio` is one input clip (16 kHz mono, at most 40 s), e.g. a path to an audio file
texts = pipe.transcribe([audio], lang=["tgk_Cyrl"], batch_size=8)
The checkpoint loads through fairseq2 0.8.x with the omnilingual-asr inference stack. Audio must be 16 kHz mono; clips above 40 s exceed the pipeline cap.
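The input constraints above can be checked before calling the pipeline. The following is a minimal sketch using only the Python standard library, so it handles uncompressed WAV files only. The helper names are hypothetical, and the demo writes silent files to a temporary directory purely to exercise the check.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # Hz, required sample rate
MAX_SECONDS = 40.0    # pipeline cap on clip length


def clip_problems(path: str) -> list[str]:
    """Return reasons a WAV file violates the constraints (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE}")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if seconds > MAX_SECONDS:
        problems.append(f"{seconds:.1f} s long, cap is {MAX_SECONDS:.0f} s")
    return problems


def write_silence(path: str, seconds: float, rate: int, channels: int) -> None:
    """Write a silent 16-bit PCM WAV file (demo helper)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.wav")
    bad_path = os.path.join(tmp, "bad.wav")
    write_silence(good_path, seconds=5, rate=16_000, channels=1)
    write_silence(bad_path, seconds=41, rate=8_000, channels=2)
    good_problems = clip_problems(good_path)
    bad_problems = clip_problems(bad_path)
```

Files that fail the check would need resampling, downmixing, or splitting into shorter segments first; how best to do that depends on the audio tooling available and is left out here.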