omni-ctc-300m-tajik

A Tajik fine-tune of facebook/omniASR-CTC-300M, trained on 1,070 hours of Scribe-labeled Tajik YouTube speech plus FLEURS. On a held-out conversational test set, it cuts word error rate (WER) from 49.9% for a 5.8-hour baseline fine-tune to 37.7%.

Files

model.pt: fairseq2 checkpoint (CTC head, 300M parameters)
omniASR_tokenizer_written_v2.model: the character tokenizer the checkpoint expects

Training data

Peacockery/tajik-asr-corpus-v3: 183,146 clips, 1,070.8 hours. It combines 41 Tajik YouTube channels (1,059 h, machine-labeled) with FLEURS tg_tj (11.8 h, gold). YouTube labels come from an ElevenLabs Scribe ensemble fused by an LLM, then quality-gated at WER <= 0.35 (35%) against an independent Scribe pass. FLEURS was weighted to roughly 12% of sampled batches so that read speech stays represented next to the far larger conversational portion.
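
The gate and the FLEURS weighting above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, treating the independent Scribe pass as the reference side of the WER is an assumption, and the oversampling figure assumes that unweighted sampling would follow each subset's share of hours.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER of one hypothesis string against one reference string."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference is empty")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


def passes_gate(fused_label, independent_pass, max_wer=0.35):
    # Assumption: the independent pass plays the role of the reference.
    return wer(independent_pass, fused_label) <= max_wer


def oversampling_factor(subset_hours, total_hours, target_share):
    """How much a subset is upweighted relative to its share of hours."""
    natural_share = subset_hours / total_hours
    return target_share / natural_share


# FLEURS is 11.8 h of 1,070.8 h but about 12% of sampled batches.
print(f"{oversampling_factor(11.8, 1070.8, 0.12):.1f}x")
```

Under those assumptions FLEURS is sampled roughly eleven times more often than its share of hours alone would give.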

Recipe

Fresh fine-tune from the base model: 20,000 steps (about one epoch), peak LR 5e-6, tri-stage schedule, bf16. Checkpoints were selected by best WER on FLEURS dev; the best was the final one, at step 20,000, with dev WER 17.1%.
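
For readers unfamiliar with the tri-stage schedule, here is a rough sketch of its usual shape: linear warmup, a hold at the peak, then exponential decay. Only the 20,000 steps and the 5e-6 peak come from the recipe; the 10/40/50 stage split and the start and final scales are placeholder assumptions, and this is not the fairseq2 implementation.

```python
def tri_stage_lr(step, total_steps=20_000, peak_lr=5e-6,
                 stage_ratio=(0.1, 0.4, 0.5),  # assumed split, not from the recipe
                 start_scale=0.01,             # assumed: warmup starts at 1% of peak
                 final_scale=0.05):            # assumed: decay ends at 5% of peak
    """Learning rate at `step` for a warmup / hold / decay schedule."""
    warmup_steps = round(total_steps * stage_ratio[0])
    hold_steps = round(total_steps * stage_ratio[1])
    decay_steps = total_steps - warmup_steps - hold_steps

    if step < warmup_steps:
        # Stage 1: linear ramp from start_scale * peak up to peak.
        start_lr = peak_lr * start_scale
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Stage 2: hold at the peak.
        return peak_lr
    # Stage 3: exponential decay from peak down to final_scale * peak.
    progress = min(step - warmup_steps - hold_steps, decay_steps) / decay_steps
    return peak_lr * final_scale ** progress


for step in (0, 1_000, 2_000, 10_000, 15_000, 20_000):
    print(step, f"{tri_stage_lr(step):.2e}")
```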

Benchmarks

FLEURS tg_tj test, 599 clips, corpus-level jiwer after shared normalization:

model	WER (%)	CER (%)
omniASR-CTC-300M (no fine-tune)	19.74	5.62
5.8 h fine-tune (FLEURS + Common Voice)	17.34	4.88
this model	17.17	4.90
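
"Corpus-level" matters for reading these numbers: errors and reference words are pooled over all clips before dividing, so long clips weigh more than short ones. The sketch below contrasts that with a per-clip average. The function names and the toy counts are made up for illustration, and the edit counts are taken as given rather than computed.

```python
def corpus_wer(counts):
    """Pooled WER. `counts` holds (word_errors, reference_words) per clip."""
    counts = list(counts)
    total_errors = sum(errors for errors, _ in counts)
    total_words = sum(words for _, words in counts)
    return total_errors / total_words


def mean_clip_wer(counts):
    """Unweighted average of per-clip WER, shown only for contrast."""
    counts = list(counts)
    return sum(errors / words for errors, words in counts) / len(counts)


# Toy data: a 2-word clip with 1 error and a 10-word clip with 1 error.
clips = [(1, 2), (1, 10)]
print(f"corpus-level:  {corpus_wer(clips):.3f}")
print(f"mean per-clip: {mean_clip_wer(clips):.3f}")
```

On the toy data the pooled figure is 0.167 while the per-clip mean is 0.300, which is why the two should not be compared directly.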

Held-out conversational YouTube test (1,625 clips from 157 whole videos excluded from training; machine-labeled references):

model	WER (%)	CER (%)
5.8 h fine-tune	49.89	18.88
this model	37.65	14.04

The conversational references are themselves Scribe-derived, so that table measures agreement with the labeling pipeline rather than human ground truth. The FLEURS table uses gold references.
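
To put the two tables on a common footing, the gaps can be expressed as absolute points and as relative reductions. The snippet only re-derives figures from the tables above; the helper name is arbitrary.

```python
def relative_reduction(baseline, new):
    """Fraction of the baseline error rate removed by the new model."""
    return (baseline - new) / baseline


# (5.8 h fine-tune, this model), in percent, copied from the tables above.
results = {
    "conversational WER": (49.89, 37.65),
    "conversational CER": (18.88, 14.04),
    "FLEURS WER": (17.34, 17.17),
}

for name, (baseline, new) in results.items():
    absolute = baseline - new
    relative = 100 * relative_reduction(baseline, new)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

The conversational gain is about a quarter of the baseline error in relative terms, while the FLEURS WER gain over the 5.8 h fine-tune is about 1%. The caveat on machine-labeled references applies to the conversational rows.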

Usage

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Register model.pt + the tokenizer as a fairseq2 asset card, then pass the card's name here.
pipe = ASRInferencePipeline("your_card_name", device="cuda")

audio = "path/to/clip.wav"  # placeholder path: a 16 kHz mono clip of at most 40 s
texts = pipe.transcribe([audio], lang=["tgk_Cyrl"], batch_size=8)

The checkpoint loads through fairseq2 0.8.x with the omnilingual-asr inference stack. Audio must be 16 kHz mono, and clips longer than 40 s exceed the pipeline's length cap.
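
A quick pre-flight check can catch clips that break these constraints before they reach the pipeline. The sketch below uses only the standard-library wave module, so it covers uncompressed PCM WAV files only; the function names are hypothetical and the silent clips exist purely for the demo.

```python
import io
import wave

MAX_SECONDS = 40.0       # pipeline cap stated above
TARGET_RATE = 16_000     # Hz
TARGET_CHANNELS = 1      # mono


def wav_problems(source):
    """Return reasons a PCM WAV would be rejected (empty list if none).

    `source` is a path string or a binary file object, as accepted by wave.open.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE}")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if seconds > MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s exceeds {MAX_SECONDS:.0f} s")
    return problems


def silent_wav(seconds, rate=TARGET_RATE, channels=TARGET_CHANNELS):
    """Build an in-memory 16-bit silent WAV, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    buffer.seek(0)
    return buffer


print(wav_problems(silent_wav(5)))                         # compliant clip
print(wav_problems(silent_wav(41)))                        # too long
print(wav_problems(silent_wav(5, rate=8_000, channels=2)))  # wrong rate and channels
```

Clips that fail the check would need resampling, downmixing, or splitting before transcription; how to segment long recordings is left open here.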
