omni-ctc-300m-tajik

A Tajik fine-tune of facebook/omniASR-CTC-300M, trained on about 1,070 hours of Tajik speech: Scribe-labeled YouTube audio plus FLEURS. On a held-out conversational test set it reaches 37.7% word error rate, down from 49.9% for a baseline fine-tuned on 5.8 hours.

Files

  • model.pt: fairseq2 checkpoint (CTC head, 300M parameters)
  • omniASR_tokenizer_written_v2.model: the character tokenizer the checkpoint expects

Training data

Peacockery/tajik-asr-corpus-v3: 183,146 clips, 1,070.8 hours. The corpus combines 41 Tajik YouTube channels (1,059 h, machine-labeled) with FLEURS tg_tj (11.8 h, gold). YouTube labels come from an ElevenLabs Scribe ensemble fused by an LLM, then quality-gated at WER <= 0.35 against an independent Scribe pass. By hours FLEURS is only about 1% of the corpus, so it was upweighted to roughly 12% of sampled batches to keep read speech represented against the much larger conversational portion.
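The FLEURS upweighting can be checked with a little arithmetic. The sketch below assumes batches are drawn in proportion to hours times a per-source weight; the sampler actually used for this model is not described here and may work differently. The function name `upsample_factor` is hypothetical.

```python
# Sketch: weight needed to lift a small source to a target share of sampled batches.
# Assumption: sampling probability is proportional to (hours * weight) per source.

def upsample_factor(minority_hours: float, majority_hours: float, target_share: float) -> float:
    """Return w such that w * minority / (w * minority + majority) == target_share."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    return (target_share / (1.0 - target_share)) * (majority_hours / minority_hours)


fleurs_h, youtube_h = 11.8, 1059.0  # hours quoted in this card
natural_share = fleurs_h / (fleurs_h + youtube_h)
w = upsample_factor(fleurs_h, youtube_h, target_share=0.12)
achieved = w * fleurs_h / (w * fleurs_h + youtube_h)

print(f"natural share:   {natural_share:.2%}")
print(f"upsample factor: {w:.1f}x")
print(f"achieved share:  {achieved:.1%}")
```

Under that assumption the factor comes out near 12x, which means each FLEURS clip would be seen about a dozen times per pass over the YouTube data.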

Recipe

Fresh fine-tune from the base model: 20,000 steps (about one epoch), peak LR 5e-6, tri-stage schedule, bf16, best-WER checkpoint selection on FLEURS dev (the best checkpoint was the final one, step 20,000, with dev WER 17.1%).
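For readers unfamiliar with a tri-stage schedule, here is a minimal sketch of its usual shape: linear warmup, a hold at the peak, then exponential decay. Only the step count and peak LR come from the recipe above. The phase split `(0.1, 0.4, 0.5)` and the `init_scale` and `final_scale` values are illustrative assumptions, not the settings used for this checkpoint.

```python
import math


def tri_stage_lr(step, total_steps=20_000, peak_lr=5e-6,
                 phase_ratio=(0.1, 0.4, 0.5),   # assumed warmup/hold/decay split
                 init_scale=0.01,               # assumed: start at 1% of peak
                 final_scale=0.05):             # assumed: end at 5% of peak
    """Learning rate at `step` for a warmup / hold / exponential-decay schedule."""
    warmup = round(total_steps * phase_ratio[0])
    hold = round(total_steps * phase_ratio[1])
    decay = total_steps - warmup - hold
    init_lr = peak_lr * init_scale
    final_lr = peak_lr * final_scale

    if step < warmup:                      # stage 1: linear ramp to the peak
        return init_lr + (peak_lr - init_lr) * step / warmup
    if step < warmup + hold:               # stage 2: constant at the peak
        return peak_lr
    if step < total_steps:                 # stage 3: exponential decay
        progress = (step - warmup - hold) / decay
        return peak_lr * math.exp(math.log(final_scale) * progress)
    return final_lr                        # after the schedule ends


for s in (0, 1_000, 2_000, 10_000, 15_000, 20_000):
    print(s, f"{tri_stage_lr(s):.3e}")
```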

Benchmarks

FLEURS tg_tj test, 599 clips, corpus-level jiwer after shared normalization:

model                                     WER (%)   CER (%)
omniASR-CTC-300M (no fine-tune)             19.74      5.62
5.8 h fine-tune (FLEURS + Common Voice)     17.34      4.88
this model                                  17.17      4.90
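Corpus-level scoring pools edits and reference words over all clips before dividing, which is not the same as averaging per-clip WER. The pure-Python sketch below illustrates the difference. It is not the evaluation script behind the table, and the `normalize` function is a hypothetical stand-in for the shared normalization, whose exact rules are not listed here.

```python
import re
import unicodedata


def normalize(text):
    """Hypothetical normalization: NFC, lowercase, punctuation to spaces, split on whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"[^\w\s]", " ", text).split()


def word_edits(ref, hyp):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def corpus_wer(refs, hyps):
    """Total edits divided by total reference words across all pairs."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref), normalize(hyp)
        errors += word_edits(r, h)
        words += len(r)
    return errors / words


refs = ["Салом, ҷаҳон!", "Ман китоб мехонам имрӯз."]
hyps = ["салом ҷахон", "ман китоб мехонам имрӯз"]

pooled = corpus_wer(refs, hyps)
per_clip = [corpus_wer([r], [h]) for r, h in zip(refs, hyps)]
print(f"corpus-level WER: {pooled:.4f}")                          # 1 edit / 6 words
print(f"mean per-clip WER: {sum(per_clip) / len(per_clip):.4f}")  # (0.5 + 0.0) / 2
```

Pooling keeps short clips from dominating the score: the single error in the two-word clip counts as one edit out of six words rather than as a 50% clip.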

Held-out conversational YouTube test (1,625 clips from 157 whole videos excluded from training; machine-labeled references):

model                                     WER (%)   CER (%)
5.8 h fine-tune                             49.89     18.88
this model                                  37.65     14.04

The conversational references are themselves Scribe-derived, so that table measures agreement with the labeling pipeline rather than human ground truth. The FLEURS table uses gold references.

Usage

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Register model.pt + the tokenizer as a fairseq2 asset card, then:
pipe = ASRInferencePipeline("your_card_name", device="cuda")
audio = "clip.wav"  # placeholder path to a 16 kHz mono clip of at most 40 s
texts = pipe.transcribe([audio], lang=["tgk_Cyrl"], batch_size=8)

The checkpoint loads through fairseq2 0.8.x with the omnilingual-asr inference stack. Audio must be 16 kHz mono; clips above 40 s exceed the pipeline cap.
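A quick pre-flight check can catch clips that violate these constraints before they reach the pipeline. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The helper name `check_wav` is hypothetical and is not part of omnilingual-asr.

```python
import wave

TARGET_RATE = 16_000   # Hz, as required above
MAX_SECONDS = 40.0     # pipeline cap quoted above


def check_wav(path):
    """Return a list of problems for a PCM WAV file; an empty list means it passes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if seconds > MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s exceeds {MAX_SECONDS:.0f} s")
    return problems
```

Calling `check_wav("clip.wav")` on a file that needs resampling, downmixing, or splitting returns one message per issue. Fixing those issues is left to whatever audio tooling you already use.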
