CDLI Parakeet TDT 0.6B English Fine-Tune (Uganda + Kenya)


This repository contains a NeMo ASR model fine-tuned from nvidia/parakeet-tdt-0.6b-v2 on a merged atypical-English training mix from Uganda and Kenya.

The setup mirrors the standalone Ugandan English Parakeet notebook, but extends training to both countries while keeping model-selection validation on Uganda only. Test evaluation is then reported separately for Uganda and Kenya.

Model Details

  • Base model: nvidia/parakeet-tdt-0.6b-v2
  • Fine-tuning framework: NVIDIA NeMo
  • Language: English
  • Acoustic model family: FastConformer-TDT / RNNT-BPE
  • Output text: lower-case English transcription with standard ASR normalization

Datasets

  • Uganda: cdli/ugandan_english_nonstandard_speech_v1.0
  • Kenya: cdli/kenyan_english_nonstandard_speech_v1.0
  • License: cc-by-sa-4.0
  • Audio sampling rate: 16 kHz

Split sizes used in this run:

  • Uganda train: 5175
  • Uganda validation: 638
  • Uganda test: 1016
  • Kenya train: 4374
  • Kenya validation: 542
  • Kenya test: 928
  • Merged train total: 9549 rows, 56.46 hours
  • Validation selection set: Uganda only, 638 rows, 4.33 hours
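The merged training set is the concatenation of the two countries' train splits (5175 + 4374 = 9549 rows). The sketch below assumes each split was exported as a NeMo-style JSONL manifest with `audio_filepath`, `duration`, and `text` fields. It merges the manifests, applies the 0.2 s and 40.0 s length limits listed in the training configuration, and reports rows and hours. The function name and file names are hypothetical; this is not the notebook's actual code.

```python
import json

MIN_SEC = 0.2   # min audio length listed in the training configuration
MAX_SEC = 40.0  # max training audio length listed in the training configuration


def merge_manifests(paths, out_path, min_sec=MIN_SEC, max_sec=MAX_SEC):
    """Concatenate JSONL manifests, keeping clips with min_sec <= duration <= max_sec.

    Returns (rows_kept, total_hours) for the merged manifest.
    """
    kept, seconds = 0, 0.0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue  # tolerate blank lines
                    row = json.loads(line)
                    dur = float(row["duration"])
                    if min_sec <= dur <= max_sec:
                        out.write(json.dumps(row) + "\n")
                        kept += 1
                        seconds += dur
    return kept, seconds / 3600.0
```

For example, `merge_manifests(["ug_train.jsonl", "ke_train.jsonl"], "train_ug_ke.jsonl")` (hypothetical file names) writes the merged manifest and returns its size in rows and hours.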

The underlying CDLI datasets include atypical or non-standard speech and speaker metadata such as severity of speech impairment, disorder type, age, gender, and etiology.

Training Configuration

  • Work root: /jupyter_kernel/parakeet_cdli_en_ug_ke
  • Train mix: Uganda + Kenya
  • Primary validation country: Uganda
  • Max manifest audio length: 40.0 s
  • Max training audio length: 40.0 s
  • Min audio length: 0.2 s
  • Train batch size: 8
  • Eval batch size: 8
  • Gradient accumulation steps: 8
  • Effective train batch size: 64
  • Learning rate: 5e-5
  • Weight decay: 1e-3
  • Warmup steps: 100
  • Scheduler: CosineAnnealing
  • Max steps: 20000
  • Validation interval: 200 steps
  • Early stopping patience: 10
  • Precision: bf16-mixed when supported, otherwise mixed precision fallback
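The derived quantities in this configuration can be checked with a few lines of arithmetic. This sketch assumes a single device (the card does not state the GPU count) and a minimum learning rate of 0 (also not stated). It uses a generic linear-warmup plus cosine-decay formula, so NeMo's `CosineAnnealing` scheduler may differ in off-by-one details. It also treats the validation interval and early-stopping patience as counted in the same step units, which the card does not specify.

```python
import math

TRAIN_BATCH = 8
GRAD_ACCUM = 8
NUM_DEVICES = 1    # assumption: single-GPU run
PEAK_LR = 5e-5
WARMUP_STEPS = 100
MAX_STEPS = 20_000
MIN_LR = 0.0       # assumption: min_lr is not stated in this card
VAL_INTERVAL = 200
PATIENCE = 10

# Utterances contributing to each optimizer update.
effective_batch = TRAIN_BATCH * GRAD_ACCUM * NUM_DEVICES
# Steps without validation improvement before early stopping triggers.
early_stop_window = PATIENCE * VAL_INTERVAL


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the effective batch is 64, training stops after 2,000 steps without Uganda validation improvement, and the learning rate peaks at 5e-5 at step 100.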

Evaluation

Evaluation was run separately on the held-out Uganda and Kenya test splits using both raw transcript comparison and normalized transcript comparison.
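The card does not spell out the exact normalization. This sketch assumes a simple normalizer (Unicode NFKC, lower-casing, punctuation removal, whitespace collapsing) and computes corpus-level WER or CER as total edit operations divided by total reference tokens. Character-level scoring here includes spaces, a common but not universal choice. The helper names are illustrative, not the notebook's code.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalizer: NFKC, lower-case, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes for contractions
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_error_rate(refs, hyps, unit="word", norm=False) -> float:
    """Total edits over total reference tokens across all utterances."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        if norm:
            ref, hyp = normalize(ref), normalize(hyp)
        if unit == "word":
            r, h = ref.split(), hyp.split()
        elif unit == "char":
            r, h = list(ref), list(hyp)
        else:
            raise ValueError(f"unknown unit: {unit}")
        errors += edit_distance(r, h)
        total += len(r)
    return errors / max(total, 1)
```

On `refs=["Hello, world!"]` and `hyps=["hello world"]`, raw WER is 1.0 (both words differ in case or punctuation) while normalized WER is 0.0, which illustrates why the raw and normalized figures below can diverge.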

Uganda Test Set

  • Raw WER: 27.58%
  • Raw CER: 14.17%
  • Normalized WER: 21.33%
  • Normalized CER: 12.81%
  • Average normalized utterance WER (capped at 1.0): 20.77%
  • Average normalized utterance CER (capped at 1.0): 12.89%

Kenya Test Set

  • Raw WER: 26.36%
  • Raw CER: 11.78%
  • Normalized WER: 14.56%
  • Normalized CER: 9.46%
  • Average normalized utterance WER (capped at 1.0): 14.49%
  • Average normalized utterance CER (capped at 1.0): 9.20%
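The "capped at 1.0" averages differ from corpus WER. Reading the label literally, each utterance gets its own WER, values above 1.0 (possible when a hypothesis contains many insertions) are clipped to 1.0, and the clipped scores are averaged. A minimal sketch under that reading; the helper names and the empty-reference convention are assumptions.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def utterance_wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    if not r:
        return 0.0 if not h else 1.0  # assumed convention for empty references
    return edit_distance(r, h) / len(r)


def mean_capped_wer(refs, hyps, cap=1.0) -> float:
    """Average of per-utterance WER, each clipped to `cap`."""
    scores = [min(utterance_wer(r, h), cap) for r, h in zip(refs, hyps)]
    return sum(scores) / len(scores)


def corpus_wer(refs, hyps) -> float:
    """Total word edits over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / max(sum(len(r.split()) for r in refs), 1)


refs = ["yes", "the quick brown fox jumps over"]
hyps = ["no no no", "the quick brown fox jumps over"]
```

On this toy pair, the first utterance has a per-utterance WER of 3.0, so the uncapped mean is 1.5, the capped mean is 0.5, and corpus WER is 3/7. Capping keeps a single insertion-heavy utterance from dominating the average.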

Uganda and Kenya Comparison

The same shared model performs materially better on the Kenyan test split than on the Ugandan test split.

  • Kenya vs Uganda normalized WER: 14.56% vs 21.33%
  • Kenya vs Uganda normalized CER: 9.46% vs 12.81%
  • Absolute normalized WER gap: 6.77 points in favor of Kenya

Relative to the Uganda-only English Parakeet 0.6B run (KasuleTrevor/cdli-parakeet-en-finetune), the Uganda score in this mixed Uganda+Kenya run improved slightly:

  • Uganda-only normalized WER: 21.72%
  • UG+KE model on Uganda normalized WER: 21.33%
  • Absolute Uganda improvement: 0.39 WER points
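The comparison figures follow directly from the normalized WER values reported above. Expressing each gap relative to its baseline as well makes the two differences easier to compare in scale. The variable names are illustrative.

```python
# Normalized WER values (%) copied from this card.
ug_wer, ke_wer = 21.33, 14.56
ug_only_wer = 21.72  # Uganda-only Parakeet 0.6B run


def gaps(baseline: float, new: float):
    """Return (absolute points, relative % reduction) going from baseline to new."""
    absolute = baseline - new
    return absolute, 100.0 * absolute / baseline


country_abs, country_rel = gaps(ug_wer, ke_wer)
ug_abs, ug_rel = gaps(ug_only_wer, ug_wer)
print(f"Kenya vs Uganda: {country_abs:.2f} points ({country_rel:.1f}% relative)")
print(f"Uganda-only to UG+KE on Uganda: {ug_abs:.2f} points ({ug_rel:.1f}% relative)")
```

This prints a 6.77-point (31.7% relative) cross-country gap and a 0.39-point (1.8% relative) Uganda improvement.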

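This suggests that adding Kenyan training data did not hurt Uganda test performance and may have helped slightly. It also appears to support cross-country generalization: Kenya has the lowest error rates in this run, although this card does not report a Kenya score for the Uganda-only model, so the size of any Kenya gain cannot be measured directly.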

Error Analysis Highlights

Per-country grouped analysis files were produced for:

  • severity
  • disorder type
  • etiology
  • speaker

High-level patterns from the notebook outputs:

  • Uganda showed the expected severity gradient, with mild speakers performing best and severe speakers worst.
  • Kenya also showed a severity gradient, but its error rates were lower than Uganda's in all three severity bands.
  • In Kenya, the best-performing disorder bucket was Dysphonia, and the hardest was the combined "Stuttering (Disfluency Disorders), Dysarthria" bucket.
  • In Uganda, Voice disorder had the highest mean WER among the reported disorder groups.

Usage

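```python
from nemo.collections.asr.models import ASRModel

# Download the checkpoint from the Hugging Face Hub and restore it.
model = ASRModel.from_pretrained("KasuleTrevor/cdli-parakeet-en-ug-ke")
# Items are Hypothesis objects (with .text) or plain strings, depending on the NeMo version.
predictions = model.transcribe(["path/to/audio.wav"])
print(predictions[0].text if hasattr(predictions[0], "text") else predictions[0])
```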
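The model was fine-tuned on 16 kHz audio with clips of at most 40 s. Before calling transcribe, a quick pre-flight check can flag inputs that differ from that setup so they can be resampled, downmixed, or segmented first. This sketch assumes PCM WAV input readable by Python's standard `wave` module; the function name is hypothetical.

```python
import wave

EXPECTED_RATE = 16_000  # training audio sampling rate stated in this card
MAX_SECONDS = 40.0      # max training audio length stated in this card


def check_wav(path: str, expected_rate: int = EXPECTED_RATE) -> list:
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate {rate} Hz, expected {expected_rate} Hz")
        if channels != 1:
            problems.append(f"{channels} channels, expected mono")
        duration = w.getnframes() / rate
        if duration > MAX_SECONDS:
            problems.append(f"{duration:.1f} s, longer than the {MAX_SECONDS:.0f} s training maximum")
    return problems
```

A 16 kHz mono file of normal length returns an empty list, while a 44.1 kHz stereo file returns two problems.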

Files

  • EN-PARAKEET-TDT-UG-KE.nemo: exported NeMo checkpoint
  • results/EN-PARAKEET-TDT-UG-KE/: uploaded result tables and breakdown CSVs
      ◦ train_mix_summary_ug_ke.json
      ◦ country_summary_ug_ke.csv
      ◦ test_predictions_scored_uganda.csv
      ◦ test_predictions_scored_kenya.csv
      ◦ severity_breakdown_combined_ug_ke.csv
      ◦ disorder_breakdown_combined_ug_ke.csv
      ◦ etiology_breakdown_combined_ug_ke.csv

Notes

  • Validation for early stopping and checkpoint selection was Uganda-only, even though training used both Uganda and Kenya.
  • Source datasets are gated. Review the dataset terms before requesting access.
  • Result artifacts for both countries were uploaded to the Hub under results/EN-PARAKEET-TDT-UG-KE/.