ASR Workspace
Place ASR baselines, training configs, and evaluation scripts here.
Verified Pashto-Relevant ASR Models
OpenAI Whisper Large v3
- Model: huggingface.co/openai/whisper-large-v3
- Pashto validation: the OpenAI Whisper tokenizer language map includes `"ps": "pashto"`.
- Use in this repo: strong baseline and pseudo-labeling engine for bootstrapping.
- Applications: transcription, subtitle generation, dataset pre-labeling.
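A minimal sketch of the pseudo-labeling use above, assuming the `openai-whisper` Python package is installed. The confidence threshold, the record layout, and the helper names (`make_whisper_transcriber`, `keep_confident`, `pseudo_label`) are hypothetical and not part of this repo. The filter only needs the `text` and per-segment `avg_logprob` fields that `model.transcribe` returns, so it can be exercised without loading a model.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical threshold; tune it on a small human-checked sample.
MIN_AVG_LOGPROB = -1.0


def make_whisper_transcriber(model_name: str = "large-v3") -> Callable[[str], dict]:
    """Load the model once and return a function that transcribes one file as Pashto."""
    import whisper  # openai-whisper; imported lazily so the filter runs without it

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path, language="ps")


def keep_confident(result: dict, min_avg_logprob: float = MIN_AVG_LOGPROB) -> bool:
    """Accept a transcript only if it is non-empty and its mean segment log-probability is high enough."""
    segments = result.get("segments", [])
    if not segments or not result.get("text", "").strip():
        return False
    mean_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
    return mean_logprob >= min_avg_logprob


def pseudo_label(paths: Iterable[str], transcribe: Callable[[str], dict]) -> List[Dict[str, str]]:
    """Return {"audio", "text"} records for the files that pass the confidence filter."""
    records = []
    for path in paths:
        result = transcribe(path)
        if keep_confident(result):
            records.append({"audio": path, "text": result["text"].strip()})
    return records
```

With a real model this would be called as `pseudo_label(paths, make_whisper_transcriber())`. Rejected files are simply dropped here; a real pipeline would probably log them for manual review.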
Meta MMS Coverage (ASR + TTS language support)
- Coverage page: MMS language coverage
- Pashto validation: the coverage table includes a `pus` row with ASR and TTS support.
- Use in this repo: multilingual transfer baseline when Pashto data is limited.
- Applications: low-resource ASR transfer experiments.
Verified Inference Tooling
Faster-Whisper
- Repo: github.com/SYSTRAN/faster-whisper
- Why useful: CTranslate2-based reimplementation of Whisper inference, which shortens experiment turnaround.
- Use in this repo: local transcription pipelines and benchmark generation speedups.
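A local transcription pipeline along these lines could be sketched as follows, assuming the `faster-whisper` package is installed. The `device` and `compute_type` values, the JSONL record layout, and both function names are illustrative assumptions, not settings this repo prescribes.

```python
import json
from typing import Iterable, Iterator


def transcribe_segments(audio_path: str, model_size: str = "large-v3"):
    """Yield faster-whisper segments for one file, forcing Pashto decoding."""
    from faster_whisper import WhisperModel  # imported lazily; requires faster-whisper

    # Illustrative settings: int8 on CPU. Pick values that fit your hardware.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="ps", beam_size=5)
    # `segments` is a lazy generator, so decoding happens as it is consumed.
    yield from segments


def segments_to_jsonl(segments: Iterable) -> Iterator[str]:
    """Serialize objects with .start, .end, and .text into one JSON line each."""
    for seg in segments:
        record = {
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "text": seg.text.strip(),
        }
        # ensure_ascii=False keeps Pashto script readable in the output file.
        yield json.dumps(record, ensure_ascii=False)
```

Usage would be roughly `for line in segments_to_jsonl(transcribe_segments("clip.wav")): print(line)`, where `clip.wav` is a placeholder path.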
Integration Hints
- Keep all model/eval runs reproducible with command logs and commit hashes.
- Store evaluation outputs under benchmarks/ with model/version labels.
- Report WER/CER together with the dataset split and the normalization policy used to compute them.
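The hints above can be sketched with the standard library alone. The normalization policy name `nfc-ws-v1`, the file-naming scheme, and the record fields are assumptions made for illustration; the repo may settle on different ones. WER and CER are computed at corpus level as total edit distance divided by total reference length.

```python
import json
import subprocess
import unicodedata
from pathlib import Path

NORMALIZATION_POLICY = "nfc-ws-v1"  # hypothetical label: NFC + whitespace collapse


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit: str) -> float:
    """Corpus-level WER (unit="word") or CER (unit="char", spaces included)."""
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(normalize(ref)), split(normalize(hyp))
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0


def current_commit() -> str:
    """Return the current git commit hash, or "unknown" outside a git checkout."""
    try:
        done = subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True, check=True)
        return done.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def write_result(out_dir, model: str, version: str, split: str, refs, hyps) -> Path:
    """Write one labeled result record as JSON and return its path."""
    record = {
        "model": model,
        "version": version,
        "split": split,
        "normalization": NORMALIZATION_POLICY,
        "commit": current_commit(),
        "wer": error_rate(refs, hyps, "word"),
        "cer": error_rate(refs, hyps, "char"),
    }
    path = Path(out_dir) / f"{model}__{version}__{split}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

A call such as `write_result("benchmarks", "whisper-large-v3", "v1", "test", refs, hyps)` would leave one JSON file per model, version, and split, each carrying the commit hash and normalization label needed to reproduce the number.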