# Common Voice Scripted Speech 24.0 - Pashto Integration Guide

This project recognizes Mozilla Common Voice as a major source for Pashto ASR progress and community participation.

## Dataset

- Name: `Common Voice Scripted Speech 24.0 - Pashto`
- Dataset page: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
- Release date: `2025-12-05`
- Format: `MP3` with TSV metadata
- Approximate size: `49.98 GB`
- License: `CC0-1.0`

## Important Usage Rules

- Do not attempt to identify speakers.
- Do not re-host or re-share the raw dataset files.
- Keep provenance and version information when reporting experiments.

## How To Use In This Repository

1. Download from the official Mozilla Data Collective page.
2. Extract locally under: `data/raw/common_voice_scripted_ps_v24/`
3. Keep raw audio out of git.
4. Use project scripts/docs for normalization, splits, and benchmarking.

Recommended local structure:

```text
data/raw/common_voice_scripted_ps_v24/
  clips/
  train.tsv
  dev.tsv
  test.tsv
```

## How To Contribute Through Mozilla Common Voice

Contributors can directly improve Pashto resources on Common Voice:

- Speak: `https://commonvoice.mozilla.org/ps/speak`
- Write: `https://commonvoice.mozilla.org/ps/write`
- Listen: `https://commonvoice.mozilla.org/ps/listen`
- Review: `https://commonvoice.mozilla.org/ps/review`

## Contribution Loop Back To This Project

After contributing on Common Voice, open an issue/PR here and share:

- what task you worked on (speak/write/listen/review),
- what quality gaps you observed,
- what dataset or modeling step should be improved next.

Use issue labels:

- `data`
- `good first issue`
- `help wanted`
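
## Example: Checking The Local Layout

The recommended local structure from "How To Use In This Repository" can be sanity-checked after extraction with a small script. The following is a minimal sketch, not part of this repository's tooling: the function names `missing_entries` and `split_row_counts` are hypothetical, and it assumes each TSV file starts with a header row and is UTF-8 encoded. It only confirms that the expected entries exist and counts metadata rows per split; it does not read or copy any audio.

```python
import csv
from pathlib import Path

# Path and entry names come from the recommended layout in this guide.
DATASET_ROOT = Path("data/raw/common_voice_scripted_ps_v24")
EXPECTED_ENTRIES = ("clips", "train.tsv", "dev.tsv", "test.tsv")


def missing_entries(root: Path) -> list[str]:
    """Return the expected files or directories that are absent under root."""
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def count_rows(tsv_path: Path) -> int:
    """Count non-empty data rows in a TSV file (assumes one header row)."""
    with tsv_path.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in sentences as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)


def split_row_counts(root: Path) -> dict[str, int]:
    """Map each split file name (train/dev/test) to its data row count."""
    return {
        name: count_rows(root / name)
        for name in EXPECTED_ENTRIES
        if name.endswith(".tsv")
    }


if __name__ == "__main__":
    missing = missing_entries(DATASET_ROOT)
    if missing:
        print(f"Missing under {DATASET_ROOT}: {', '.join(missing)}")
    else:
        for split, rows in split_row_counts(DATASET_ROOT).items():
            print(f"{split}: {rows} rows")
```

Run it from the repository root so the relative path resolves. Recording the row counts alongside the dataset version (`24.0`) is one simple way to keep provenance information when reporting experiments.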