# Common Voice Scripted Speech 24.0 - Pashto Integration Guide

This project recognizes Mozilla Common Voice as a major source for Pashto ASR progress and community participation.

## Dataset

- Name: Common Voice Scripted Speech 24.0 - Pashto
- Dataset page: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
- Release date: `2025-12-05`
- Format: `MP3` with TSV metadata
- Approximate size: `49.98 GB`
- License: `CC0-1.0`

## Important Usage Rules

- Do not attempt to identify speakers.
- Do not re-host or re-share the raw dataset files.
- Keep provenance and version information when reporting experiments.

## How To Use In This Repository

1. Download from the official Mozilla Data Collective page.
2. Extract locally under: `data/raw/common_voice_scripted_ps_v24/`
3. Keep raw audio out of git.
4. Use project scripts/docs for normalization, splits, and benchmarking.

Recommended local structure:

```text
data/raw/common_voice_scripted_ps_v24/
  clips/
  train.tsv
  dev.tsv
  test.tsv
```

## How To Contribute Through Mozilla Common Voice

Contributors can directly improve Pashto resources on Common Voice:

- Speak: [commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
- Write: [commonvoice.mozilla.org/ps/write](https://commonvoice.mozilla.org/ps/write)
- Listen: [commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
- Review: [commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)

## Contribution Loop Back To This Project

After contributing on Common Voice, open an issue/PR here and share:

- what task you worked on (speak/write/listen/review),
- what quality gaps you observed,
- what dataset or modeling step should be improved next.

Use issue labels:

- `data`
- `good first issue`
- `help wanted`
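
## Example: Reading A Split Locally

The local layout described under "How To Use In This Repository" can be read with a few lines of standard-library Python. The sketch below is illustrative only and is not one of this project's scripts. The function names `load_split` and `missing_clips` are hypothetical, and the column names `path` and `sentence` are assumptions about the TSV header that should be checked against the downloaded files. The `DATASET_INFO` dictionary simply repeats the version details from the Dataset section so they can travel with experiment reports.

```python
import csv
from pathlib import Path

# Local layout recommended in this guide (raw audio stays out of git).
DATASET_ROOT = Path("data/raw/common_voice_scripted_ps_v24")

# Provenance fields copied from the Dataset section, for experiment reports.
DATASET_INFO = {
    "name": "Common Voice Scripted Speech 24.0 - Pashto",
    "release_date": "2025-12-05",
    "license": "CC0-1.0",
}


def load_split(root, split, path_col="path", text_col="sentence"):
    """Return (clip_path, sentence) pairs for one split TSV.

    path_col and text_col are assumptions about the TSV header;
    adjust them if the release uses different column names.
    """
    root = Path(root)
    tsv_path = root / f"{split}.tsv"
    pairs = []
    with tsv_path.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            pairs.append((root / "clips" / row[path_col], row[text_col]))
    return pairs


def missing_clips(pairs):
    """List clip paths referenced in the metadata but absent on disk."""
    return [clip for clip, _ in pairs if not clip.exists()]


if __name__ == "__main__":
    # Only runs when the dataset has actually been extracted locally.
    if (DATASET_ROOT / "dev.tsv").exists():
        dev_pairs = load_split(DATASET_ROOT, "dev")
        print(DATASET_INFO["name"], "dev rows:", len(dev_pairs))
        print("missing clips:", len(missing_clips(dev_pairs)))
```

Checking for missing clips before normalization or benchmarking is a cheap way to catch an incomplete extraction. Speaker-related columns are deliberately not read here, in keeping with the usage rule against identifying speakers.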