musaw committed on
Commit 7a9f810
Parent(s): fbc2945
docs(data): integrate Common Voice Pashto dataset and contribution guide
- CONTRIBUTING.md +15 -0
- README.md +12 -0
- data/README.md +11 -0
- docs/common_voice_pashto_24.md +52 -0
CONTRIBUTING.md
CHANGED
@@ -8,6 +8,21 @@ Thanks for helping build open Pashto AI resources.
 - Model training/evaluation scripts
 - Documentation, issue triage, and testing
 
+## Mozilla Common Voice Path
+You can contribute to Pashto data directly on Common Voice and connect it back
+to this project.
+
+Common Voice Pashto actions:
+- Speak: `https://commonvoice.mozilla.org/ps/speak`
+- Write: `https://commonvoice.mozilla.org/ps/write`
+- Listen: `https://commonvoice.mozilla.org/ps/listen`
+- Review: `https://commonvoice.mozilla.org/ps/review`
+
+Then contribute here by opening an issue/PR with:
+- what you worked on,
+- what data quality gap you observed,
+- what concrete follow-up is needed in this repository.
+
 ## Contribution flow
 1. Open or pick an issue.
 2. Comment your plan.
README.md
CHANGED
@@ -22,6 +22,18 @@ Community-led open-source project to make Pashto a first-class language in AI sp
 - Keep work reproducible, transparent, and contribution-friendly.
 - Focus on public good and broad accessibility.
 
+## Featured External Dataset
+- `Common Voice Scripted Speech 24.0 - Pashto`
+- Source:
+  `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+- Project integration guide: `docs/common_voice_pashto_24.md`
+
+## Contribute Through Mozilla Common Voice
+- Speak: `https://commonvoice.mozilla.org/ps/speak`
+- Write: `https://commonvoice.mozilla.org/ps/write`
+- Listen: `https://commonvoice.mozilla.org/ps/listen`
+- Review: `https://commonvoice.mozilla.org/ps/review`
+
 ## Start Here
 - Purpose: `PROJECT_PURPOSE.md`
 - Contributing: `CONTRIBUTING.md`
data/README.md
CHANGED
@@ -9,6 +9,17 @@
 - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
 - `../scripts/validate_normalization.py` basic file validator
 
+## External Source: Mozilla Common Voice (Pashto)
+- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
+- Source page:
+  `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+- Local target path: `data/raw/common_voice_scripted_ps_v24/`
+- Integration guide: `../docs/common_voice_pashto_24.md`
+
+### Notes
+- Keep raw downloaded dataset files out of git.
+- Track source URL + version in experiment notes for reproducibility.
+
 ## Validate Seed File
 ```bash
 python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
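The note in `data/README.md` about tracking source URL and version could be supported by a small helper. What follows is a minimal sketch and not part of this commit: the function name `write_provenance`, the JSON field names, and the output file name `common_voice_ps_provenance.json` are hypothetical. Only the dataset name, source URL, and local target path are taken from the diff above.

```python
import json
from datetime import date
from pathlib import Path

# Values copied from the data/README.md hunk above.
DATASET_NAME = "Common Voice Scripted Speech 24.0 - Pashto"
SOURCE_URL = (
    "https://datacollective.mozillafoundation.org/datasets/"
    "cmj8u3pnb00llnxxbfvxo3b14"
)
LOCAL_PATH = "data/raw/common_voice_scripted_ps_v24/"


def write_provenance(notes_dir, accessed=None):
    """Write a JSON provenance record into notes_dir and return its path.

    Hypothetical helper: the field names and the file name are assumptions,
    not an existing convention of this repository.
    """
    record = {
        "dataset": DATASET_NAME,
        "version": "24.0",  # taken from the dataset name
        "source_url": SOURCE_URL,
        "local_path": LOCAL_PATH,
        # Date the files were downloaded; defaults to today.
        "accessed": (accessed or date.today()).isoformat(),
    }
    out_path = Path(notes_dir) / "common_voice_ps_provenance.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return out_path
```

A record like this is small enough to commit next to experiment notes, while the raw audio under `data/raw/` stays out of git.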
docs/common_voice_pashto_24.md
ADDED
@@ -0,0 +1,52 @@
+# Common Voice Scripted Speech 24.0 - Pashto Integration Guide
+
+This project recognizes Mozilla Common Voice as a major source for Pashto ASR
+progress and community participation.
+
+## Dataset
+- Name: `Common Voice Scripted Speech 24.0 - Pashto`
+- Dataset page:
+  `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+- Release date: `2025-12-05`
+- Format: `MP3` with TSV metadata
+- Approximate size: `49.98 GB`
+- License: `CC0-1.0`
+
+## Important Usage Rules
+- Do not attempt to identify speakers.
+- Do not re-host or re-share the raw dataset files.
+- Keep provenance and version information when reporting experiments.
+
+## How To Use In This Repository
+1. Download from the official Mozilla Data Collective page.
+2. Extract locally under:
+   `data/raw/common_voice_scripted_ps_v24/`
+3. Keep raw audio out of git.
+4. Use project scripts/docs for normalization, splits, and benchmarking.
+
+Recommended local structure:
+```text
+data/raw/common_voice_scripted_ps_v24/
+  clips/
+  train.tsv
+  dev.tsv
+  test.tsv
+```
+
+## How To Contribute Through Mozilla Common Voice
+Contributors can directly improve Pashto resources on Common Voice:
+- Speak: `https://commonvoice.mozilla.org/ps/speak`
+- Write: `https://commonvoice.mozilla.org/ps/write`
+- Listen: `https://commonvoice.mozilla.org/ps/listen`
+- Review: `https://commonvoice.mozilla.org/ps/review`
+
+## Contribution Loop Back To This Project
+After contributing on Common Voice, open an issue/PR here and share:
+- what task you worked on (speak/write/listen/review),
+- what quality gaps you observed,
+- what dataset or modeling step should be improved next.
+
+Use issue labels:
+- `data`
+- `good first issue`
+- `help wanted`
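The "Recommended local structure" in the guide can be checked after extraction with a short script. This is a sketch under stated assumptions and not something the commit adds: `missing_entries` and `count_rows` are hypothetical helper names, and the row count assumes each TSV starts with a single header line. The expected entry names come from the listing in the guide.

```python
import csv
from pathlib import Path

# Entries from the "Recommended local structure" listing in the guide.
EXPECTED_ENTRIES = ("clips", "train.tsv", "dev.tsv", "test.tsv")


def missing_entries(root):
    """Return the expected entries that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def count_rows(tsv_path):
    """Count data rows in a TSV file, assuming the first line is a header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters in transcripts as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row, if present
        return sum(1 for _ in reader)
```

Calling `missing_entries("data/raw/common_voice_scripted_ps_v24")` returns an empty list when the extracted layout matches the listing; any names it does return are the entries still missing.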