musaw committed on
Commit 7a9f810 · 1 Parent(s): fbc2945

docs(data): integrate Common Voice Pashto dataset and contribution guide

Files changed (4)
  1. CONTRIBUTING.md +15 -0
  2. README.md +12 -0
  3. data/README.md +11 -0
  4. docs/common_voice_pashto_24.md +52 -0
CONTRIBUTING.md CHANGED
@@ -8,6 +8,21 @@ Thanks for helping build open Pashto AI resources.
  - Model training/evaluation scripts
  - Documentation, issue triage, and testing
 
+ ## 🌐 Mozilla Common Voice Path
+ You can contribute to Pashto data directly on Common Voice and connect it back
+ to this project.
+
+ Common Voice Pashto actions:
+ - Speak: `https://commonvoice.mozilla.org/ps/speak`
+ - Write: `https://commonvoice.mozilla.org/ps/write`
+ - Listen: `https://commonvoice.mozilla.org/ps/listen`
+ - Review: `https://commonvoice.mozilla.org/ps/review`
+
+ Then contribute here by opening an issue/PR with:
+ - what you worked on,
+ - what data quality gap you observed,
+ - what concrete follow-up is needed in this repository.
+
  ## 🔄 Contribution flow
  1. Open or pick an issue.
  2. Comment your plan.
README.md CHANGED
@@ -22,6 +22,18 @@ Community-led open-source project to make Pashto a first-class language in AI sp
  - Keep work reproducible, transparent, and contribution-friendly.
  - Focus on public good and broad accessibility.
 
+ ## 📚 Featured External Dataset
+ - `Common Voice Scripted Speech 24.0 - Pashto`
+ - Source:
+   `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+ - Project integration guide: `docs/common_voice_pashto_24.md`
+
+ ## 🙌 Contribute Through Mozilla Common Voice
+ - Speak: `https://commonvoice.mozilla.org/ps/speak`
+ - Write: `https://commonvoice.mozilla.org/ps/write`
+ - Listen: `https://commonvoice.mozilla.org/ps/listen`
+ - Review: `https://commonvoice.mozilla.org/ps/review`
+
  ## 🚀 Start Here
  - 📘 Purpose: `PROJECT_PURPOSE.md`
  - 🤝 Contributing: `CONTRIBUTING.md`
data/README.md CHANGED
@@ -9,6 +9,17 @@
  - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
  - `../scripts/validate_normalization.py` basic file validator
 
+ ## External Source: Mozilla Common Voice (Pashto)
+ - Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
+ - Source page:
+   `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+ - Local target path: `data/raw/common_voice_scripted_ps_v24/`
+ - Integration guide: `../docs/common_voice_pashto_24.md`
+
+ ### Notes
+ - Keep raw downloaded dataset files out of git.
+ - Track source URL + version in experiment notes for reproducibility.
+
  ## Validate Seed File
  ```bash
  python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
docs/common_voice_pashto_24.md ADDED
@@ -0,0 +1,52 @@
+ # Common Voice Scripted Speech 24.0 - Pashto Integration Guide
+
+ This project recognizes Mozilla Common Voice as a major source for Pashto ASR
+ progress and community participation.
+
+ ## Dataset
+ - Name: `Common Voice Scripted Speech 24.0 - Pashto`
+ - Dataset page:
+   `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+ - Release date: `2025-12-05`
+ - Format: `MP3` with TSV metadata
+ - Approximate size: `49.98 GB`
+ - License: `CC0-1.0`
+
+ ## Important Usage Rules
+ - Do not attempt to identify speakers.
+ - Do not re-host or re-share the raw dataset files.
+ - Keep provenance and version information when reporting experiments.
+
+ ## How To Use In This Repository
+ 1. Download from the official Mozilla Data Collective page.
+ 2. Extract locally under:
+    `data/raw/common_voice_scripted_ps_v24/`
+ 3. Keep raw audio out of git.
+ 4. Use project scripts/docs for normalization, splits, and benchmarking.
+
+ Recommended local structure:
+ ```text
+ data/raw/common_voice_scripted_ps_v24/
+   clips/
+   train.tsv
+   dev.tsv
+   test.tsv
+ ```
+
+ ## How To Contribute Through Mozilla Common Voice
+ Contributors can directly improve Pashto resources on Common Voice:
+ - Speak: `https://commonvoice.mozilla.org/ps/speak`
+ - Write: `https://commonvoice.mozilla.org/ps/write`
+ - Listen: `https://commonvoice.mozilla.org/ps/listen`
+ - Review: `https://commonvoice.mozilla.org/ps/review`
+
+ ## Contribution Loop Back To This Project
+ After contributing on Common Voice, open an issue/PR here and share:
+ - what task you worked on (speak/write/listen/review),
+ - what quality gaps you observed,
+ - what dataset or modeling step should be improved next.
+
+ Use issue labels:
+ - `data`
+ - `good first issue`
+ - `help wanted`
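
The local layout recommended in `docs/common_voice_pashto_24.md` can be sanity-checked before any project script runs. The sketch below is not part of this commit. It assumes the archive was extracted to `data/raw/common_voice_scripted_ps_v24/` with a `clips/` directory and `train.tsv`, `dev.tsv`, and `test.tsv` beside it, as the guide lists, and that each TSV starts with a header row. The helper names (`missing_entries`, `count_rows`, `summarize`) are hypothetical and do not exist in the repository.

```python
import csv
from pathlib import Path

# Entries listed in the guide's "Recommended local structure" block.
# Assumption: the extracted archive matches that layout; adjust if it differs.
EXPECTED_ENTRIES = ("clips", "train.tsv", "dev.tsv", "test.tsv")


def missing_entries(root: Path) -> list[str]:
    """Return the expected entries that are absent under `root`."""
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def count_rows(tsv_path: Path) -> int:
    """Count non-empty data rows (header excluded) in a tab-separated file."""
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside sentences as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)


def summarize(root: Path) -> dict[str, int]:
    """Map each split file that exists under `root` to its row count."""
    return {
        name: count_rows(root / name)
        for name in EXPECTED_ENTRIES
        if name.endswith(".tsv") and (root / name).is_file()
    }


if __name__ == "__main__":
    root = Path("data/raw/common_voice_scripted_ps_v24")
    missing = missing_entries(root)
    if missing:
        print("Missing:", ", ".join(missing))
    for name, rows in summarize(root).items():
        print(f"{name}: {rows} rows")
```

Recording the per-split row counts next to the source URL and dataset version is one way to follow the guide's request to keep provenance information when reporting experiments.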