# 🗂️ Data Workspace
- `raw/` incoming source files
- `processed/` cleaned/aligned artifacts
- `metadata/` manifests, speaker/dialect info, QA reports
## First Contribution (Normalization Starter)
- `processed/normalization_seed_v0.1.tsv` starter normalization examples
- `../docs/pashto_normalization_v0.1.md` baseline normalization policy
- `../scripts/validate_normalization.py` basic file validator
## External Source: Mozilla Common Voice (Pashto)
- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
- Source page: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
- Local target path: `data/raw/common_voice_scripted_ps_v24/`
- Integration guide: `../docs/common_voice_pashto_24.md`
### Notes
- Keep raw downloaded dataset files out of git.
- Track source URL + version in experiment notes for reproducibility.
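One way to follow the second note is to write a small provenance record next to the experiment notes. The sketch below is only an illustration: the function name `write_source_record`, the JSON field names, and the output location are assumptions, not part of this repository.

```python
import json
from datetime import date
from pathlib import Path


def write_source_record(path, dataset, version, source_url, retrieved=None):
    """Write a small JSON provenance record (hypothetical helper, not a repo script)."""
    record = {
        "dataset": dataset,
        "version": version,
        "source_url": source_url,
        # Default to today's date when no retrieval date is given.
        "retrieved": retrieved or date.today().isoformat(),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps any Pashto text readable in the file.
    path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return record
```

A record like this could live under `metadata/` (the file name is up to you), so the dataset version and source URL stay in git while the raw audio does not.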
## Validate Seed File
Run from the repository root (the paths below are relative to it, not to `data/`):
```bash
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
```
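For readers wondering what a "basic file validator" for a TSV might check, here is a minimal sketch. It is not the contents of `validate_normalization.py`; the function name `validate_tsv` and the two checks (consistent column count, no empty fields) are assumptions chosen for illustration, and the seed file's actual columns may differ.

```python
import csv
import io


def validate_tsv(text):
    """Return a list of problem descriptions; an empty list means none were found.

    Hypothetical checks: every row has as many columns as the header,
    and no field is empty or whitespace-only.
    """
    problems = []
    # QUOTE_NONE treats quote characters as ordinary text, which suits plain TSV.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
        elif any(not cell.strip() for cell in row):
            problems.append(f"line {line_no}: empty field")
    return problems
```

To try it on the seed file, read the file as UTF-8 text and pass the contents to `validate_tsv`; each returned string names the offending line.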