# 🗂️ Data Workspace

- `raw/` incoming source files
- `processed/` cleaned/aligned artifacts
- `metadata/` manifests, speaker/dialect info, QA reports

## First Contribution (Normalization Starter)

- `processed/normalization_seed_v0.1.tsv` starter normalization examples
- `../docs/pashto_normalization_v0.1.md` baseline normalization policy
- `../scripts/validate_normalization.py` basic file validator

## External Source: Mozilla Common Voice (Pashto)

- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
- Source page: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
- Local target path: `data/raw/common_voice_scripted_ps_v24/`
- Integration guide: `../docs/common_voice_pashto_24.md`

### Notes

- Keep raw downloaded dataset files out of git.
- Track source URL + version in experiment notes for reproducibility.

## Validate Seed File

```bash
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
```
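
On the reproducibility note for the external Common Voice source: one possible way to track the source URL and version is to store a small provenance record alongside experiment notes. The snippet below is only a sketch. The function name `build_source_record` and the field names (`dataset`, `source_url`, `version`, `local_path`, `recorded_at`) are hypothetical and not part of this repository; the version string `"24.0"` is inferred from the dataset name.

```python
import json
from datetime import datetime, timezone


def build_source_record(name, url, version, local_path):
    """Return a dict describing where a raw dataset came from.

    Hypothetical helper: the field names are illustrative, not a
    schema defined by this repository.
    """
    return {
        "dataset": name,
        "source_url": url,
        "version": version,
        "local_path": local_path,
        # UTC timestamp of when the record was written
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_source_record(
    name="Common Voice Scripted Speech 24.0 - Pashto",
    url="https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
    version="24.0",  # assumed from the dataset name
    local_path="data/raw/common_voice_scripted_ps_v24/",
)

# Print as JSON so it can be pasted into experiment notes
print(json.dumps(record, indent=2, ensure_ascii=False))
```

A record like this could be kept under `metadata/`, which stays in git even though the raw files themselves do not.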