🗂️ Data Workspace
- raw/: incoming source files
- processed/: cleaned/aligned artifacts
- metadata/: manifests, speaker/dialect info, QA reports
First Contribution (Normalization Starter)
- processed/normalization_seed_v0.1.tsv: starter normalization examples
- ../docs/pashto_normalization_v0.1.md: baseline normalization policy
- ../scripts/validate_normalization.py: basic file validator
External Source: Mozilla Common Voice (Pashto)
- Dataset: Common Voice Scripted Speech 24.0 - Pashto
- Source page: https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14
- Local target path: data/raw/common_voice_scripted_ps_v24/
- Integration guide: ../docs/common_voice_pashto_24.md
Notes
- Keep raw downloaded dataset files out of git.
- Track source URL + version in experiment notes for reproducibility.
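
One way to follow the second note is to write a small provenance record alongside the other manifests. The sketch below is an illustration under assumptions, not part of the repository: the function name `write_source_record`, the JSON field names, and any output file name you choose are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def write_source_record(out_path, dataset, version, source_url, local_path):
    """Write a JSON provenance record and return it as a dict.

    Field names are hypothetical; adapt them to your experiment notes.
    """
    record = {
        "dataset": dataset,
        "version": version,
        "source_url": source_url,
        "local_path": local_path,
        # Date the record was written, not necessarily the download date.
        "recorded_on": date.today().isoformat(),
    }
    out_path = Path(out_path)
    # Create the parent directory (for example metadata/) if it is missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return record
```

For the Common Voice source, this could be called with the dataset name, version `24.0`, the source page URL, and the local target path listed in the previous section, writing to a file under metadata/.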
Validate Seed File
Run from the repository root:

python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
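
The checks that validate_normalization.py performs are not described here. As a rough illustration of what a basic TSV validator can look like, the sketch below assumes a header row, a consistent column count, and no empty cells. These rules and the name `check_tsv` are hypothetical and may differ from the actual script.

```python
import csv


def check_tsv(path):
    """Return a list of human-readable problems found in a TSV file.

    Assumed rules (hypothetical): first row is a header, every later row
    has the same number of columns, and no cell is blank.
    """
    problems = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text, so one
        # physical line always maps to one row.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    f"line {line_no}: expected {len(header)} columns, found {len(row)}"
                )
            elif any(not cell.strip() for cell in row):
                problems.append(f"line {line_no}: empty cell")
    return problems
```

An empty list means the file passed these assumed checks; each string otherwise names the offending line.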