musaw
docs(data): integrate Common Voice Pashto dataset and contribution guide
7a9f810

🗂️ Data Workspace

  • raw/: incoming source files
  • processed/: cleaned and aligned artifacts
  • metadata/: manifests, speaker/dialect info, QA reports

First Contribution (Normalization Starter)

  • processed/normalization_seed_v0.1.tsv: starter normalization examples
  • ../docs/pashto_normalization_v0.1.md: baseline normalization policy
  • ../scripts/validate_normalization.py: basic file validator

External Source: Mozilla Common Voice (Pashto)

  • Dataset: Common Voice Scripted Speech 24.0 - Pashto
  • Source page: https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14
  • Local target path: data/raw/common_voice_scripted_ps_v24/
  • Integration guide: ../docs/common_voice_pashto_24.md

Notes

  • Keep raw downloaded dataset files out of Git.
  • Record the source URL and dataset version in experiment notes for reproducibility.
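
One possible way to capture the source URL and version for experiment notes is a small JSON record. This is a minimal sketch: the dataset name, URL, and local path are copied from this README, while the function names and field names (`build_source_record`, `format_source_record`, `retrieved_on`, and so on) are hypothetical and not part of the repository.

```python
import json
from datetime import date


def build_source_record(retrieved_on: str) -> dict:
    """Build a provenance record for the Common Voice Pashto download.

    Values come from this README; the field names are illustrative assumptions.
    """
    return {
        "dataset": "Common Voice Scripted Speech 24.0 - Pashto",
        "version": "24.0",
        "source_url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
        "local_path": "data/raw/common_voice_scripted_ps_v24/",
        "retrieved_on": retrieved_on,  # ISO date string, e.g. "2025-01-31"
    }


def format_source_record(record: dict) -> str:
    """Serialize the record as stable, human-readable JSON."""
    return json.dumps(record, ensure_ascii=False, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Print a record stamped with today's date; paste it into the experiment notes.
    print(format_source_record(build_source_record(date.today().isoformat())))
```

Sorting the keys keeps the output stable across runs, so records pasted into notes diff cleanly when only the date or version changes.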

Validate Seed File

Run from the repository root:

python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
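
For orientation, the sketch below shows the kind of structural check a basic TSV validator might perform. It is not the contents of scripts/validate_normalization.py, whose actual rules are not described here. It assumes only that the seed file has a header row and that every data row should have the same number of non-empty, tab-separated fields. The names `find_tsv_problems` and `check_file` are hypothetical.

```python
from pathlib import Path


def find_tsv_problems(text: str) -> list[str]:
    """Return human-readable problems found in TSV text (empty list if none).

    Assumed rules (not taken from the real validator): the first line is a
    header, blank lines are skipped, and every other row must have as many
    non-empty tab-separated fields as the header.
    """
    lines = text.splitlines()
    if not lines or not lines[0].strip():
        return ["line 1: missing header row"]

    expected = len(lines[0].split("\t"))
    problems = []
    for line_no, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # skip blank (whitespace-only) lines
        fields = line.split("\t")
        if len(fields) != expected:
            problems.append(
                f"line {line_no}: expected {expected} fields, found {len(fields)}"
            )
        elif any(not field.strip() for field in fields):
            problems.append(f"line {line_no}: empty field")
    return problems


def check_file(path: str) -> list[str]:
    """Read a UTF-8 TSV file and return its problems."""
    return find_tsv_problems(Path(path).read_text(encoding="utf-8"))
```

Calling `check_file("data/processed/normalization_seed_v0.1.tsv")` from the repository root would return an empty list when the file satisfies these assumed rules.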