| # ποΈ Data Workspace |
|
|
| - [raw/](raw/) incoming source files |
| - [processed/](processed/) cleaned/aligned artifacts |
| - [metadata/](metadata/) manifests, speaker/dialect info, QA reports |
|
|
| ## β
Verified External Datasets |
|
|
| ### ποΈ Common Voice Scripted Speech 24.0 - Pashto |
| - Link: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) |
| - Why useful: largest open community Pashto speech source for ASR training and evaluation. |
| - How to use here: download to `data/raw/common_voice_scripted_ps_v24/` and follow [docs/common_voice_pashto_24.md](../docs/common_voice_pashto_24.md). |
|
|
| ### πΈ Google FLEURS (Pashto config) |
| - Link: [huggingface.co/datasets/google/fleurs](https://huggingface.co/datasets/google/fleurs) |
| - Pashto validation: [`fleurs.py` includes `"ps_af"`](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py). |
| - Why useful: standardized multilingual speech benchmark split for comparable ASR scores. |
| - How to use here: treat as external eval set for [benchmarks/](../benchmarks/README.md) and avoid training/eval leakage. |
|
|
| ### π OSCAR Corpus (Pashto web text) |
| - Link: [huggingface.co/datasets/oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) |
| - Pashto validation: dataset includes `unshuffled_deduplicated_ps`. |
| - Why useful: large-scale Pashto text for LM pretraining and lexicon expansion. |
| - How to use here: normalize and sample into [processed/](processed/) for NLP/ASR language model support. |
|
|
| ### π° Wikimedia Wikipedia (Pashto dump) |
| - Link: [huggingface.co/datasets/wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) |
| - Pashto validation: subset includes `20231101.ps`. |
| - Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance. |
| - How to use here: include as a high-quality text source in normalization and glossary workflows. |
|
|
| ### π Belebele (reading-comprehension benchmark) |
| - Link: [huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele) |
| - Pashto validation: subset includes `pbt_Arab`. |
| - Why useful: useful downstream benchmark for comprehension-oriented NLP progress in Pashto. |
| - How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md). |
|
|
| ### π OPUS-100 (parallel text, en-ps) |
| - Link: [huggingface.co/datasets/Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) |
| - Pashto validation: dataset viewer includes `en-ps` subset. |
| - Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks. |
| - How to use here: keep in external eval/training split plans and log subset/version in run cards. |
|
|
| ### π€ Pashto Isolated Words Speech Dataset (Kaggle) |
| - Link: [kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) |
| - Pashto validation: dataset title is explicitly Pashto isolated-word speech. |
| - Why useful: useful for small-footprint ASR or keyword-spotting experiments. |
| - How to use here: treat as task-specific speech data and document licensing/collection assumptions before use. |
|
|
| ### π§ Pashto Word Embeddings (Kaggle) |
| - Link: [kaggle.com/datasets/drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) |
| - Pashto validation: dataset description states pretrained Pashto embeddings. |
| - Why useful: quick-start lexical semantics baseline for NLP experiments. |
| - How to use here: benchmark against transformer encoders in downstream Pashto tasks. |
|
|
| ## First Contribution (Normalization Starter) |
| - [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv) starter normalization examples |
| - [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md) baseline normalization policy |
| - [scripts/validate_normalization.py](../scripts/validate_normalization.py) basic file validator |
|
|
| ## π§ͺ Validate Seed File |
| ```bash |
| python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv |
| ``` |
|
|
| ## π Notes |
| - Keep raw downloaded dataset files out of git. |
| - Track source URL + version in experiment notes for reproducibility. |
| - Re-check external links before every milestone release. |
|
|