docs: add validated Pashto resource catalog and workspace guides

Files changed (10)

CONTRIBUTING.md +14 -5
README.md +31 -23
apps/desktop/README.md +23 -0
asr/README.md +26 -0
benchmarks/README.md +28 -0
data/README.md +37 -12
docs/platforms.md +16 -8
docs/resource_catalog.md +58 -0
docs/workstreams.md +8 -0
tts/README.md +23 -0

CONTRIBUTING.md CHANGED

@@ -2,15 +2,15 @@
 Thanks for helping build open Pashto AI resources.
-## 🧩 Ways to contribute
 - Data recording and validation
 - Text normalization and terminology fixes
 - Model training/evaluation scripts
 - Documentation, issue triage, and testing
 ## 🌐 Mozilla Common Voice Path
-You can contribute to Pashto data directly on Common Voice and connect it back
-to this project.
 Common Voice Pashto actions:
 - Speak: `https://commonvoice.mozilla.org/ps/speak`
@@ -23,7 +23,16 @@ Then contribute here by opening an issue/PR with:
 - what data quality gap you observed,
 - what concrete follow-up is needed in this repository.
-## 🔄 Contribution flow
 1. Open or pick an issue.
 2. Comment your plan.
 3. Create a branch and make focused changes.
@@ -35,7 +44,7 @@ Then contribute here by opening an issue/PR with:
 - Document assumptions, limitations, and risks.
 - Respect contributors and community guidelines.
-## 🏷️ Priority labels (recommended)
 - `good first issue`
 - `data`
 - `asr`

 Thanks for helping build open Pashto AI resources.
+## 🧩 Ways to Contribute
 - Data recording and validation
 - Text normalization and terminology fixes
 - Model training/evaluation scripts
 - Documentation, issue triage, and testing
+- External resource discovery and validation
 ## 🌐 Mozilla Common Voice Path
+You can contribute to Pashto data directly on Common Voice and connect it back to this project.
 Common Voice Pashto actions:
 - Speak: `https://commonvoice.mozilla.org/ps/speak`
 - what data quality gap you observed,
 - what concrete follow-up is needed in this repository.
+## 🔍 External Resource Contribution Rules
+- Add links in the correct workspace README (`data`, `asr`, `tts`, `benchmarks`, `apps`).
+- Update `docs/resource_catalog.md` with:
+  - what the resource is,
+  - explicit Pashto support evidence,
+  - how it can be used in this repository,
+  - practical applications.
+- Prefer official pages and model/dataset cards over third-party reposts.
+## 🔄 Contribution Flow
 1. Open or pick an issue.
 2. Comment your plan.
 3. Create a branch and make focused changes.
 - Document assumptions, limitations, and risks.
 - Respect contributors and community guidelines.
+## 🏷️ Priority Labels (Recommended)
 - `good first issue`
 - `data`
 - `asr`

README.md CHANGED

@@ -22,11 +22,31 @@ Community-led open-source project to make Pashto a first-class language in AI sp
 - Keep work reproducible, transparent, and contribution-friendly.
 - Focus on public good and broad accessibility.
-## 📚 Featured External Dataset
-- Common Voice Scripted Speech 24.0 - Pashto
-- Source:
-  [https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
-- Project integration guide: [docs/common_voice_pashto_24.md](docs/common_voice_pashto_24.md)
 ## 🙌 Contribute Through Mozilla Common Voice
 - Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
@@ -34,21 +54,9 @@ Community-led open-source project to make Pashto a first-class language in AI sp
 - Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
 - Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
-## 🌐 Community Resource Profiles
-- Hugging Face (external Pashto resource profile): [https://huggingface.co/ihanif](https://huggingface.co/ihanif)
-- Use this profile as a reference point for Pashto ASR/TTS datasets, models, and
-  community experiments.
-## 🚀 Start Here
-- 📘 Purpose: `PROJECT_PURPOSE.md`
-- 🤝 Contributing: `CONTRIBUTING.md`
-- 🗺️ Roadmap: `ROADMAP.md`
-- 🏛️ Governance: `GOVERNANCE.md`
-- 💬 Community coordination: `community/COMMUNICATION.md`
-## 🧩 Initial Workstreams
-- `data/` Pashto data collection, cleaning, metadata
-- `asr/` speech-to-text baselines and experiments
-- `tts/` text-to-speech baselines and experiments
-- `benchmarks/` fixed test sets and evaluation scripts
-- `apps/desktop/` app integration references

 - Keep work reproducible, transparent, and contribution-friendly.
 - Focus on public good and broad accessibility.
+## 🧭 Documentation Map
+- Purpose: `PROJECT_PURPOSE.md`
+- Contributing: `CONTRIBUTING.md`
+- Roadmap: `ROADMAP.md`
+- Governance: `GOVERNANCE.md`
+- Community: `community/COMMUNICATION.md`
+- Release process: `docs/release_process.md`
+- Workstreams: `docs/workstreams.md`
+- Verified external resources: `docs/resource_catalog.md`
+## 📚 Verified Resource Catalog
+The project now tracks validated external resources in one place:
+- `docs/resource_catalog.md`
+This catalog includes:
+- Datasets
+- Models
+- Benchmarks
+- Tools and applications
+- Validation notes and integration guidance
+## 🎙️ Featured Dataset: Common Voice Pashto
+- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
+- Source: [https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
+- Integration guide: `docs/common_voice_pashto_24.md`
 ## 🙌 Contribute Through Mozilla Common Voice
 - Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
 - Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
 - Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
+## 🧩 Workspaces
+- `data/` datasets, curation, metadata, quality
+- `asr/` ASR baselines and experiments
+- `tts/` TTS baselines and experiments
+- `benchmarks/` benchmark sets and evaluation
+- `apps/desktop/` user-facing integration references

apps/desktop/README.md CHANGED

@@ -1,3 +1,26 @@
 # 🖥️ Desktop Integration
 Tracks desktop app integration for ASR/TTS/translation pipelines.

 # 🖥️ Desktop Integration
 Tracks desktop app integration for ASR/TTS/translation pipelines.
+## ✅ Verified Application Building Blocks
+### 🎤 Speech Input: Faster-Whisper
+- Repo: `https://github.com/SYSTRAN/faster-whisper`
+- Use in apps: fast offline/near-real-time transcription components.
+### 🔈 Speech Output: Coqui TTS
+- Repo: `https://github.com/coqui-ai/TTS`
+- Use in apps: local speech synthesis modules for Pashto-enabled UX.
+### 🌍 Translation Layer: OPUS MT (via multilingual models)
+- Models:
+  - `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul`
+  - `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en`
+- Pashto validation: language list includes `pus`.
+- Use in apps: Pashto↔English assistive translation path for demos.
+## 🧩 Suggested Desktop Pipeline
+1. Mic input → ASR transcription
+2. Optional translation (Pashto ↔ English)
+3. Optional TTS playback in Pashto
+4. Save logs for QA and benchmark replay
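The four pipeline steps above can be sketched as a small composition of pluggable stages. This is a minimal illustration, not the project's implementation: `run_pipeline`, `PipelineResult`, and `log_as_jsonl` are hypothetical names, and the `asr`, `translate`, and `tts` callables stand in for whatever backends (for example Faster-Whisper, OPUS MT, or a TTS toolkit) the app eventually wires in.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PipelineResult:
    """Outputs of one pipeline run plus a replayable step log (hypothetical type)."""
    transcript: str
    translation: Optional[str] = None
    audio_out: Optional[bytes] = None
    log: list = field(default_factory=list)


def run_pipeline(
    audio_in: bytes,
    asr: Callable[[bytes], str],
    translate: Optional[Callable[[str], str]] = None,
    tts: Optional[Callable[[str], bytes]] = None,
) -> PipelineResult:
    # Step 1: mic input -> ASR transcription (backend is injected, not assumed).
    result = PipelineResult(transcript=asr(audio_in))
    result.log.append({"step": "asr", "output": result.transcript})

    # Step 2: optional translation of the transcript.
    if translate is not None:
        result.translation = translate(result.transcript)
        result.log.append({"step": "translate", "output": result.translation})

    # Step 3: optional TTS playback of the Pashto transcript.
    if tts is not None:
        result.audio_out = tts(result.transcript)
        result.log.append({"step": "tts", "bytes": len(result.audio_out)})

    return result


def log_as_jsonl(result: PipelineResult) -> str:
    # Step 4: one JSON object per line, suitable for QA and benchmark replay.
    return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in result.log)
```

Passing the stages in as callables keeps the sketch independent of any particular model library, so each backend can be swapped or stubbed in tests.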

asr/README.md CHANGED

@@ -1,3 +1,29 @@
 # 🎙️ ASR Workspace
 Place ASR baselines, training configs, and evaluation scripts here.

 # 🎙️ ASR Workspace
 Place ASR baselines, training configs, and evaluation scripts here.
+## ✅ Verified Pashto-Relevant ASR Models
+### 🧠 OpenAI Whisper Large v3
+- Model: `https://huggingface.co/openai/whisper-large-v3`
+- Pashto validation: Whisper tokenizer language map includes `"ps": "pashto"`.
+- Use in this repo: strong baseline and pseudo-labeling engine for bootstrapping.
+- Applications: transcription, subtitle generation, dataset pre-labeling.
+### 🌐 Meta MMS Coverage (ASR + TTS language support)
+- Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
+- Pashto validation: row includes `pus` with ASR and TTS support.
+- Use in this repo: multilingual transfer baseline when Pashto data is limited.
+- Applications: low-resource ASR transfer experiments.
+## ⚙️ Verified Inference Tooling
+### 🚀 Faster-Whisper
+- Repo: `https://github.com/SYSTRAN/faster-whisper`
+- Why useful: optimized Whisper inference for faster experimentation.
+- Use in this repo: local transcription pipelines and benchmark generation speedups.
+## 🧩 Integration Hints
+- Keep all model/eval runs reproducible with command logs and commit hashes.
+- Store evaluation outputs under `benchmarks/` with model/version labels.
+- Track WER/CER with dataset split and normalization policy references.
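For the WER/CER tracking mentioned above, the following is a minimal pure-Python sketch based on Levenshtein edit distance. It assumes both strings have already been passed through the same normalization policy, and that CER counts every character including spaces; conventions differ between toolkits, so the chosen convention should be recorded next to the scores. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("زه کور ته ځم", "زه کور ځم")` is one deleted word out of four reference words, so it evaluates to `0.25`.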

benchmarks/README.md CHANGED

@@ -1,3 +1,31 @@
 # 🧪 Benchmarks
 Define fixed test sets, metrics, and leaderboard generation scripts.

 # 🧪 Benchmarks
 Define fixed test sets, metrics, and leaderboard generation scripts.
+## ✅ Verified Benchmark Sources
+### 🌸 FLEURS (Pashto speech benchmark)
+- Dataset: `https://huggingface.co/datasets/google/fleurs`
+- Pashto validation: `fleurs.py` includes `ps_af`.
+- Primary use: multilingual ASR benchmark with fixed split conventions.
+### 📘 Belebele (Pashto reading benchmark)
+- Dataset: `https://huggingface.co/datasets/facebook/belebele`
+- Pashto validation: subset includes `pbt_Arab`.
+- Primary use: comprehension benchmark for multilingual NLP models.
+### 🗣️ Common Voice Pashto v24
+- Dataset: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+- Primary use: ASR train/dev/test experiments and project baseline tracking.
+## 📏 Recommended Metrics
+- ASR: `WER`, `CER`
+- TTS: `MCD` or other objective proxies, plus human MOS-style listening scores
+- NLP: task-specific accuracy/F1 with fixed test set
+## 🧾 Reporting Template
+- Benchmark dataset + version
+- Model + checkpoint version
+- Normalization policy version
+- Metrics and error analysis summary
+- Reproducible command/config reference
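One way to keep reports aligned with the template above is a small record type whose fields mirror the five bullets. This is a sketch under assumptions: `BenchmarkReport` and its field names are hypothetical, and any values a caller fills in are their own; nothing here reflects real benchmark results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkReport:
    """Hypothetical report record mirroring the reporting template."""
    dataset: str               # benchmark dataset name
    dataset_version: str       # dataset version or snapshot id
    model: str                 # model identifier
    checkpoint: str            # checkpoint or revision
    normalization_policy: str  # normalization policy version
    metrics: dict              # metric name -> value
    error_analysis: str        # short error analysis summary
    command: str               # reproducible command or config reference

    def to_json(self) -> str:
        # Refuse to serialize a report with any empty template field.
        missing = [k for k, v in asdict(self).items() if v in ("", None, {})]
        if missing:
            raise ValueError(f"missing report fields: {', '.join(missing)}")
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Rejecting empty fields at serialization time means an incomplete report fails loudly before it reaches a leaderboard.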

data/README.md CHANGED

@@ -4,23 +4,48 @@
 - `processed/` cleaned/aligned artifacts
 - `metadata/` manifests, speaker/dialect info, QA reports
 ## First Contribution (Normalization Starter)
 - `processed/normalization_seed_v0.1.tsv` starter normalization examples
 - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
 - `../scripts/validate_normalization.py` basic file validator
-## External Source: Mozilla Common Voice (Pashto)
-- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
-- Source page:
-  `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
-- Local target path: `data/raw/common_voice_scripted_ps_v24/`
-- Integration guide: `../docs/common_voice_pashto_24.md`
-### Notes
-- Keep raw downloaded dataset files out of git.
-- Track source URL + version in experiment notes for reproducibility.
-## Validate Seed File
 ```bash
 python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
 ```

 - `processed/` cleaned/aligned artifacts
 - `metadata/` manifests, speaker/dialect info, QA reports
+## ✅ Verified External Datasets
+### 🎙️ Common Voice Scripted Speech 24.0 - Pashto
+- Link: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
+- Why useful: a large, open, community-contributed Pashto speech source for ASR training and evaluation.
+- How to use here: download to `data/raw/common_voice_scripted_ps_v24/` and follow `../docs/common_voice_pashto_24.md`.
+### 🌸 Google FLEURS (Pashto config)
+- Link: `https://huggingface.co/datasets/google/fleurs`
+- Pashto validation: `fleurs.py` includes `"ps_af"`.
+- Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
+- How to use here: treat as external eval set for `benchmarks/` and avoid training/eval leakage.
+### 📖 OSCAR Corpus (Pashto web text)
+- Link: `https://huggingface.co/datasets/oscar-corpus/oscar`
+- Pashto validation: dataset includes `unshuffled_deduplicated_ps`.
+- Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
+- How to use here: normalize and sample into `data/processed/` for NLP/ASR language model support.
+### 📰 Wikimedia Wikipedia (Pashto dump)
+- Link: `https://huggingface.co/datasets/wikimedia/wikipedia`
+- Pashto validation: subset includes `20231101.ps`.
+- Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
+- How to use here: include as a high-quality text source in normalization and glossary workflows.
+### 📘 Belebele (reading-comprehension benchmark)
+- Link: `https://huggingface.co/datasets/facebook/belebele`
+- Pashto validation: subset includes `pbt_Arab`.
+- Why useful: downstream benchmark for tracking comprehension-oriented NLP progress in Pashto.
+- How to use here: benchmark multilingual encoders and track improvements in `benchmarks/`.
 ## First Contribution (Normalization Starter)
 - `processed/normalization_seed_v0.1.tsv` starter normalization examples
 - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
 - `../scripts/validate_normalization.py` basic file validator
+## 🧪 Validate Seed File
 ```bash
 python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
 ```
+## 📝 Notes
+- Keep raw downloaded dataset files out of git.
+- Track source URL + version in experiment notes for reproducibility.
+- Re-check external links before every milestone release.

docs/platforms.md CHANGED

@@ -1,15 +1,23 @@
 # 🌐 Platforms
-## 🧭 Primary platforms
 - GitHub: code, issues, pull requests, releases
 - Hugging Face Hub: models, datasets, demos
 - Community chat (Discord/Matrix): contributor coordination
-## 📣 Publishing expectations
-- Every release links to changelog + benchmark snapshot
-- Every model links to dataset provenance and eval metrics
-## 🤝 Community resource profiles
-- Hugging Face Pashto resource profile (external): `https://huggingface.co/ihanif`
-- Contributors can review this profile for reference datasets/models related to
-  Pashto inclusion work.

 # 🌐 Platforms
+## 🧭 Primary Platforms
 - GitHub: code, issues, pull requests, releases
 - Hugging Face Hub: models, datasets, demos
 - Community chat (Discord/Matrix): contributor coordination
+## 📚 Resource Discovery and Validation
+- Use `docs/resource_catalog.md` as the single source of truth for validated external resources.
+- Add new links only after checking official pages and explicit Pashto support markers.
+## 📣 Publishing Expectations
+- Every release links to changelog + benchmark snapshot.
+- Every model links to dataset provenance and eval metrics.
+- Every new external link must include use-case notes and where it belongs in repo structure.
+## 🚀 Dual Publish Checklist (GitHub + Hugging Face)
+1. `git status` is clean except intended changes.
+2. Docs and resource links updated.
+3. Commit created with clear scope.
+4. Push to `origin` (GitHub).
+5. Push to `hf` (Hugging Face).
+6. Verify README render and link health on both platforms.
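Steps 1, 4, and 5 of the checklist above could be scripted roughly as follows. This is a sketch, assuming the commit from step 3 has already been made (so the working tree should be fully clean) and that remotes named `origin` and `hf` exist, as the checklist states. `dual_publish` and `run_git` are hypothetical helper names; only `git status --porcelain` and `git push <remote> <branch>` are real Git commands. Step 6 (render and link checks) stays manual.

```python
import subprocess
from typing import Callable, List, Sequence


def run_git(args: Sequence[str]) -> str:
    """Run a git subcommand and return its stdout; raises if git exits non-zero."""
    completed = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return completed.stdout


def dual_publish(
    branch: str,
    remotes: Sequence[str] = ("origin", "hf"),
    git: Callable[[Sequence[str]], str] = run_git,
) -> List[str]:
    # Step 1: `git status --porcelain` prints nothing when the tree is clean.
    dirty = git(["status", "--porcelain"]).strip()
    if dirty:
        raise RuntimeError(f"working tree not clean:\n{dirty}")

    # Steps 4-5: push the same branch to each remote, in order.
    pushed = []
    for remote in remotes:
        git(["push", remote, branch])
        pushed.append(remote)
    return pushed
```

The `git` parameter is injectable so the flow can be dry-run or tested without touching a real repository.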

docs/resource_catalog.md ADDED

@@ -0,0 +1,58 @@
+# 📚 Verified Pashto Resource Catalog
+Last updated: `2026-02-14`
+This catalog lists external resources validated for Pashto relevance and possible integration in this repository.
+## ✅ Validation Method
+- Confirmed the source URL resolves to the official page.
+- Confirmed Pashto support by explicit code/name on the page where available (`ps`, `ps_af`, `pbt_Arab`, `pus`).
+- Added only resources with clear practical use for this repo (data, models, benchmarks, apps).
+## 🗂️ Datasets
+| Resource | Link | Pashto Validation | How to Use Here | Applications |
+|---|---|---|---|---|
+| Common Voice Scripted Speech 24.0 - Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | Official Pashto dataset page | ASR training/eval under `data/raw/common_voice_scripted_ps_v24/` | ASR, pronunciation analysis, data bootstrapping |
+| Google FLEURS | `https://huggingface.co/datasets/google/fleurs` | `fleurs.py` includes `ps_af` | External benchmark split in `benchmarks/` | Multilingual ASR benchmarking |
+| OSCAR Corpus | `https://huggingface.co/datasets/oscar-corpus/oscar` | Includes `unshuffled_deduplicated_ps` | Text LM and normalization support in `data/processed/` | NLP pretraining, lexicon growth |
+| Wikimedia Wikipedia | `https://huggingface.co/datasets/wikimedia/wikipedia` | Includes `20231101.ps` | High-quality text source and terminology checks | NLP, glossary, language modeling |
+| Belebele | `https://huggingface.co/datasets/facebook/belebele` | Includes `pbt_Arab` subset | Comprehension benchmark in `benchmarks/` | Multilingual reading-comprehension eval |
+## 🤖 Models
+| Resource | Link | Pashto Validation | How to Use Here | Applications |
+|---|---|---|---|---|
+| Whisper Large v3 | `https://huggingface.co/openai/whisper-large-v3` | Whisper tokenizer language map contains `"ps": "pashto"` | ASR baseline in `asr/` | Transcription, subtitle generation |
+| MMS language coverage | `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html` | Row for `pus` shows ASR/TTS support | Compare multilingual transfer baselines | Low-resource ASR/TTS transfer |
+| MMS TTS model collection | `https://huggingface.co/facebook/mms-tts` | Official MMS TTS collection aligned with coverage table | Evaluate multilingual Pashto TTS checkpoints in `tts/` | Speech synthesis and assistive audio |
+| NLLB-200 Distilled 600M | `https://huggingface.co/facebook/nllb-200-distilled-600M` | `special_tokens_map.json` contains `pbt_Arab` | Baseline translation experiments under `apps/` and `benchmarks/` | Pashto-centered MT pipelines |
+| OPUS MT en→mul | `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul` | Language list includes `pus` | Pashto translation baseline via multilingual target | Translation in demos and tooling |
+| OPUS MT mul→en | `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en` | Language list includes `pus` | Reverse translation baseline | Translation and bilingual UX |
+## 🧪 Benchmarks and Evaluation
+| Resource | Link | Recommended Metric Focus |
+|---|---|---|
+| FLEURS (speech) | `https://huggingface.co/datasets/google/fleurs` | WER, CER |
+| Common Voice Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | WER, CER, error buckets |
+| Belebele (reading) | `https://huggingface.co/datasets/facebook/belebele` | Accuracy/F1 |
+## 🖥️ Applications and Tooling
+| Resource | Link | How to Use Here |
+|---|---|---|
+| Faster-Whisper | `https://github.com/SYSTRAN/faster-whisper` | Production-style and local ASR inference speedups |
+| Coqui TTS | `https://github.com/coqui-ai/TTS` | Train/fine-tune Pashto TTS and run desktop synthesis |
+## 📄 Research Anchors (for reading and citation)
+- Whisper paper: `https://arxiv.org/abs/2212.04356`
+- MMS paper: `https://arxiv.org/abs/2305.13516`
+- NLLB paper: `https://arxiv.org/abs/2207.04672`
+- FLEURS paper: `https://arxiv.org/abs/2205.12446`
+## 🔄 Maintenance Rule
+Before each release, re-open each external link and confirm:
+- Resource still exists.
+- Pashto support marker is unchanged.
+- License/usage terms are still compatible with this project.
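The first two checks in the maintenance rule above (resource still exists, Pashto marker unchanged) could be partly automated along these lines. This is a sketch under assumptions: `CatalogEntry`, `fetch_text`, and `check_entries` are hypothetical names, and a plain substring match only works when the marker appears in the fetched page text. Markers that live in a separate file (such as `fleurs.py`) would need that file's URL instead, and the license check remains a manual step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    url: str
    marker: str  # explicit Pashto support marker expected in the fetched text


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page body as text using only the standard library."""
    request = Request(url, headers={"User-Agent": "resource-catalog-check"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_entries(
    entries: Iterable[CatalogEntry],
    fetch: Callable[[str], str] = fetch_text,
) -> Dict[str, str]:
    """Return a status per entry: 'ok', 'marker missing', or 'unreachable: ...'."""
    results = {}
    for entry in entries:
        try:
            text = fetch(entry.url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            results[entry.name] = f"unreachable: {exc}"
            continue
        results[entry.name] = "ok" if entry.marker in text else "marker missing"
    return results
```

Because the fetcher is a parameter, the same function can be exercised offline with canned page text before it is pointed at the live catalog links.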

docs/workstreams.md CHANGED

@@ -2,15 +2,23 @@
 ## 🗂️ Data
 - Collection guides, consent, validation, and metadata policy.
 ## 🎙️ ASR
 - Baselines, fine-tuning recipes, and evaluation scripts.
 ## 🔊 TTS
 - Baselines, speaker/style control, and quality assessment.
 ## 🧪 Benchmarks
 - Fixed test set, metric definitions, and leaderboard process.
 ## 🖥️ Applications
 - Desktop and API integrations for real-user testing.

 ## 🗂️ Data
 - Collection guides, consent, validation, and metadata policy.
+- External dataset references: `data/README.md`
 ## 🎙️ ASR
 - Baselines, fine-tuning recipes, and evaluation scripts.
+- External ASR models/tools: `asr/README.md`
 ## 🔊 TTS
 - Baselines, speaker/style control, and quality assessment.
+- External TTS models/tools: `tts/README.md`
 ## 🧪 Benchmarks
 - Fixed test set, metric definitions, and leaderboard process.
+- Benchmark resources: `benchmarks/README.md`
 ## 🖥️ Applications
 - Desktop and API integrations for real-user testing.
+- Integration resources: `apps/desktop/README.md`
+## 📚 Master Resource Index
+- Full validated list: `docs/resource_catalog.md`

tts/README.md CHANGED

@@ -1,3 +1,26 @@
 # 🔊 TTS Workspace
 Place TTS baselines, training configs, and quality-evaluation scripts here.

 # 🔊 TTS Workspace
 Place TTS baselines, training configs, and quality-evaluation scripts here.
+## ✅ Verified Pashto-Relevant TTS Resources
+### 🌐 Meta MMS Coverage (ASR + TTS language support)
+- Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
+- Pashto validation: row includes `pus` with TTS support.
+- Use in this repo: multilingual transfer baseline for Pashto synthesis.
+- Applications: baseline voice generation, pronunciation checks, accessibility tools.
+### 🧪 Meta MMS TTS Model Collection
+- Model card: `https://huggingface.co/facebook/mms-tts`
+- Why useful: broad multilingual TTS package with language-specific checkpoints.
+- Use in this repo: evaluate Pashto synthesis quality against curated text prompts.
+### 🛠️ Coqui TTS Toolkit
+- Repo: `https://github.com/coqui-ai/TTS`
+- Why useful: open training/inference toolkit to fine-tune custom Pashto voices.
+- Use in this repo: reproducible TTS training scripts and quality A/B experiments.
+## 🧩 Integration Hints
+- Keep text normalization consistent between ASR and TTS experiments.
+- Pair objective metrics with human listening checks in benchmark notes.
+- Document voice style, speaker metadata, and license provenance for every release.
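Keeping normalization consistent between ASR and TTS is easiest when both workspaces import one shared function. The sketch below is only a placeholder: it applies Unicode NFC composition and whitespace collapsing, which are assumptions made for illustration. The project's actual rules are defined in `docs/pashto_normalization_v0.1.md` and may differ; `normalize_text` is a hypothetical name.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Shared pre-processing for ASR references and TTS prompts (illustrative only)."""
    # Compose characters canonically so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return _WHITESPACE.sub(" ", text).strip()
```

Calling the same function on ASR reference transcripts and on TTS input prompts means a policy change lands in both experiments at once.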