Commit 0052610 (committed by musaw) · 1 Parent(s): 4f4e013

docs: add validated Pashto resource catalog and workspace guides

CONTRIBUTING.md CHANGED
@@ -2,15 +2,15 @@
2
 
3
  Thanks for helping build open Pashto AI resources.
4
 
5
- ## 🧩 Ways to contribute
6
  - Data recording and validation
7
  - Text normalization and terminology fixes
8
  - Model training/evaluation scripts
9
  - Documentation, issue triage, and testing
 
10
 
11
  ## 🌐 Mozilla Common Voice Path
12
- You can contribute to Pashto data directly on Common Voice and connect it back
13
- to this project.
14
 
15
  Common Voice Pashto actions:
16
  - Speak: `https://commonvoice.mozilla.org/ps/speak`
@@ -23,7 +23,16 @@ Then contribute here by opening an issue/PR with:
23
  - what data quality gap you observed,
24
  - what concrete follow-up is needed in this repository.
25
 
26
- ## 🔄 Contribution flow
27
  1. Open or pick an issue.
28
  2. Comment your plan.
29
  3. Create a branch and make focused changes.
@@ -35,7 +44,7 @@ Then contribute here by opening an issue/PR with:
35
  - Document assumptions, limitations, and risks.
36
  - Respect contributors and community guidelines.
37
 
38
- ## 🏷️ Priority labels (recommended)
39
  - `good first issue`
40
  - `data`
41
  - `asr`
 
2
 
3
  Thanks for helping build open Pashto AI resources.
4
 
5
+ ## 🧩 Ways to Contribute
6
  - Data recording and validation
7
  - Text normalization and terminology fixes
8
  - Model training/evaluation scripts
9
  - Documentation, issue triage, and testing
10
+ - External resource discovery and validation
11
 
12
  ## 🌐 Mozilla Common Voice Path
13
+ You can contribute to Pashto data directly on Common Voice and connect it back to this project.
 
14
 
15
  Common Voice Pashto actions:
16
  - Speak: `https://commonvoice.mozilla.org/ps/speak`
 
23
  - what data quality gap you observed,
24
  - what concrete follow-up is needed in this repository.
25
 
26
+ ## 🔍 External Resource Contribution Rules
27
+ - Add links in the correct workspace README (`data`, `asr`, `tts`, `benchmarks`, `apps`).
28
+ - Update `docs/resource_catalog.md` with:
29
+ - what the resource is,
30
+ - explicit Pashto support evidence,
31
+ - how it can be used in this repository,
32
+ - practical applications.
33
+ - Prefer official pages and model/dataset cards over third-party reposts.
34
+
35
+ ## 🔄 Contribution Flow
36
  1. Open or pick an issue.
37
  2. Comment your plan.
38
  3. Create a branch and make focused changes.
 
44
  - Document assumptions, limitations, and risks.
45
  - Respect contributors and community guidelines.
46
 
47
+ ## 🏷️ Priority Labels (Recommended)
48
  - `good first issue`
49
  - `data`
50
  - `asr`
README.md CHANGED
@@ -22,11 +22,31 @@ Community-led open-source project to make Pashto a first-class language in AI sp
22
  - Keep work reproducible, transparent, and contribution-friendly.
23
  - Focus on public good and broad accessibility.
24
 
25
- ## 📚 Featured External Dataset
26
- - Common Voice Scripted Speech 24.0 - Pashto
27
- - Source:
28
- [https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
29
- - Project integration guide: [docs/common_voice_pashto_24.md](docs/common_voice_pashto_24.md)
30
 
31
  ## 🙌 Contribute Through Mozilla Common Voice
32
  - Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
@@ -34,21 +54,9 @@ Community-led open-source project to make Pashto a first-class language in AI sp
34
  - Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
35
  - Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
36
 
37
- ## 🌐 Community Resource Profiles
38
- - Hugging Face (external Pashto resource profile): [https://huggingface.co/ihanif](https://huggingface.co/ihanif)
39
- - Use this profile as a reference point for Pashto ASR/TTS datasets, models, and
40
- community experiments.
41
-
42
- ## 🚀 Start Here
43
- 📘 Purpose: `PROJECT_PURPOSE.md`
44
- 🤝 Contributing: `CONTRIBUTING.md`
45
- 🗺️ Roadmap: `ROADMAP.md`
46
- 🏛️ Governance: `GOVERNANCE.md`
47
- 💬 Community coordination: `community/COMMUNICATION.md`
48
-
49
- ## 🧩 Initial Workstreams
50
- - `data/` Pashto data collection, cleaning, metadata
51
- - `asr/` speech-to-text baselines and experiments
52
- - `tts/` text-to-speech baselines and experiments
53
- - `benchmarks/` fixed test sets and evaluation scripts
54
- - `apps/desktop/` app integration references
 
22
  - Keep work reproducible, transparent, and contribution-friendly.
23
  - Focus on public good and broad accessibility.
24
 
25
+ ## 🧭 Documentation Map
26
+ - Purpose: `PROJECT_PURPOSE.md`
27
+ - Contributing: `CONTRIBUTING.md`
28
+ - Roadmap: `ROADMAP.md`
29
+ - Governance: `GOVERNANCE.md`
30
+ - Community: `community/COMMUNICATION.md`
31
+ - Release process: `docs/release_process.md`
32
+ - Workstreams: `docs/workstreams.md`
33
+ - Verified external resources: `docs/resource_catalog.md`
34
+
35
+ ## 📚 Verified Resource Catalog
36
+ The project now tracks validated external resources in one place:
37
+ - `docs/resource_catalog.md`
38
+
39
+ This catalog includes:
40
+ - Datasets
41
+ - Models
42
+ - Benchmarks
43
+ - Tools and applications
44
+ - Validation notes and integration guidance
45
+
46
+ ## 🎙️ Featured Dataset: Common Voice Pashto
47
+ - Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
48
+ - Source: [https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
49
+ - Integration guide: `docs/common_voice_pashto_24.md`
50
 
51
  ## 🙌 Contribute Through Mozilla Common Voice
52
  - Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
 
54
  - Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
55
  - Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
56
 
57
+ ## 🧩 Workspaces
58
+ - `data/` datasets, curation, metadata, quality
59
+ - `asr/` ASR baselines and experiments
60
+ - `tts/` TTS baselines and experiments
61
+ - `benchmarks/` benchmark sets and evaluation
62
+ - `apps/desktop/` user-facing integration references
apps/desktop/README.md CHANGED
@@ -1,3 +1,26 @@
1
  # 🖥️ Desktop Integration
2
 
3
  Tracks desktop app integration for ASR/TTS/translation pipelines.
 
1
  # 🖥️ Desktop Integration
2
 
3
  Tracks desktop app integration for ASR/TTS/translation pipelines.
4
+
5
+ ## ✅ Verified Application Building Blocks
6
+
7
+ ### 🎀 Speech Input: Faster-Whisper
8
+ - Repo: `https://github.com/SYSTRAN/faster-whisper`
9
+ - Use in apps: fast offline/near-real-time transcription components.
10
+
11
+ ### 🔈 Speech Output: Coqui TTS
12
+ - Repo: `https://github.com/coqui-ai/TTS`
13
+ - Use in apps: local speech synthesis modules for Pashto-enabled UX.
14
+
15
+ ### 🌍 Translation Layer: OPUS MT (via multilingual models)
16
+ - Models:
17
+ - `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul`
18
+ - `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en`
19
+ - Pashto validation: language list includes `pus`.
20
+ - Use in apps: Pashto↔English assistive translation path for demos.
21
+
22
+ ## 🧩 Suggested Desktop Pipeline
23
+ 1. Mic input → ASR transcription
24
+ 2. Optional translation (Pashto ↔ English)
25
+ 3. Optional TTS playback in Pashto
26
+ 4. Save logs for QA and benchmark replay
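
The four pipeline steps above can be sketched as a small orchestration function. This is a minimal sketch, not the project's implementation: `run_pipeline`, `PipelineResult`, and the three injected callables are hypothetical names, and the real ASR, translation, and TTS backends (for example Faster-Whisper, OPUS MT, Coqui TTS) are assumed to be wrapped behind plain functions.

```python
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class PipelineResult:
    """One pipeline run, kept small so it can be logged as a JSON line."""
    audio_path: str
    transcript: str
    translation: Optional[str]
    tts_output: Optional[str]
    timestamp: float


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    translate: Optional[Callable[[str], str]] = None,
    synthesize: Optional[Callable[[str], str]] = None,
    log_path: Optional[Path] = None,
) -> PipelineResult:
    """Run ASR, then optional translation and optional TTS, then log the run.

    All callables are hypothetical wrappers supplied by the caller:
    transcribe(audio_path) returns text, translate(text) returns text,
    synthesize(text) returns the path of the generated audio file.
    """
    # Step 1: mic input (already saved to audio_path) to ASR transcription.
    transcript = transcribe(audio_path)
    # Step 2: optional translation of the transcript.
    translation = translate(transcript) if translate else None
    # Step 3: optional TTS playback; here the Pashto transcript is synthesized.
    tts_output = synthesize(transcript) if synthesize else None
    result = PipelineResult(audio_path, transcript, translation, tts_output, time.time())
    # Step 4: append one JSON line per run for QA and benchmark replay.
    if log_path is not None:
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(asdict(result), ensure_ascii=False) + "\n")
    return result
```

Passing the backends in as functions keeps the sketch independent of any one library, so a component can be swapped without touching the logging format.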
asr/README.md CHANGED
@@ -1,3 +1,29 @@
1
  # 🎙️ ASR Workspace
2
 
3
  Place ASR baselines, training configs, and evaluation scripts here.
 
1
  # 🎙️ ASR Workspace
2
 
3
  Place ASR baselines, training configs, and evaluation scripts here.
4
+
5
+ ## ✅ Verified Pashto-Relevant ASR Models
6
+
7
+ ### 🧠 OpenAI Whisper Large v3
8
+ - Model: `https://huggingface.co/openai/whisper-large-v3`
9
+ - Pashto validation: Whisper tokenizer language map includes `"ps": "pashto"`.
10
+ - Use in this repo: strong baseline and pseudo-labeling engine for bootstrapping.
11
+ - Applications: transcription, subtitle generation, dataset pre-labeling.
12
+
13
+ ### 🌐 Meta MMS Coverage (ASR + TTS language support)
14
+ - Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
15
+ - Pashto validation: row includes `pus` with ASR and TTS support.
16
+ - Use in this repo: multilingual transfer baseline when Pashto data is limited.
17
+ - Applications: low-resource ASR transfer experiments.
18
+
19
+ ## ⚙️ Verified Inference Tooling
20
+
21
+ ### 🚀 Faster-Whisper
22
+ - Repo: `https://github.com/SYSTRAN/faster-whisper`
23
+ - Why useful: optimized Whisper inference for faster experimentation.
24
+ - Use in this repo: local transcription pipelines and benchmark generation speedups.
25
+
26
+ ## 🧩 Integration Hints
27
+ - Keep all model/eval runs reproducible with command logs and commit hashes.
28
+ - Store evaluation outputs under `benchmarks/` with model/version labels.
29
+ - Track WER/CER with dataset split and normalization policy references.
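
For the WER/CER hint above, the following is a minimal standard-library sketch of both metrics using Levenshtein edit distance. The function names are hypothetical, and the choice to drop whitespace before computing CER is an assumption; whatever convention the project adopts should be recorded alongside the normalization policy.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; whitespace is dropped first (an assumption)."""
    ref_chars = "".join(reference.split())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, "".join(hypothesis.split())) / len(ref_chars)
```

Both functions expect text that has already been normalized the same way on the reference and hypothesis sides; otherwise spelling variants are counted as recognition errors.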
benchmarks/README.md CHANGED
@@ -1,3 +1,31 @@
1
  # 🧪 Benchmarks
2
 
3
  Define fixed test sets, metrics, and leaderboard generation scripts.
 
1
  # 🧪 Benchmarks
2
 
3
  Define fixed test sets, metrics, and leaderboard generation scripts.
4
+
5
+ ## ✅ Verified Benchmark Sources
6
+
7
+ ### 🌸 FLEURS (Pashto speech benchmark)
8
+ - Dataset: `https://huggingface.co/datasets/google/fleurs`
9
+ - Pashto validation: `fleurs.py` includes `ps_af`.
10
+ - Primary use: multilingual ASR benchmark with fixed split conventions.
11
+
12
+ ### 📘 Belebele (Pashto reading benchmark)
13
+ - Dataset: `https://huggingface.co/datasets/facebook/belebele`
14
+ - Pashto validation: subset includes `pbt_Arab`.
15
+ - Primary use: comprehension benchmark for multilingual NLP models.
16
+
17
+ ### 🗣️ Common Voice Pashto v24
18
+ - Dataset: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
19
+ - Primary use: ASR train/dev/test experiments and project baseline tracking.
20
+
21
+ ## 📏 Recommended Metrics
22
+ - ASR: `WER`, `CER`
23
+ - TTS: `MCD`/objective proxies + human MOS-style scoring
24
+ - NLP: task-specific accuracy/F1 with fixed test set
25
+
26
+ ## 🧾 Reporting Template
27
+ - Benchmark dataset + version
28
+ - Model + checkpoint version
29
+ - Normalization policy version
30
+ - Metrics and error analysis summary
31
+ - Reproducible command/config reference
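
One possible way to keep reports consistent with the template above is a small record type that refuses to serialize when a field is empty. This is a sketch only: `BenchmarkReport` and its field names are hypothetical and simply mirror the five template bullets.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class BenchmarkReport:
    """Hypothetical record mirroring the reporting template fields."""
    dataset: str                 # benchmark dataset name
    dataset_version: str         # dataset version or snapshot id
    model: str                   # model identifier
    checkpoint: str              # checkpoint version or revision
    normalization_policy: str    # normalization policy version
    metrics: Dict[str, float]    # metric name to value, e.g. WER and CER
    error_analysis: str          # short error analysis summary
    command: str                 # reproducible command or config reference

    def validate(self) -> None:
        """Raise ValueError if any template field was left empty."""
        for name, value in asdict(self).items():
            if value in ("", {}, None):
                raise ValueError(f"missing report field: {name}")

    def to_json(self) -> str:
        """Serialize with stable key order so reports diff cleanly in git."""
        self.validate()
        return json.dumps(asdict(self), ensure_ascii=False, indent=2, sort_keys=True)
```

Sorted keys and fixed indentation are a design choice that keeps successive reports easy to compare in pull requests.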
data/README.md CHANGED
@@ -4,23 +4,48 @@
4
  - `processed/` cleaned/aligned artifacts
5
  - `metadata/` manifests, speaker/dialect info, QA reports
6
 
7
  ## First Contribution (Normalization Starter)
8
  - `processed/normalization_seed_v0.1.tsv` starter normalization examples
9
  - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
10
  - `../scripts/validate_normalization.py` basic file validator
11
 
12
- ## External Source: Mozilla Common Voice (Pashto)
13
- - Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
14
- - Source page:
15
- `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
16
- - Local target path: `data/raw/common_voice_scripted_ps_v24/`
17
- - Integration guide: `../docs/common_voice_pashto_24.md`
18
-
19
- ### Notes
20
- - Keep raw downloaded dataset files out of git.
21
- - Track source URL + version in experiment notes for reproducibility.
22
-
23
- ## Validate Seed File
24
  ```bash
25
  python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
26
  ```
 
4
  - `processed/` cleaned/aligned artifacts
5
  - `metadata/` manifests, speaker/dialect info, QA reports
6
 
7
+ ## ✅ Verified External Datasets
8
+
9
+ ### 🎙️ Common Voice Scripted Speech 24.0 - Pashto
10
+ - Link: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
11
+ - Why useful: largest open community Pashto speech source for ASR training and evaluation.
12
+ - How to use here: download to `data/raw/common_voice_scripted_ps_v24/` and follow `../docs/common_voice_pashto_24.md`.
13
+
14
+ ### 🌸 Google FLEURS (Pashto config)
15
+ - Link: `https://huggingface.co/datasets/google/fleurs`
16
+ - Pashto validation: `fleurs.py` includes `"ps_af"`.
17
+ - Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
18
+ - How to use here: treat as external eval set for `benchmarks/` and avoid training/eval leakage.
19
+
20
+ ### 📖 OSCAR Corpus (Pashto web text)
21
+ - Link: `https://huggingface.co/datasets/oscar-corpus/oscar`
22
+ - Pashto validation: dataset includes `unshuffled_deduplicated_ps`.
23
+ - Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
24
+ - How to use here: normalize and sample into `data/processed/` for NLP/ASR language model support.
25
+
26
+ ### 📰 Wikimedia Wikipedia (Pashto dump)
27
+ - Link: `https://huggingface.co/datasets/wikimedia/wikipedia`
28
+ - Pashto validation: subset includes `20231101.ps`.
29
+ - Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
30
+ - How to use here: include as a high-quality text source in normalization and glossary workflows.
31
+
32
+ ### 📘 Belebele (reading-comprehension benchmark)
33
+ - Link: `https://huggingface.co/datasets/facebook/belebele`
34
+ - Pashto validation: subset includes `pbt_Arab`.
35
+ - Why useful: downstream benchmark for tracking comprehension-oriented NLP progress in Pashto.
36
+ - How to use here: benchmark multilingual encoders and track improvements in `benchmarks/`.
37
+
38
  ## First Contribution (Normalization Starter)
39
  - `processed/normalization_seed_v0.1.tsv` starter normalization examples
40
  - `../docs/pashto_normalization_v0.1.md` baseline normalization policy
41
  - `../scripts/validate_normalization.py` basic file validator
42
 
43
+ ## 🧪 Validate Seed File
44
  ```bash
45
  python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
46
  ```
47
+
48
+ ## 📝 Notes
49
+ - Keep raw downloaded dataset files out of git.
50
+ - Track source URL + version in experiment notes for reproducibility.
51
+ - Re-check external links before every milestone release.
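
The FLEURS entry in this file asks contributors to avoid training/eval leakage. A minimal sketch of one such check follows, assuming transcripts are available as plain strings; `text_key` and `find_leakage` are hypothetical helpers, and the NFC-plus-whitespace normalization used here is an assumption, not the project's normalization policy.

```python
import hashlib
import unicodedata
from typing import Iterable, Set


def text_key(text: str) -> str:
    """Stable key for a sentence: NFC-normalize, collapse whitespace, hash."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leakage(train_texts: Iterable[str], eval_texts: Iterable[str]) -> Set[str]:
    """Return eval sentences whose normalized text also appears in training data."""
    train_keys = {text_key(text) for text in train_texts}
    return {text for text in eval_texts if text_key(text) in train_keys}
```

This only catches exact sentence overlap after light normalization. Speaker overlap and near-duplicate prompts need separate checks against the dataset metadata.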
docs/platforms.md CHANGED
@@ -1,15 +1,23 @@
1
  # 🌐 Platforms
2
 
3
- ## 🧭 Primary platforms
4
  - GitHub: code, issues, pull requests, releases
5
  - Hugging Face Hub: models, datasets, demos
6
  - Community chat (Discord/Matrix): contributor coordination
7
 
8
- ## 📣 Publishing expectations
9
- - Every release links to changelog + benchmark snapshot
10
- - Every model links to dataset provenance and eval metrics
11
 
12
- ## 🀝 Community resource profiles
13
- - Hugging Face Pashto resource profile (external): `https://huggingface.co/ihanif`
14
- - Contributors can review this profile for reference datasets/models related to
15
- Pashto inclusion work.
 
1
  # 🌐 Platforms
2
 
3
+ ## 🧭 Primary Platforms
4
  - GitHub: code, issues, pull requests, releases
5
  - Hugging Face Hub: models, datasets, demos
6
  - Community chat (Discord/Matrix): contributor coordination
7
 
8
+ ## 📚 Resource Discovery and Validation
9
+ - Use `docs/resource_catalog.md` as the single source of truth for validated external resources.
10
+ - Add new links only after checking official pages and explicit Pashto support markers.
11
 
12
+ ## 📣 Publishing Expectations
13
+ - Every release links to changelog + benchmark snapshot.
14
+ - Every model links to dataset provenance and eval metrics.
15
+ - Every new external link must include use-case notes and where it belongs in repo structure.
16
+
17
+ ## 🚀 Dual Publish Checklist (GitHub + Hugging Face)
18
+ 1. `git status` is clean except intended changes.
19
+ 2. Docs and resource links updated.
20
+ 3. Commit created with clear scope.
21
+ 4. Push to `origin` (GitHub).
22
+ 5. Push to `hf` (Hugging Face).
23
+ 6. Verify README render and link health on both platforms.
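
The push steps of the checklist above could be scripted roughly as follows. This is a sketch under stated assumptions: the remote names `origin` and `hf` come from the checklist, while `dual_publish`, the `run` hook, and the branch argument are hypothetical. The script refuses to push when `git status --porcelain` reports anything, so it is meant to run after the commit in step 3.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes git arguments and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def git(args: Sequence[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    done = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return done.stdout


def dual_publish(
    branch: str,
    run: Runner = git,
    remotes: Sequence[str] = ("origin", "hf"),
) -> List[str]:
    """Push one branch to every remote, but only from a clean working tree."""
    # `git status --porcelain` prints nothing when the tree is clean.
    dirty = run(["status", "--porcelain"]).strip()
    if dirty:
        raise RuntimeError("working tree has uncommitted changes:\n" + dirty)
    pushed: List[str] = []
    for remote in remotes:
        run(["push", remote, branch])
        pushed.append(remote)
    return pushed
```

Step 6 (README rendering and link health on both platforms) still needs a manual look after the pushes complete.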
docs/resource_catalog.md ADDED
@@ -0,0 +1,58 @@
 
1
+ # 📚 Verified Pashto Resource Catalog
2
+
3
+ Last updated: `2026-02-14`
4
+
5
+ This catalog lists external resources validated for Pashto relevance and possible integration in this repository.
6
+
7
+ ## ✅ Validation Method
8
+ - Confirmed the source URL resolves to the official page.
9
+ - Confirmed Pashto support by explicit code/name on the page where available (`ps`, `ps_af`, `pbt_Arab`, `pus`).
10
+ - Added only resources with clear practical use for this repo (data, models, benchmarks, apps).
11
+
12
+ ## 🗂️ Datasets
13
+
14
+ | Resource | Link | Pashto Validation | How to Use Here | Applications |
15
+ |---|---|---|---|---|
16
+ | Common Voice Scripted Speech 24.0 - Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | Official Pashto dataset page | ASR training/eval under `data/raw/common_voice_scripted_ps_v24/` | ASR, pronunciation analysis, data bootstrapping |
17
+ | Google FLEURS | `https://huggingface.co/datasets/google/fleurs` | `fleurs.py` includes `ps_af` | External benchmark split in `benchmarks/` | Multilingual ASR benchmarking |
18
+ | OSCAR Corpus | `https://huggingface.co/datasets/oscar-corpus/oscar` | Includes `unshuffled_deduplicated_ps` | Text LM and normalization support in `data/processed/` | NLP pretraining, lexicon growth |
19
+ | Wikimedia Wikipedia | `https://huggingface.co/datasets/wikimedia/wikipedia` | Includes `20231101.ps` | High-quality text source and terminology checks | NLP, glossary, language modeling |
20
+ | Belebele | `https://huggingface.co/datasets/facebook/belebele` | Includes `pbt_Arab` subset | Comprehension benchmark in `benchmarks/` | Multilingual reading-comprehension eval |
21
+
22
+ ## 🤖 Models
23
+
24
+ | Resource | Link | Pashto Validation | How to Use Here | Applications |
25
+ |---|---|---|---|---|
26
+ | Whisper Large v3 | `https://huggingface.co/openai/whisper-large-v3` | Whisper tokenizer language map contains `"ps": "pashto"` | ASR baseline in `asr/` | Transcription, subtitle generation |
27
+ | MMS language coverage | `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html` | Row for `pus` shows ASR/TTS support | Compare multilingual transfer baselines | Low-resource ASR/TTS transfer |
28
+ | MMS TTS model collection | `https://huggingface.co/facebook/mms-tts` | Official MMS TTS collection aligned with coverage table | Evaluate multilingual Pashto TTS checkpoints in `tts/` | Speech synthesis and assistive audio |
29
+ | NLLB-200 Distilled 600M | `https://huggingface.co/facebook/nllb-200-distilled-600M` | `special_tokens_map.json` contains `pbt_Arab` | Baseline translation experiments under `apps/` and `benchmarks/` | Pashto-centered MT pipelines |
30
+ | OPUS MT en→mul | `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul` | Language list includes `pus` | Pashto translation baseline via multilingual target | Translation in demos and tooling |
31
+ | OPUS MT mul→en | `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en` | Language list includes `pus` | Reverse translation baseline | Translation and bilingual UX |
32
+
33
+ ## 🧪 Benchmarks and Evaluation
34
+
35
+ | Resource | Link | Recommended Metric Focus |
36
+ |---|---|---|
37
+ | FLEURS (speech) | `https://huggingface.co/datasets/google/fleurs` | WER, CER |
38
+ | Common Voice Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | WER, CER, error buckets |
39
+ | Belebele (reading) | `https://huggingface.co/datasets/facebook/belebele` | Accuracy/F1 |
40
+
41
+ ## 🖥️ Applications and Tooling
42
+
43
+ | Resource | Link | How to Use Here |
44
+ |---|---|---|
45
+ | Faster-Whisper | `https://github.com/SYSTRAN/faster-whisper` | Production-style and local ASR inference speedups |
46
+ | Coqui TTS | `https://github.com/coqui-ai/TTS` | Train/fine-tune Pashto TTS and run desktop synthesis |
47
+
48
+ ## 📄 Research Anchors (for reading and citation)
49
+ - Whisper paper: `https://arxiv.org/abs/2212.04356`
50
+ - MMS paper: `https://arxiv.org/abs/2305.13516`
51
+ - NLLB paper: `https://arxiv.org/abs/2207.04672`
52
+ - FLEURS paper: `https://arxiv.org/abs/2205.12446`
53
+
54
+ ## 🔄 Maintenance Rule
55
+ Before each release, re-open each external link and confirm:
56
+ - Resource still exists.
57
+ - Pashto support marker is unchanged.
58
+ - License/usage terms are still compatible with this project.
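
The first maintenance check above (resource still exists) can be partly automated. The sketch below assumes catalog links are written as backtick-quoted URLs, as in the tables of this file; `extract_links`, `http_status`, and `check_links` are hypothetical helper names. The Pashto support marker and license checks remain manual.

```python
import re
import urllib.request
from typing import Callable, Dict, List

# Matches URLs wrapped in backticks, e.g. `https://example.org/page`.
URL_PATTERN = re.compile(r"`(https?://[^`\s]+)`")


def extract_links(markdown: str) -> List[str]:
    """Return backtick-quoted URLs in first-seen order, without duplicates."""
    seen: List[str] = []
    for url in URL_PATTERN.findall(markdown):
        if url not in seen:
            seen.append(url)
    return seen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def check_links(markdown: str, fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each catalog URL to True if it answered with a 2xx or 3xx status."""
    results: Dict[str, bool] = {}
    for url in extract_links(markdown):
        try:
            results[url] = 200 <= fetch(url) < 400
        except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
            results[url] = False
    return results
```

Some servers reject HEAD requests, so a `False` result is a prompt to open the link by hand rather than proof that the resource is gone.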
docs/workstreams.md CHANGED
@@ -2,15 +2,23 @@
2
 
3
  ## 🗂️ Data
4
  - Collection guides, consent, validation, and metadata policy.
 
5
 
6
  ## 🎙️ ASR
7
  - Baselines, fine-tuning recipes, and evaluation scripts.
 
8
 
9
  ## 🔊 TTS
10
  - Baselines, speaker/style control, and quality assessment.
 
11
 
12
  ## 🧪 Benchmarks
13
  - Fixed test set, metric definitions, and leaderboard process.
 
14
 
15
  ## 🖥️ Applications
16
  - Desktop and API integrations for real-user testing.
 
2
 
3
  ## 🗂️ Data
4
  - Collection guides, consent, validation, and metadata policy.
5
+ - External dataset references: `data/README.md`
6
 
7
  ## 🎙️ ASR
8
  - Baselines, fine-tuning recipes, and evaluation scripts.
9
+ - External ASR models/tools: `asr/README.md`
10
 
11
  ## 🔊 TTS
12
  - Baselines, speaker/style control, and quality assessment.
13
+ - External TTS models/tools: `tts/README.md`
14
 
15
  ## 🧪 Benchmarks
16
  - Fixed test set, metric definitions, and leaderboard process.
17
+ - Benchmark resources: `benchmarks/README.md`
18
 
19
  ## 🖥️ Applications
20
  - Desktop and API integrations for real-user testing.
21
+ - Integration resources: `apps/desktop/README.md`
22
+
23
+ ## 📚 Master Resource Index
24
+ - Full validated list: `docs/resource_catalog.md`
tts/README.md CHANGED
@@ -1,3 +1,26 @@
1
  # 🔊 TTS Workspace
2
 
3
  Place TTS baselines, training configs, and quality-evaluation scripts here.
 
1
  # 🔊 TTS Workspace
2
 
3
  Place TTS baselines, training configs, and quality-evaluation scripts here.
4
+
5
+ ## ✅ Verified Pashto-Relevant TTS Resources
6
+
7
+ ### 🌐 Meta MMS Coverage (ASR + TTS language support)
8
+ - Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
9
+ - Pashto validation: row includes `pus` with TTS support.
10
+ - Use in this repo: multilingual transfer baseline for Pashto synthesis.
11
+ - Applications: baseline voice generation, pronunciation checks, accessibility tools.
12
+
13
+ ### 🧪 Meta MMS TTS Model Collection
14
+ - Model card: `https://huggingface.co/facebook/mms-tts`
15
+ - Why useful: broad multilingual TTS package with language-specific checkpoints.
16
+ - Use in this repo: evaluate Pashto synthesis quality against curated text prompts.
17
+
18
+ ### 🛠️ Coqui TTS Toolkit
19
+ - Repo: `https://github.com/coqui-ai/TTS`
20
+ - Why useful: open training/inference toolkit to fine-tune custom Pashto voices.
21
+ - Use in this repo: reproducible TTS training scripts and quality A/B experiments.
22
+
23
+ ## 🧩 Integration Hints
24
+ - Keep text normalization consistent between ASR and TTS experiments.
25
+ - Pair objective metrics with human listening checks in benchmark notes.
26
+ - Document voice style, speaker metadata, and license provenance for every release.