musaw committed on
Commit ·
0052610
1
Parent(s): 4f4e013
docs: add validated Pashto resource catalog and workspace guides
- CONTRIBUTING.md +14 -5
- README.md +31 -23
- apps/desktop/README.md +23 -0
- asr/README.md +26 -0
- benchmarks/README.md +28 -0
- data/README.md +37 -12
- docs/platforms.md +16 -8
- docs/resource_catalog.md +58 -0
- docs/workstreams.md +8 -0
- tts/README.md +23 -0
CONTRIBUTING.md
CHANGED
|
@@ -2,15 +2,15 @@
|
|
| 2 |
|
| 3 |
Thanks for helping build open Pashto AI resources.
|
| 4 |
|
| 5 |
-
## 🧩 Ways to
|
| 6 |
- Data recording and validation
|
| 7 |
- Text normalization and terminology fixes
|
| 8 |
- Model training/evaluation scripts
|
| 9 |
- Documentation, issue triage, and testing
|
|
|
|
| 10 |
|
| 11 |
## Mozilla Common Voice Path
|
| 12 |
-
You can contribute to Pashto data directly on Common Voice and connect it back
|
| 13 |
-
to this project.
|
| 14 |
|
| 15 |
Common Voice Pashto actions:
|
| 16 |
- Speak: `https://commonvoice.mozilla.org/ps/speak`
|
|
@@ -23,7 +23,16 @@ Then contribute here by opening an issue/PR with:
|
|
| 23 |
- what data quality gap you observed,
|
| 24 |
- what concrete follow-up is needed in this repository.
|
| 25 |
|
| 26 |
-
##
|
| 27 |
1. Open or pick an issue.
|
| 28 |
2. Comment your plan.
|
| 29 |
3. Create a branch and make focused changes.
|
|
@@ -35,7 +44,7 @@ Then contribute here by opening an issue/PR with:
|
|
| 35 |
- Document assumptions, limitations, and risks.
|
| 36 |
- Respect contributors and community guidelines.
|
| 37 |
|
| 38 |
-
## 🏷️ Priority
|
| 39 |
- `good first issue`
|
| 40 |
- `data`
|
| 41 |
- `asr`
|
|
|
|
| 2 |
|
| 3 |
Thanks for helping build open Pashto AI resources.
|
| 4 |
|
| 5 |
+
## 🧩 Ways to Contribute
|
| 6 |
- Data recording and validation
|
| 7 |
- Text normalization and terminology fixes
|
| 8 |
- Model training/evaluation scripts
|
| 9 |
- Documentation, issue triage, and testing
|
| 10 |
+
- External resource discovery and validation
|
| 11 |
|
| 12 |
## Mozilla Common Voice Path
|
| 13 |
+
You can contribute to Pashto data directly on Common Voice and connect it back to this project.
|
|
|
|
| 14 |
|
| 15 |
Common Voice Pashto actions:
|
| 16 |
- Speak: `https://commonvoice.mozilla.org/ps/speak`
|
|
|
|
| 23 |
- what data quality gap you observed,
|
| 24 |
- what concrete follow-up is needed in this repository.
|
| 25 |
|
| 26 |
+
## External Resource Contribution Rules
|
| 27 |
+
- Add links in the correct workspace README (`data`, `asr`, `tts`, `benchmarks`, `apps`).
|
| 28 |
+
- Update `docs/resource_catalog.md` with:
|
| 29 |
+
  - what the resource is,
|
| 30 |
+
  - explicit Pashto support evidence,
|
| 31 |
+
  - how it can be used in this repository,
|
| 32 |
+
  - practical applications.
|
| 33 |
+
- Prefer official pages and model/dataset cards over third-party reposts.
|
| 34 |
+
|
| 35 |
+
## Contribution Flow
|
| 36 |
1. Open or pick an issue.
|
| 37 |
2. Comment your plan.
|
| 38 |
3. Create a branch and make focused changes.
|
|
|
|
| 44 |
- Document assumptions, limitations, and risks.
|
| 45 |
- Respect contributors and community guidelines.
|
| 46 |
|
| 47 |
+
## 🏷️ Priority Labels (Recommended)
|
| 48 |
- `good first issue`
|
| 49 |
- `data`
|
| 50 |
- `asr`
|
README.md
CHANGED
|
@@ -22,11 +22,31 @@ Community-led open-source project to make Pashto a first-class language in AI sp
|
|
| 22 |
- Keep work reproducible, transparent, and contribution-friendly.
|
| 23 |
- Focus on public good and broad accessibility.
|
| 24 |
|
| 25 |
-
##
|
| 26 |
-
-
|
| 27 |
-
-
|
| 28 |
-
|
| 29 |
-
-
|
| 30 |
|
| 31 |
## Contribute Through Mozilla Common Voice
|
| 32 |
- Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
|
|
@@ -34,21 +54,9 @@ Community-led open-source project to make Pashto a first-class language in AI sp
|
|
| 34 |
- Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
|
| 35 |
- Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
|
| 36 |
|
| 37 |
-
##
|
| 38 |
-
-
|
| 39 |
-
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
- Purpose: `PROJECT_PURPOSE.md`
|
| 44 |
-
- Contributing: `CONTRIBUTING.md`
|
| 45 |
-
- Roadmap: `ROADMAP.md`
|
| 46 |
-
- Governance: `GOVERNANCE.md`
|
| 47 |
-
- Community coordination: `community/COMMUNICATION.md`
|
| 48 |
-
|
| 49 |
-
## 🧩 Initial Workstreams
|
| 50 |
-
- `data/` Pashto data collection, cleaning, metadata
|
| 51 |
-
- `asr/` speech-to-text baselines and experiments
|
| 52 |
-
- `tts/` text-to-speech baselines and experiments
|
| 53 |
-
- `benchmarks/` fixed test sets and evaluation scripts
|
| 54 |
-
- `apps/desktop/` app integration references
|
|
|
|
| 22 |
- Keep work reproducible, transparent, and contribution-friendly.
|
| 23 |
- Focus on public good and broad accessibility.
|
| 24 |
|
| 25 |
+
## Documentation Map
|
| 26 |
+
- Purpose: `PROJECT_PURPOSE.md`
|
| 27 |
+
- Contributing: `CONTRIBUTING.md`
|
| 28 |
+
- Roadmap: `ROADMAP.md`
|
| 29 |
+
- Governance: `GOVERNANCE.md`
|
| 30 |
+
- Community: `community/COMMUNICATION.md`
|
| 31 |
+
- Release process: `docs/release_process.md`
|
| 32 |
+
- Workstreams: `docs/workstreams.md`
|
| 33 |
+
- Verified external resources: `docs/resource_catalog.md`
|
| 34 |
+
|
| 35 |
+
## Verified Resource Catalog
|
| 36 |
+
The project now tracks validated external resources in one place:
|
| 37 |
+
- `docs/resource_catalog.md`
|
| 38 |
+
|
| 39 |
+
This catalog includes:
|
| 40 |
+
- Datasets
|
| 41 |
+
- Models
|
| 42 |
+
- Benchmarks
|
| 43 |
+
- Tools and applications
|
| 44 |
+
- Validation notes and integration guidance
|
| 45 |
+
|
| 46 |
+
## Featured Dataset: Common Voice Pashto
|
| 47 |
+
- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
|
| 48 |
+
- Source: [https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
|
| 49 |
+
- Integration guide: `docs/common_voice_pashto_24.md`
|
| 50 |
|
| 51 |
## Contribute Through Mozilla Common Voice
|
| 52 |
- Speak: [https://commonvoice.mozilla.org/ps/speak](https://commonvoice.mozilla.org/ps/speak)
|
|
|
|
| 54 |
- Listen: [https://commonvoice.mozilla.org/ps/listen](https://commonvoice.mozilla.org/ps/listen)
|
| 55 |
- Review: [https://commonvoice.mozilla.org/ps/review](https://commonvoice.mozilla.org/ps/review)
|
| 56 |
|
| 57 |
+
## 🧩 Workspaces
|
| 58 |
+
- `data/` datasets, curation, metadata, quality
|
| 59 |
+
- `asr/` ASR baselines and experiments
|
| 60 |
+
- `tts/` TTS baselines and experiments
|
| 61 |
+
- `benchmarks/` benchmark sets and evaluation
|
| 62 |
+
- `apps/desktop/` user-facing integration references
|
apps/desktop/README.md
CHANGED
|
@@ -1,3 +1,26 @@
|
|
| 1 |
# 🖥️ Desktop Integration
|
| 2 |
|
| 3 |
Tracks desktop app integration for ASR/TTS/translation pipelines.
|
| 1 |
# 🖥️ Desktop Integration
|
| 2 |
|
| 3 |
Tracks desktop app integration for ASR/TTS/translation pipelines.
|
| 4 |
+
|
| 5 |
+
## ✅ Verified Application Building Blocks
|
| 6 |
+
|
| 7 |
+
### Speech Input: Faster-Whisper
|
| 8 |
+
- Repo: `https://github.com/SYSTRAN/faster-whisper`
|
| 9 |
+
- Use in apps: fast offline/near-real-time transcription components.
|
| 10 |
+
|
| 11 |
+
### Speech Output: Coqui TTS
|
| 12 |
+
- Repo: `https://github.com/coqui-ai/TTS`
|
| 13 |
+
- Use in apps: local speech synthesis modules for Pashto-enabled UX.
|
| 14 |
+
|
| 15 |
+
### Translation Layer: OPUS MT (via multilingual models)
|
| 16 |
+
- Models:
|
| 17 |
+
- `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul`
|
| 18 |
+
- `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en`
|
| 19 |
+
- Pashto validation: language list includes `pus`.
|
| 20 |
+
- Use in apps: Pashto-English assistive translation path for demos.
|
| 21 |
+
|
| 22 |
+
## 🧩 Suggested Desktop Pipeline
|
| 23 |
+
1. Mic input → ASR transcription
|
| 24 |
+
2. Optional translation (Pashto-English)
|
| 25 |
+
3. Optional TTS playback in Pashto
|
| 26 |
+
4. Save logs for QA and benchmark replay
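The four steps above can be sketched as a small pipeline object. This is a minimal illustration, not the project's implementation: `DesktopPipeline`, `PipelineResult`, and the injected `transcribe`/`translate`/`speak` callables are hypothetical names, chosen so that real engines (for example Faster-Whisper or Coqui TTS) could be plugged in later.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    transcript: str
    translation: Optional[str] = None
    spoken: bool = False


@dataclass
class DesktopPipeline:
    # Hypothetical stage hooks; each one wraps a real engine in practice.
    transcribe: Callable[[bytes], str]
    translate: Optional[Callable[[str], str]] = None
    speak: Optional[Callable[[str], None]] = None
    log: List[PipelineResult] = field(default_factory=list)

    def run(self, audio: bytes) -> PipelineResult:
        # Step 1: mic input -> ASR transcription.
        result = PipelineResult(transcript=self.transcribe(audio))
        # Step 2: optional translation of the transcript.
        if self.translate is not None:
            result.translation = self.translate(result.transcript)
        # Step 3: optional TTS playback of the Pashto transcript.
        if self.speak is not None:
            self.speak(result.transcript)
            result.spoken = True
        # Step 4: keep every result for QA and benchmark replay.
        self.log.append(result)
        return result
```

Keeping the stages as plain callables means the same pipeline can be exercised with stub functions in tests and with real models in the desktop app.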
|
asr/README.md
CHANGED
|
@@ -1,3 +1,29 @@
|
|
| 1 |
# ASR Workspace
|
| 2 |
|
| 3 |
Place ASR baselines, training configs, and evaluation scripts here.
|
| 1 |
# ASR Workspace
|
| 2 |
|
| 3 |
Place ASR baselines, training configs, and evaluation scripts here.
|
| 4 |
+
|
| 5 |
+
## ✅ Verified Pashto-Relevant ASR Models
|
| 6 |
+
|
| 7 |
+
### OpenAI Whisper Large v3
|
| 8 |
+
- Model: `https://huggingface.co/openai/whisper-large-v3`
|
| 9 |
+
- Pashto validation: Whisper tokenizer language map includes `"ps": "pashto"`.
|
| 10 |
+
- Use in this repo: strong baseline and pseudo-labeling engine for bootstrapping.
|
| 11 |
+
- Applications: transcription, subtitle generation, dataset pre-labeling.
|
| 12 |
+
|
| 13 |
+
### Meta MMS Coverage (ASR + TTS language support)
|
| 14 |
+
- Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
|
| 15 |
+
- Pashto validation: row includes `pus` with ASR and TTS support.
|
| 16 |
+
- Use in this repo: multilingual transfer baseline when Pashto data is limited.
|
| 17 |
+
- Applications: low-resource ASR transfer experiments.
|
| 18 |
+
|
| 19 |
+
## Verified Inference Tooling
|
| 20 |
+
|
| 21 |
+
### Faster-Whisper
|
| 22 |
+
- Repo: `https://github.com/SYSTRAN/faster-whisper`
|
| 23 |
+
- Why useful: optimized Whisper inference for faster experimentation.
|
| 24 |
+
- Use in this repo: local transcription pipelines and benchmark generation speedups.
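A local transcription call might look roughly like the sketch below. It assumes the `faster-whisper` package is installed and that the `large-v3` weights can be downloaded; `transcribe_pashto` and `join_segments` are hypothetical helper names, and the CPU/`int8` settings are only one possible configuration.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the non-empty `.text` fields of decoded segments."""
    texts = (seg.text.strip() for seg in segments)
    return " ".join(t for t in texts if t)


def transcribe_pashto(audio_path: str, model_size: str = "large-v3") -> str:
    # Imported lazily so join_segments stays usable without the dependency.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 weights to limit memory use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # language="ps" requests Pashto decoding instead of auto-detection.
    segments, _info = model.transcribe(audio_path, language="ps")
    return join_segments(segments)
```

Passing the language explicitly avoids misdetection on short clips; whether that helps on a given dataset is something to confirm in the benchmark notes.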
|
| 25 |
+
|
| 26 |
+
## 🧩 Integration Hints
|
| 27 |
+
- Keep all model/eval runs reproducible with command logs and commit hashes.
|
| 28 |
+
- Store evaluation outputs under `benchmarks/` with model/version labels.
|
| 29 |
+
- Track WER/CER with dataset split and normalization policy references.
|
benchmarks/README.md
CHANGED
|
@@ -1,3 +1,31 @@
|
|
| 1 |
# 🧪 Benchmarks
|
| 2 |
|
| 3 |
Define fixed test sets, metrics, and leaderboard generation scripts.
|
| 1 |
# 🧪 Benchmarks
|
| 2 |
|
| 3 |
Define fixed test sets, metrics, and leaderboard generation scripts.
|
| 4 |
+
|
| 5 |
+
## ✅ Verified Benchmark Sources
|
| 6 |
+
|
| 7 |
+
### FLEURS (Pashto speech benchmark)
|
| 8 |
+
- Dataset: `https://huggingface.co/datasets/google/fleurs`
|
| 9 |
+
- Pashto validation: `fleurs.py` includes `ps_af`.
|
| 10 |
+
- Primary use: multilingual ASR benchmark with fixed split conventions.
|
| 11 |
+
|
| 12 |
+
### Belebele (Pashto reading benchmark)
|
| 13 |
+
- Dataset: `https://huggingface.co/datasets/facebook/belebele`
|
| 14 |
+
- Pashto validation: subset includes `pbt_Arab`.
|
| 15 |
+
- Primary use: comprehension benchmark for multilingual NLP models.
|
| 16 |
+
|
| 17 |
+
### 🗣️ Common Voice Pashto v24
|
| 18 |
+
- Dataset: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
|
| 19 |
+
- Primary use: ASR train/dev/test experiments and project baseline tracking.
|
| 20 |
+
|
| 21 |
+
## Recommended Metrics
|
| 22 |
+
- ASR: `WER`, `CER`
|
| 23 |
+
- TTS: `MCD`/objective proxies + human MOS-style scoring
|
| 24 |
+
- NLP: task-specific accuracy/F1 with fixed test set
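For reference, WER and CER are both edit distances divided by the reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration under stated assumptions: text is already normalized, words are split on whitespace, and CER counts Unicode code points including spaces. The function names are illustrative and not part of this repository.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so reports should state the normalization policy and tokenization used.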
|
| 25 |
+
|
| 26 |
+
## 🧾 Reporting Template
|
| 27 |
+
- Benchmark dataset + version
|
| 28 |
+
- Model + checkpoint version
|
| 29 |
+
- Normalization policy version
|
| 30 |
+
- Metrics and error analysis summary
|
| 31 |
+
- Reproducible command/config reference
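One way to keep reports uniform is to store the template fields as a small record that serializes to JSON. This is a sketch only: `BenchmarkReport` and its field names are hypothetical and simply mirror the bullet points above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class BenchmarkReport:
    dataset: str                # benchmark dataset name
    dataset_version: str        # dataset release or snapshot
    model: str                  # model identifier
    checkpoint: str             # checkpoint or revision
    normalization_policy: str   # normalization policy version
    metrics: Dict[str, float] = field(default_factory=dict)
    error_summary: str = ""     # short error analysis
    command: str = ""           # command/config needed to reproduce

    def to_json(self) -> str:
        # ensure_ascii=False keeps Pashto text readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2, sort_keys=True)
```

A report could then be written next to the evaluation outputs under `benchmarks/`, so every score stays tied to its dataset, checkpoint, and normalization version.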
|
data/README.md
CHANGED
|
@@ -4,23 +4,48 @@
|
|
| 4 |
- `processed/` cleaned/aligned artifacts
|
| 5 |
- `metadata/` manifests, speaker/dialect info, QA reports
|
| 6 |
|
| 7 |
## First Contribution (Normalization Starter)
|
| 8 |
- `processed/normalization_seed_v0.1.tsv` starter normalization examples
|
| 9 |
- `../docs/pashto_normalization_v0.1.md` baseline normalization policy
|
| 10 |
- `../scripts/validate_normalization.py` basic file validator
|
| 11 |
|
| 12 |
-
##
|
| 13 |
-
- Dataset: `Common Voice Scripted Speech 24.0 - Pashto`
|
| 14 |
-
- Source page:
|
| 15 |
-
`https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
|
| 16 |
-
- Local target path: `data/raw/common_voice_scripted_ps_v24/`
|
| 17 |
-
- Integration guide: `../docs/common_voice_pashto_24.md`
|
| 18 |
-
|
| 19 |
-
### Notes
|
| 20 |
-
- Keep raw downloaded dataset files out of git.
|
| 21 |
-
- Track source URL + version in experiment notes for reproducibility.
|
| 22 |
-
|
| 23 |
-
## Validate Seed File
|
| 24 |
```bash
|
| 25 |
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
|
| 26 |
```
|
| 4 |
- `processed/` cleaned/aligned artifacts
|
| 5 |
- `metadata/` manifests, speaker/dialect info, QA reports
|
| 6 |
|
| 7 |
+
## ✅ Verified External Datasets
|
| 8 |
+
|
| 9 |
+
### Common Voice Scripted Speech 24.0 - Pashto
|
| 10 |
+
- Link: `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14`
|
| 11 |
+
- Why useful: a large, open, community-recorded Pashto speech source for ASR training and evaluation.
|
| 12 |
+
- How to use here: download to `data/raw/common_voice_scripted_ps_v24/` and follow `../docs/common_voice_pashto_24.md`.
|
| 13 |
+
|
| 14 |
+
### Google FLEURS (Pashto config)
|
| 15 |
+
- Link: `https://huggingface.co/datasets/google/fleurs`
|
| 16 |
+
- Pashto validation: `fleurs.py` includes `"ps_af"`.
|
| 17 |
+
- Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
|
| 18 |
+
- How to use here: treat as external eval set for `benchmarks/` and avoid training/eval leakage.
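A simple guard against training/eval leakage is to compare normalized transcripts between the training data and the external eval set. The sketch below assumes exact-match comparison after NFC normalization and whitespace collapsing; `find_leakage` is a hypothetical helper, and a real check would apply the project's own normalization policy.

```python
import unicodedata
from typing import Iterable, Set


def _key(text: str) -> str:
    # Stand-in normalization: NFC plus collapsed whitespace.
    return " ".join(unicodedata.normalize("NFC", text).split())


def find_leakage(train_texts: Iterable[str], eval_texts: Iterable[str]) -> Set[str]:
    """Return normalized eval transcripts that also appear in training data."""
    train_keys = {_key(t) for t in train_texts}
    return {_key(t) for t in eval_texts} & train_keys
```

An empty result does not prove the sets are disjoint (paraphrases and shared speakers are not detected), but a non-empty one is a clear signal to fix the split.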
|
| 19 |
+
|
| 20 |
+
### OSCAR Corpus (Pashto web text)
|
| 21 |
+
- Link: `https://huggingface.co/datasets/oscar-corpus/oscar`
|
| 22 |
+
- Pashto validation: dataset includes `unshuffled_deduplicated_ps`.
|
| 23 |
+
- Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
|
| 24 |
+
- How to use here: normalize and sample into `data/processed/` for NLP/ASR language model support.
|
| 25 |
+
|
| 26 |
+
### Wikimedia Wikipedia (Pashto dump)
|
| 27 |
+
- Link: `https://huggingface.co/datasets/wikimedia/wikipedia`
|
| 28 |
+
- Pashto validation: subset includes `20231101.ps`.
|
| 29 |
+
- Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
|
| 30 |
+
- How to use here: include as a high-quality text source in normalization and glossary workflows.
|
| 31 |
+
|
| 32 |
+
### Belebele (reading-comprehension benchmark)
|
| 33 |
+
- Link: `https://huggingface.co/datasets/facebook/belebele`
|
| 34 |
+
- Pashto validation: subset includes `pbt_Arab`.
|
| 35 |
+
- Why useful: downstream benchmark for comprehension-oriented NLP progress in Pashto.
|
| 36 |
+
- How to use here: benchmark multilingual encoders and track improvements in `benchmarks/`.
|
| 37 |
+
|
| 38 |
## First Contribution (Normalization Starter)
|
| 39 |
- `processed/normalization_seed_v0.1.tsv` starter normalization examples
|
| 40 |
- `../docs/pashto_normalization_v0.1.md` baseline normalization policy
|
| 41 |
- `../scripts/validate_normalization.py` basic file validator
|
| 42 |
|
| 43 |
+
## 🧪 Validate Seed File
|
| 44 |
```bash
|
| 45 |
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
|
| 46 |
```
|
| 47 |
+
|
| 48 |
+
## Notes
|
| 49 |
+
- Keep raw downloaded dataset files out of git.
|
| 50 |
+
- Track source URL + version in experiment notes for reproducibility.
|
| 51 |
+
- Re-check external links before every milestone release.
|
docs/platforms.md
CHANGED
|
@@ -1,15 +1,23 @@
|
|
| 1 |
# Platforms
|
| 2 |
|
| 3 |
-
## Primary
|
| 4 |
- GitHub: code, issues, pull requests, releases
|
| 5 |
- Hugging Face Hub: models, datasets, demos
|
| 6 |
- Community chat (Discord/Matrix): contributor coordination
|
| 7 |
|
| 8 |
-
##
|
| 9 |
-
-
|
| 10 |
-
-
|
| 11 |
|
| 12 |
-
##
|
| 13 |
-
-
|
| 14 |
-
-
|
| 15 |
-
| 1 |
# Platforms
|
| 2 |
|
| 3 |
+
## Primary Platforms
|
| 4 |
- GitHub: code, issues, pull requests, releases
|
| 5 |
- Hugging Face Hub: models, datasets, demos
|
| 6 |
- Community chat (Discord/Matrix): contributor coordination
|
| 7 |
|
| 8 |
+
## Resource Discovery and Validation
|
| 9 |
+
- Use `docs/resource_catalog.md` as the single source of truth for validated external resources.
|
| 10 |
+
- Add new links only after checking official pages and explicit Pashto support markers.
|
| 11 |
|
| 12 |
+
## Publishing Expectations
|
| 13 |
+
- Every release links to changelog + benchmark snapshot.
|
| 14 |
+
- Every model links to dataset provenance and eval metrics.
|
| 15 |
+
- Every new external link must include use-case notes and where it belongs in repo structure.
|
| 16 |
+
|
| 17 |
+
## Dual Publish Checklist (GitHub + Hugging Face)
|
| 18 |
+
1. `git status` is clean except intended changes.
|
| 19 |
+
2. Docs and resource links updated.
|
| 20 |
+
3. Commit created with clear scope.
|
| 21 |
+
4. Push to `origin` (GitHub).
|
| 22 |
+
5. Push to `hf` (Hugging Face).
|
| 23 |
+
6. Verify README render and link health on both platforms.
|
docs/resource_catalog.md
ADDED
|
@@ -0,0 +1,58 @@
|
| 1 |
+
# Verified Pashto Resource Catalog
|
| 2 |
+
|
| 3 |
+
Last updated: `2026-02-14`
|
| 4 |
+
|
| 5 |
+
This catalog lists external resources validated for Pashto relevance and possible integration in this repository.
|
| 6 |
+
|
| 7 |
+
## ✅ Validation Method
|
| 8 |
+
- Confirmed the source URL resolves to the official page.
|
| 9 |
+
- Confirmed Pashto support by explicit code/name on the page where available (`ps`, `ps_af`, `pbt_Arab`, `pus`).
|
| 10 |
+
- Added only resources with clear practical use for this repo (data, models, benchmarks, apps).
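The marker check in the second step can be partly automated once the relevant page text (a language list, config file, or model card) has been saved locally. The sketch below is an assumption about how such a check could look; `find_pashto_markers` is a hypothetical helper, and it matches markers only as whole tokens so that `ps` does not match inside longer identifiers.

```python
import re
from typing import Iterable, List

# Language codes named in the validation method above.
PASHTO_MARKERS = ("ps", "ps_af", "pbt_Arab", "pus")


def find_pashto_markers(text: str, markers: Iterable[str] = PASHTO_MARKERS) -> List[str]:
    """Return the markers that occur in `text` as standalone tokens."""
    found = []
    for marker in markers:
        # Reject matches glued to letters, digits, or underscores.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(marker) + r"(?![A-Za-z0-9_])"
        if re.search(pattern, text):
            found.append(marker)
    return found
```

A hit is only supporting evidence; the catalog entry should still quote where on the official page the marker was found.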
|
| 11 |
+
|
| 12 |
+
## Datasets
|
| 13 |
+
|
| 14 |
+
| Resource | Link | Pashto Validation | How to Use Here | Applications |
|
| 15 |
+
|---|---|---|---|---|
|
| 16 |
+
| Common Voice Scripted Speech 24.0 - Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | Official Pashto dataset page | ASR training/eval under `data/raw/common_voice_scripted_ps_v24/` | ASR, pronunciation analysis, data bootstrapping |
|
| 17 |
+
| Google FLEURS | `https://huggingface.co/datasets/google/fleurs` | `fleurs.py` includes `ps_af` | External benchmark split in `benchmarks/` | Multilingual ASR benchmarking |
|
| 18 |
+
| OSCAR Corpus | `https://huggingface.co/datasets/oscar-corpus/oscar` | Includes `unshuffled_deduplicated_ps` | Text LM and normalization support in `data/processed/` | NLP pretraining, lexicon growth |
|
| 19 |
+
| Wikimedia Wikipedia | `https://huggingface.co/datasets/wikimedia/wikipedia` | Includes `20231101.ps` | High-quality text source and terminology checks | NLP, glossary, language modeling |
|
| 20 |
+
| Belebele | `https://huggingface.co/datasets/facebook/belebele` | Includes `pbt_Arab` subset | Comprehension benchmark in `benchmarks/` | Multilingual reading-comprehension eval |
|
| 21 |
+
|
| 22 |
+
## Models
|
| 23 |
+
|
| 24 |
+
| Resource | Link | Pashto Validation | How to Use Here | Applications |
|
| 25 |
+
|---|---|---|---|---|
|
| 26 |
+
| Whisper Large v3 | `https://huggingface.co/openai/whisper-large-v3` | Whisper tokenizer language map contains `"ps": "pashto"` | ASR baseline in `asr/` | Transcription, subtitle generation |
|
| 27 |
+
| MMS language coverage | `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html` | Row for `pus` shows ASR/TTS support | Compare multilingual transfer baselines | Low-resource ASR/TTS transfer |
|
| 28 |
+
| MMS TTS model collection | `https://huggingface.co/facebook/mms-tts` | Official MMS TTS collection aligned with coverage table | Evaluate multilingual Pashto TTS checkpoints in `tts/` | Speech synthesis and assistive audio |
|
| 29 |
+
| NLLB-200 Distilled 600M | `https://huggingface.co/facebook/nllb-200-distilled-600M` | `special_tokens_map.json` contains `pbt_Arab` | Baseline translation experiments under `apps/` and `benchmarks/` | Pashto-centered MT pipelines |
|
| 30 |
+
| OPUS MT en→mul | `https://huggingface.co/Helsinki-NLP/opus-mt-en-mul` | Language list includes `pus` | Pashto translation baseline via multilingual target | Translation in demos and tooling |
|
| 31 |
+
| OPUS MT mul→en | `https://huggingface.co/Helsinki-NLP/opus-mt-mul-en` | Language list includes `pus` | Reverse translation baseline | Translation and bilingual UX |
|
| 32 |
+
|
| 33 |
+
## 🧪 Benchmarks and Evaluation
|
| 34 |
+
|
| 35 |
+
| Resource | Link | Recommended Metric Focus |
|
| 36 |
+
|---|---|---|
|
| 37 |
+
| FLEURS (speech) | `https://huggingface.co/datasets/google/fleurs` | WER, CER |
|
| 38 |
+
| Common Voice Pashto | `https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14` | WER, CER, error buckets |
|
| 39 |
+
| Belebele (reading) | `https://huggingface.co/datasets/facebook/belebele` | Accuracy/F1 |
|
| 40 |
+
|
| 41 |
+
## 🖥️ Applications and Tooling
|
| 42 |
+
|
| 43 |
+
| Resource | Link | How to Use Here |
|
| 44 |
+
|---|---|---|
|
| 45 |
+
| Faster-Whisper | `https://github.com/SYSTRAN/faster-whisper` | Production-style and local ASR inference speedups |
|
| 46 |
+
| Coqui TTS | `https://github.com/coqui-ai/TTS` | Train/fine-tune Pashto TTS and run desktop synthesis |
|
| 47 |
+
|
| 48 |
+
## Research Anchors (for reading and citation)
|
| 49 |
+
- Whisper paper: `https://arxiv.org/abs/2212.04356`
|
| 50 |
+
- MMS paper: `https://arxiv.org/abs/2305.13516`
|
| 51 |
+
- NLLB paper: `https://arxiv.org/abs/2207.04672`
|
| 52 |
+
- FLEURS paper: `https://arxiv.org/abs/2205.12446`
|
| 53 |
+
|
| 54 |
+
## Maintenance Rule
|
| 55 |
+
Before each release, re-open each external link and confirm:
|
| 56 |
+
- Resource still exists.
|
| 57 |
+
- Pashto support marker is unchanged.
|
| 58 |
+
- License/usage terms are still compatible with this project.
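The first check (resource still exists) lends itself to a small script; the other two still need a human reader. The sketch below uses only the standard library, and `extract_urls` and `link_is_alive` are hypothetical helper names. It assumes a `HEAD` request is accepted by the host, which is not true for every server, so a failed check should be confirmed by hand.

```python
import re
import urllib.request
from typing import List

# Stop at whitespace and at Markdown delimiters that commonly follow a URL.
URL_PATTERN = re.compile(r"https?://[^\s`|)\]>]+")


def extract_urls(markdown: str) -> List[str]:
    """Return unique URLs in first-seen order, without trailing punctuation."""
    seen = {}
    for url in URL_PATTERN.findall(markdown):
        seen.setdefault(url.rstrip(".,"), None)
    return list(seen)


def link_is_alive(url: str, timeout: float = 10.0) -> bool:
    """Best-effort reachability check via an HTTP HEAD request."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Running `extract_urls` over `docs/resource_catalog.md` and reporting every URL where `link_is_alive` returns `False` gives a short list to re-open manually before the release.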
|
docs/workstreams.md
CHANGED
|
@@ -2,15 +2,23 @@
|
|
| 2 |
|
| 3 |
## Data
|
| 4 |
- Collection guides, consent, validation, and metadata policy.
|
|
|
|
| 5 |
|
| 6 |
## ASR
|
| 7 |
- Baselines, fine-tuning recipes, and evaluation scripts.
|
|
|
|
| 8 |
|
| 9 |
## TTS
|
| 10 |
- Baselines, speaker/style control, and quality assessment.
|
|
|
|
| 11 |
|
| 12 |
## 🧪 Benchmarks
|
| 13 |
- Fixed test set, metric definitions, and leaderboard process.
|
|
|
|
| 14 |
|
| 15 |
## 🖥️ Applications
|
| 16 |
- Desktop and API integrations for real-user testing.
|
| 2 |
|
| 3 |
## Data
|
| 4 |
- Collection guides, consent, validation, and metadata policy.
|
| 5 |
+
- External dataset references: `data/README.md`
|
| 6 |
|
| 7 |
## ASR
|
| 8 |
- Baselines, fine-tuning recipes, and evaluation scripts.
|
| 9 |
+
- External ASR models/tools: `asr/README.md`
|
| 10 |
|
| 11 |
## TTS
|
| 12 |
- Baselines, speaker/style control, and quality assessment.
|
| 13 |
+
- External TTS models/tools: `tts/README.md`
|
| 14 |
|
| 15 |
## 🧪 Benchmarks
|
| 16 |
- Fixed test set, metric definitions, and leaderboard process.
|
| 17 |
+
- Benchmark resources: `benchmarks/README.md`
|
| 18 |
|
| 19 |
## 🖥️ Applications
|
| 20 |
- Desktop and API integrations for real-user testing.
|
| 21 |
+
- Integration resources: `apps/desktop/README.md`
|
| 22 |
+
|
| 23 |
+
## Master Resource Index
|
| 24 |
+
- Full validated list: `docs/resource_catalog.md`
|
tts/README.md
CHANGED
|
@@ -1,3 +1,26 @@
|
|
| 1 |
# TTS Workspace
|
| 2 |
|
| 3 |
Place TTS baselines, training configs, and quality-evaluation scripts here.
|
| 1 |
# TTS Workspace
|
| 2 |
|
| 3 |
Place TTS baselines, training configs, and quality-evaluation scripts here.
|
| 4 |
+
|
| 5 |
+
## ✅ Verified Pashto-Relevant TTS Resources
|
| 6 |
+
|
| 7 |
+
### Meta MMS Coverage (ASR + TTS language support)
|
| 8 |
+
- Coverage page: `https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html`
|
| 9 |
+
- Pashto validation: row includes `pus` with TTS support.
|
| 10 |
+
- Use in this repo: multilingual transfer baseline for Pashto synthesis.
|
| 11 |
+
- Applications: baseline voice generation, pronunciation checks, accessibility tools.
|
| 12 |
+
|
| 13 |
+
### 🧪 Meta MMS TTS Model Collection
|
| 14 |
+
- Model card: `https://huggingface.co/facebook/mms-tts`
|
| 15 |
+
- Why useful: broad multilingual TTS package with language-specific checkpoints.
|
| 16 |
+
- Use in this repo: evaluate Pashto synthesis quality against curated text prompts.
|
| 17 |
+
|
| 18 |
+
### 🛠️ Coqui TTS Toolkit
|
| 19 |
+
- Repo: `https://github.com/coqui-ai/TTS`
|
| 20 |
+
- Why useful: open training/inference toolkit to fine-tune custom Pashto voices.
|
| 21 |
+
- Use in this repo: reproducible TTS training scripts and quality A/B experiments.
|
| 22 |
+
|
| 23 |
+
## 🧩 Integration Hints
|
| 24 |
+
- Keep text normalization consistent between ASR and TTS experiments.
|
| 25 |
+
- Pair objective metrics with human listening checks in benchmark notes.
|
| 26 |
+
- Document voice style, speaker metadata, and license provenance for every release.
|