# Pashto Normalization Policy v0.1
This starter policy defines simple, low-risk rules for text cleanup before
training ASR/TTS/NLP baselines.
## Scope
- Applies to sentence-level text in this repository.
- Prioritizes consistency over linguistic completeness.
- Keeps semantic meaning unchanged.
## Rules
1. Trim leading and trailing whitespace.
2. Collapse repeated internal spaces to a single space.
3. Remove zero-width/invisible spacing characters.
4. Remove elongation characters such as tatweel (`ـ`).
5. Use Arabic punctuation consistently in Pashto text:
- comma: `،`
- question mark: `؟`
- semicolon: `؛`
6. Keep sentence-final punctuation as a single character (avoid `!!`, `؟؟`).
7. Normalize quotation usage to one style per sentence (avoid mixed quote styles).
8. Use one digit style (for example, Western `0-9` or Extended Arabic-Indic `۰-۹`) consistently within each dataset split.
9. Preserve original word order and meaning; do not rewrite content.
10. Keep dialect wording as spoken; normalize form, not dialect identity.
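
The mechanical rules above (1 to 6 and 8) could be sketched in Python roughly as follows. This is a minimal illustration, not the repository's validator: the function name `normalize_sentence`, the exact set of invisible characters, and the choice of Extended Arabic-Indic digits as the target style are all assumptions made for the example. Rule 7 (quotes) and rules 9 and 10 (meaning and dialect) need human judgment and are not covered.

```python
import re

# Rule 3 (assumption): the policy does not enumerate the invisible characters,
# so this set is illustrative: ZWSP, ZWNJ, ZWJ, LRM, RLM, word joiner, BOM.
INVISIBLE = "\u200b\u200c\u200d\u200e\u200f\u2060\ufeff"
# Rule 4: tatweel (kashida), U+0640.
TATWEEL = "\u0640"
_STRIP_TABLE = {ord(ch): None for ch in INVISIBLE + TATWEEL}

# Rule 5: map Latin comma, question mark, and semicolon to the Arabic forms.
_PUNCT_TABLE = str.maketrans({",": "\u060c", "?": "\u061f", ";": "\u061b"})

# Rule 8 (assumption): target Extended Arabic-Indic digits (U+06F0..U+06F9),
# converting both ASCII digits and Arabic-Indic digits (U+0660..U+0669).
_DIGIT_TABLE = {}
for i in range(10):
    target = chr(0x06F0 + i)
    _DIGIT_TABLE[ord("0") + i] = target
    _DIGIT_TABLE[0x0660 + i] = target

# Rule 6: a run of the same sentence-final mark ("!!", "؟؟") at end of text.
_FINAL_REPEAT = re.compile("([!\u061f])\\1+$")


def normalize_sentence(text: str) -> str:
    """Apply rules 1-6 and 8 to one sentence (hypothetical helper)."""
    text = text.translate(_STRIP_TABLE)    # rules 3 and 4
    text = text.translate(_PUNCT_TABLE)    # rule 5
    text = text.translate(_DIGIT_TABLE)    # rule 8
    text = re.sub(" {2,}", " ", text)      # rule 2
    text = text.strip()                    # rule 1
    text = _FINAL_REPEAT.sub(r"\1", text)  # rule 6
    return text


if __name__ == "__main__":
    print(normalize_sentence("  \u0633\u0644\u0627\u0645\u0640\u0640   \u0646\u0693\u06cd??  "))
```

The punctuation mapping here is applied blindly, so a comma inside a number or a `?` inside a URL would also be converted; a real pipeline would likely need to guard such cases. Mixed final marks (for example `!` followed by `؟`) are left untouched because the policy does not say which one to keep.
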
## Non-goals (for v0.1)
- No stemming or morphology rules.
- No automatic transliteration.
- No named-entity rewriting.
## File Reference
- Seed examples: `data/processed/normalization_seed_v0.1.tsv`
- Validator: `scripts/validate_normalization.py`