# Datasets

## Verified Pashto Resources

| Resource | Link | Pashto Evidence | Primary Use |
|---|---|---|---|
| 99 Hours Pashto Spontaneous Dialogue Smartphone Speech Dataset | [huggingface](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | [Dataset title explicitly includes Pashto and API metadata marks audio and text modalities. (`Pashto`)](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | Spontaneous speech ASR training and robustness evaluation |
| aamirhs/pashto-audio-wav2vec | [huggingface](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | Pashto ASR data exploration and baseline training |
| adnankhan769/proper_dataset_english_2_pashto | [huggingface](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | Machine translation and bilingual corpus development |
| AliMuhammad73/Pashto-Poetry | [huggingface](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | Pashto poetry corpus for language modeling and text analysis |
| alpaca-pashto-cleaned | [huggingface](https://huggingface.co/datasets/saillab/alpaca-pashto-cleaned) | [Dataset metadata includes language:ps and dataset name includes Pashto. (`ps`, `Pashto`)](https://huggingface.co/api/datasets/saillab/alpaca-pashto-cleaned) | Pashto instruction tuning and conversational NLP experiments |
| Belebele | [huggingface](https://huggingface.co/datasets/facebook/belebele) | [Dataset includes pbt_Arab subset. (`pbt_Arab`)](https://huggingface.co/datasets/facebook/belebele) | Comprehension and multilingual NLP benchmark |
| Common Voice 24.0: Pashto Speech Dataset | [kaggle](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | ASR training and evaluation data source |
| Common Voice Scripted Speech 24.0 - Pashto | [mozilla](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | [Official dataset page is for Pashto. (`Pashto`)](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | ASR training and evaluation |
| English to Pashto Sentences Dataset | [huggingface](https://huggingface.co/datasets/adnankhan769/english_to_pashto_sentences_dataset) | [Dataset ID explicitly states English-to-Pashto and includes Pashto-script sentence column. (`Pashto`)](https://huggingface.co/api/datasets/adnankhan769/english_to_pashto_sentences_dataset) | MT and bilingual sentence alignment baseline |
| English-Pashto Language Dataset (EPLD) | [kaggle](https://www.kaggle.com/datasets/rabiakhan827/english-pashto-language-dataset-epld) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/rabiakhan827/english-pashto-language-dataset-epld) | Machine translation and bilingual corpus development |
| Google FLEURS | [huggingface](https://huggingface.co/datasets/google/fleurs) | [Dataset config includes ps_af. (`ps_af`)](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | Speech benchmark and external evaluation |
| IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY | [dataverse](https://hdl.handle.net/11272.1/AB2/GLFN3X) | [Dataverse metadata includes Pashto markers in dataset title or description. (`pashto`)](https://hdl.handle.net/11272.1/AB2/GLFN3X) | Pashto speech dataset for ASR and language identification experiments |
| ihanif/pashto_asr_wer | [huggingface](https://huggingface.co/datasets/ihanif/pashto_asr_wer) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/ihanif/pashto_asr_wer) | ASR training and evaluation data source |
| ihanif/pashto_speech_20k | [huggingface](https://huggingface.co/datasets/ihanif/pashto_speech_20k) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/ihanif/pashto_speech_20k) | ASR training and evaluation data source |
| ihanif/pashto_speech_5k | [huggingface](https://huggingface.co/datasets/ihanif/pashto_speech_5k) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/ihanif/pashto_speech_5k) | ASR training and evaluation data source |
| ihanif/pashto_speech_ds | [huggingface](https://huggingface.co/datasets/ihanif/pashto_speech_ds) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/ihanif/pashto_speech_ds) | ASR training and evaluation data source |
| ihanif/pashto_speech_parquet_10k | [huggingface](https://huggingface.co/datasets/ihanif/pashto_speech_parquet_10k) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/ihanif/pashto_speech_parquet_10k) | ASR training and evaluation data source |
| Katib's Pashto Text Imagebase (KPTI) | [kaggle](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | OCR training and evaluation data source |
| oowais/pushto-text-to-speech-dataset | [huggingface](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | Text-to-speech training and evaluation data source |
| OPUS-100 | [huggingface](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes en-ps split. (`en-ps`)](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Machine translation training and evaluation |
| OSCAR Corpus | [huggingface](https://huggingface.co/datasets/oscar-corpus/oscar) | [Dataset includes unshuffled_deduplicated_ps split. (`unshuffled_deduplicated_ps`)](https://huggingface.co/datasets/oscar-corpus/oscar) | Language modeling and lexicon expansion |
| Pashto English Bilingual Sentiment Corpus | [kaggle](https://www.kaggle.com/datasets/farhadkhan66/pashto-translated-corpus) | [Kaggle dataset title and description identify the corpus as Pashto-English sentiment data. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/farhadkhan66/pashto-translated-corpus) | Sentiment analysis and bilingual NLP experiments |
| Pashto Isolated Words Speech Dataset | [kaggle](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | [Dataset title explicitly states Pashto speech dataset. (`Pashto`)](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Keyword spotting and constrained ASR experiments |
| Pashto OCR | [kaggle](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | OCR training and evaluation data source |
| Pashto Wikipedia Corpus | [huggingface](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | [Dataset metadata includes language:ps and the title specifies Pashto corpus. (`ps`, `Pashto`)](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | Pashto text corpus for NLP baselines |
| Pashto Word Embeddings | [kaggle](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | [Dataset description states pretrained Pashto embeddings. (`Pashto`)](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Lexical semantics and lightweight NLP baselines |
| PashtoOCR (Kaggle) | [kaggle](https://www.kaggle.com/datasets/drijaz/pashtoocr) | [Kaggle dataset title and subtitle explicitly identify a Pashto OCR dataset. (`Pashto`, `OCR`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pashtoocr) | Pashto OCR dataset benchmarking and training |
| POLD - Pashto Offensive Language Dataset | [kaggle](https://www.kaggle.com/datasets/drijaz/pold-pashto-offensive-language-dataset) | [Kaggle title and description explicitly state Pashto offensive language benchmark dataset. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pold-pashto-offensive-language-dataset) | Pashto toxicity and moderation NLP benchmarks |
| saillab/alpaca_pashto_taco | [huggingface](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | Instruction tuning and LLM adaptation data source |
| SherwinDesouza/pashto-common-voice-20 | [huggingface](https://huggingface.co/datasets/SherwinDesouza/pashto-common-voice-20) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/SherwinDesouza/pashto-common-voice-20) | Pashto data source for NLP experimentation |
| tasal9/Pashto_Dataset | [huggingface](https://huggingface.co/datasets/tasal9/Pashto_Dataset) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/tasal9/Pashto_Dataset) | Pashto data source for NLP experimentation |
| tasal9/ZamAI_Pashto_Dataset | [huggingface](https://huggingface.co/datasets/tasal9/ZamAI_Pashto_Dataset) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/tasal9/ZamAI_Pashto_Dataset) | Pashto data source for NLP experimentation |
| Urdu-Pashto Lexicon Dataset | [kaggle](https://www.kaggle.com/datasets/shafeeqgigyani/urdu-pashto-lexicon-dataset) | [Kaggle metadata describes 7,601 Urdu entries with Pashto translations. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/shafeeqgigyani/urdu-pashto-lexicon-dataset) | Lexicon and translation lexeme mapping |
| Wikimedia Wikipedia | [huggingface](https://huggingface.co/datasets/wikimedia/wikipedia) | [Dataset includes 20231101.ps subset. (`20231101.ps`)](https://huggingface.co/datasets/wikimedia/wikipedia) | Terminology and balanced text corpus |
| Zirak-AI PashtoOCR | [huggingface](https://huggingface.co/datasets/zirak-ai/PashtoOCR) | [Dataset tags include language:ps and the dataset name is PashtoOCR. (`ps`, `PashtoOCR`)](https://huggingface.co/datasets/zirak-ai/PashtoOCR) | OCR and text extraction benchmarking |
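
Several of the multilingual entries above only contain Pashto under a specific subset or config name. The sketch below collects those names from the "Pashto Evidence" column and shows one possible way to stream a first example with the Hugging Face `datasets` library. The helper names `PASHTO_CONFIGS`, `pashto_load_args`, and `stream_first_example` are illustrative, not part of this repository, and the `train` split is an assumption that may not hold for every dataset.

```python
# Config names copied from the "Pashto Evidence" column of the table above.
PASHTO_CONFIGS = {
    "facebook/belebele": "pbt_Arab",
    "google/fleurs": "ps_af",
    "Helsinki-NLP/opus-100": "en-ps",
    "oscar-corpus/oscar": "unshuffled_deduplicated_ps",
    "wikimedia/wikipedia": "20231101.ps",
}


def pashto_load_args(repo_id: str) -> tuple[str, str]:
    """Return (repo_id, config_name) for a multilingual dataset in the table."""
    if repo_id not in PASHTO_CONFIGS:
        raise KeyError(f"no Pashto config recorded for {repo_id!r}")
    return repo_id, PASHTO_CONFIGS[repo_id]


def stream_first_example(repo_id: str, split: str = "train"):
    """Stream one example from the Pashto config (hypothetical helper).

    Assumes `pip install datasets`, network access, and that the chosen
    split exists. Datasets that rely on loading scripts or gated access
    may need extra arguments or an older `datasets` release.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    repo, config = pashto_load_args(repo_id)
    stream = load_dataset(repo, config, split=split, streaming=True)
    return next(iter(stream))


# Example (not executed here, needs network access):
#   stream_first_example("Helsinki-NLP/opus-100")
```

Keeping the config names in one mapping means a renamed subset only has to be corrected in a single place. Single-language repositories in the table have no such config and can usually be loaded by repository ID alone.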

## Maintenance
- Source of truth: [../catalog/resources.json](../catalog/resources.json)
- Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
- Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)