| 99 Hours Pashto Spontaneous Dialogue Smartphone Speech Dataset | huggingface | Dataset title explicitly includes Pashto and API metadata marks audio and text modalities. (Pashto) | Spontaneous speech ASR training and robustness evaluation |
| aamirhs/pashto-audio-wav2vec | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto ASR data exploration and baseline training |
| adnankhan769/proper_dataset_english_2_pashto | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Machine translation and bilingual corpus development |
| AliMuhammad73/Pashto-Poetry | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto poetry corpus for language modeling and text analysis |
| alpaca-pashto-cleaned | huggingface | Dataset metadata includes language:ps and dataset name includes Pashto. (ps, Pashto) | Pashto instruction tuning and conversational NLP experiments |
| Belebele | huggingface | Dataset includes pbt_Arab subset. (pbt_Arab) | Reading comprehension and multilingual NLP benchmark |
| Common Voice 24.0: Pashto Speech Dataset | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | ASR training and evaluation data source |
| Common Voice Scripted Speech 24.0 - Pashto | mozilla | Official dataset page is for Pashto. (Pashto) | ASR training and evaluation |
| English to Pashto Sentences Dataset | huggingface | Dataset ID explicitly states English-to-Pashto, and the data includes a Pashto-script sentence column. (Pashto) | MT and bilingual sentence alignment baseline |
| English-Pashto Language Dataset (EPLD) | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | Machine translation and bilingual corpus development |
| Google FLEURS | huggingface | Dataset config includes ps_af. (ps_af) | Speech benchmark and external evaluation |
| IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY | dataverse | Dataverse metadata includes Pashto markers in dataset title or description. (pashto) | Pashto speech dataset for ASR and language identification experiments |
| ihanif/pashto_asr_wer | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_20k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_5k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_ds | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_parquet_10k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| Katib's Pashto Text Imagebase (KPTI) | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | OCR training and evaluation data source |
| oowais/pushto-text-to-speech-dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | TTS and ASR training and evaluation data source |
| OPUS-100 | huggingface | Dataset viewer includes en-ps split. (en-ps) | Machine translation training and evaluation |
| OSCAR Corpus | huggingface | Dataset includes unshuffled_deduplicated_ps split. (unshuffled_deduplicated_ps) | Language modeling and lexicon expansion |
| Pashto English Bilingual Sentiment Corpus | kaggle | Kaggle dataset title and description identify the corpus as Pashto-English sentiment data. (Pashto) | Sentiment analysis and bilingual NLP experiments |
| Pashto Isolated Words Speech Dataset | kaggle | Dataset title explicitly identifies a Pashto speech dataset. (Pashto) | Keyword spotting and constrained ASR experiments |
| Pashto OCR | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | OCR training and evaluation data source |
| Pashto Wikipedia Corpus | huggingface | Dataset metadata includes language:ps and the title specifies a Pashto corpus. (ps, Pashto) | Pashto text corpus for NLP baselines |
| Pashto Word Embeddings | kaggle | Dataset description identifies pretrained Pashto embeddings. (Pashto) | Lexical semantics and lightweight NLP baselines |
| PashtoOCR (Kaggle) | kaggle | Kaggle dataset title and subtitle explicitly identify a Pashto OCR dataset. (Pashto, OCR) | Pashto OCR dataset benchmarking and training |
| POLD - Pashto Offensive Language Dataset | kaggle | Kaggle title and description explicitly describe a Pashto offensive language benchmark dataset. (Pashto) | Pashto toxicity and moderation NLP benchmarks |
| saillab/alpaca_pashto_taco | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Instruction tuning and LLM adaptation data source |
| SherwinDesouza/pashto-common-voice-20 | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| tasal9/Pashto_Dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto data source for NLP experimentation |
| tasal9/ZamAI_Pashto_Dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto data source for NLP experimentation |
| Urdu-Pashto Lexicon Dataset | kaggle | Kaggle metadata describes 7,601 Urdu entries with Pashto translations. (Pashto) | Lexicon building and Urdu-Pashto lexeme mapping |
| Wikimedia Wikipedia | huggingface | Dataset includes 20231101.ps subset. (20231101.ps) | Terminology and balanced text corpus |
| Zirak-AI PashtoOCR | huggingface | Dataset tags include language:ps and the dataset name is PashtoOCR. (ps, PashtoOCR) | OCR and text extraction benchmarking |
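
Each entry above carries four pipe-delimited cells, and the evidence cell ends with the matched markers in parentheses (for example `(ps_af)` or `(ps, Pashto)`). The following is a minimal sketch of how the catalog could be read programmatically and tallied by source. The field names `name`, `source`, `evidence`, and `use` are assumptions, since the header row is not shown in this section, and `parse_catalog` is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    # Field names are assumptions; the catalog header is not shown here.
    name: str
    source: str
    evidence: str
    markers: list[str]
    use: str


# Trailing "(a, b)" group at the end of the evidence cell.
MARKER_RE = re.compile(r"\(([^()]*)\)\s*$")


def parse_catalog(text: str) -> list[Entry]:
    """Split on pipes and group the non-empty cells into four-cell entries.

    Splitting on "|" rather than on newlines means the parser does not care
    whether a row sits on one line or has its cells spread over several lines.
    """
    cells = [c.strip() for c in text.split("|")]
    cells = [c for c in cells if c]
    if len(cells) % 4:
        raise ValueError(f"expected a multiple of 4 cells, got {len(cells)}")

    entries = []
    for i in range(0, len(cells), 4):
        name, source, evidence, use = cells[i:i + 4]
        match = MARKER_RE.search(evidence)
        markers = [m.strip() for m in match.group(1).split(",")] if match else []
        entries.append(Entry(name, source, evidence, markers, use))
    return entries


# Three rows copied from the catalog above.
SAMPLE = """\
| Google FLEURS | huggingface | Dataset config includes ps_af. (ps_af) | Speech benchmark and external evaluation |
| alpaca-pashto-cleaned | huggingface | Dataset metadata includes language:ps and dataset name includes Pashto. (ps, Pashto) | Pashto instruction tuning and conversational NLP experiments |
| Pashto OCR | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | OCR training and evaluation data source |
"""

entries = parse_catalog(SAMPLE)
by_source = Counter(e.source for e in entries)

for e in entries:
    print(f"{e.name}: source={e.source}, markers={e.markers}")
print(dict(by_source))
```

Entries whose only marker is the lowercase search keyword `pashto` were matched by search alone, so filtering on `markers` is one way to separate them from entries confirmed through a language tag or a named subset such as `ps_af`.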