Datasets

Verified Pashto Resources

| Resource | Link | Pashto Evidence | Primary Use |
| --- | --- | --- | --- |
| 99 Hours Pashto Spontaneous Dialogue Smartphone Speech Dataset | huggingface | Dataset title explicitly includes Pashto and API metadata marks audio and text modalities. (Pashto) | Spontaneous speech ASR training and robustness evaluation |
| aamirhs/pashto-audio-wav2vec | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto ASR data exploration and baseline training |
| adnankhan769/proper_dataset_english_2_pashto | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Machine translation and bilingual corpus development |
| AliMuhammad73/Pashto-Poetry | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto poetry corpus for language modeling and text analysis |
| alpaca-pashto-cleaned | huggingface | Dataset metadata includes language:ps and dataset name includes Pashto. (ps, Pashto) | Pashto instruction tuning and conversational NLP experiments |
| Belebele | huggingface | Dataset includes pbt_Arab subset. (pbt_Arab) | Comprehension and multilingual NLP benchmark |
| Common Voice 24.0: Pashto Speech Dataset | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | ASR training and evaluation data source |
| Common Voice Scripted Speech 24.0 - Pashto | mozilla | Official dataset page is for Pashto. (Pashto) | ASR training and evaluation |
| English to Pashto Sentences Dataset | huggingface | Dataset ID explicitly states English-to-Pashto and includes Pashto-script sentence column. (Pashto) | MT and bilingual sentence alignment baseline |
| English-Pashto Language Dataset (EPLD) | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | Machine translation and bilingual corpus development |
| Google FLEURS | huggingface | Dataset config includes ps_af. (ps_af) | Speech benchmark and external evaluation |
| IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY | dataverse | Dataverse metadata includes Pashto markers in dataset title or description. (pashto) | Pashto speech dataset for ASR and language identification experiments |
| ihanif/pashto_asr_wer | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_20k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_5k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_ds | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| ihanif/pashto_speech_parquet_10k | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| Katib's Pashto Text Imagebase (KPTI) | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | OCR training and evaluation data source |
| oowais/pushto-text-to-speech-dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | ASR training and evaluation data source |
| OPUS-100 | huggingface | Dataset viewer includes en-ps split. (en-ps) | Machine translation training and evaluation |
| OSCAR Corpus | huggingface | Dataset includes unshuffled_deduplicated_ps split. (unshuffled_deduplicated_ps) | Language modeling and lexicon expansion |
| Pashto English Bilingual Sentiment Corpus | kaggle | Kaggle dataset title and description identify the corpus as Pashto-English sentiment data. (Pashto) | Sentiment analysis and bilingual NLP experiments |
| Pashto Isolated Words Speech Dataset | kaggle | Dataset title explicitly states Pashto speech dataset. (Pashto) | Keyword spotting and constrained ASR experiments |
| Pashto OCR | kaggle | Kaggle dataset title/subtitle includes Pashto keyword. (Pashto) | OCR training and evaluation data source |
| Pashto Wikipedia Corpus | huggingface | Dataset metadata includes language:ps and the title specifies Pashto corpus. (ps, Pashto) | Pashto text corpus for NLP baselines |
| Pashto Word Embeddings | kaggle | Dataset description states pretrained Pashto embeddings. (Pashto) | Lexical semantics and lightweight NLP baselines |
| PashtoOCR (Kaggle) | kaggle | Kaggle dataset title and subtitle explicitly identify a Pashto OCR dataset. (Pashto, OCR) | Pashto OCR dataset benchmarking and training |
| POLD - Pashto Offensive Language Dataset | kaggle | Kaggle title and description explicitly state Pashto offensive language benchmark dataset. (Pashto) | Pashto toxicity and moderation NLP benchmarks |
| saillab/alpaca_pashto_taco | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Instruction tuning and LLM adaptation data source |
| SherwinDesouza/pashto-common-voice-20 | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto data source for NLP experimentation |
| tasal9/Pashto_Dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto data source for NLP experimentation |
| tasal9/ZamAI_Pashto_Dataset | huggingface | Matched by Pashto keyword in Hugging Face search results. (pashto) | Pashto data source for NLP experimentation |
| Urdu-Pashto Lexicon Dataset | kaggle | Kaggle metadata describes 7,601 Urdu entries with Pashto translations. (Pashto) | Lexicon and translation lexeme mapping |
| Wikimedia Wikipedia | huggingface | Dataset includes 20231101.ps subset. (20231101.ps) | Terminology and balanced text corpus |
| Zirak-AI PashtoOCR | huggingface | Dataset tags include language:ps and the dataset name is PashtoOCR. (ps, PashtoOCR) | OCR and text extraction benchmarking |
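
Several of the multilingual entries above are identified only by a subset marker (for example `ps_af` or `en-ps`). As a minimal sketch of how such a marker might be used, the snippet below passes it as the config name to `load_dataset` from the Hugging Face `datasets` library. The repository IDs are assumptions, since the table gives only dataset titles, and the `PASHTO_SUBSETS` mapping and `load_pashto_subset` helper are hypothetical names introduced for illustration. Verify each ID on the Hub before relying on it.

```python
from typing import NamedTuple


class HubSubset(NamedTuple):
    repo_id: str  # ASSUMED Hub repository ID (not stated in the table)
    config: str   # subset marker copied from the Pashto Evidence column


# Markers come from the table; repo IDs are assumptions to verify on the Hub.
PASHTO_SUBSETS = {
    "Belebele": HubSubset("facebook/belebele", "pbt_Arab"),
    "Google FLEURS": HubSubset("google/fleurs", "ps_af"),
    "OPUS-100": HubSubset("Helsinki-NLP/opus-100", "en-ps"),
    "Wikimedia Wikipedia": HubSubset("wikimedia/wikipedia", "20231101.ps"),
}


def load_pashto_subset(resource: str, **kwargs):
    """Load the Pashto subset of a multilingual dataset named in the table.

    Extra keyword arguments (e.g. split=..., streaming=True) are forwarded
    to datasets.load_dataset unchanged.
    """
    # Imported lazily so the mapping can be inspected without the library.
    from datasets import load_dataset  # requires `pip install datasets`

    subset = PASHTO_SUBSETS[resource]
    return load_dataset(subset.repo_id, subset.config, **kwargs)
```

A call such as `load_pashto_subset("OPUS-100", streaming=True)` would then fetch only the Pashto portion. Available splits, and whether a given dataset loads at all, can depend on the installed `datasets` version, so treat this as a starting point rather than a tested recipe.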

Maintenance