Resource Automation

This repository uses a semi-automated process to keep Pashto resources current while preserving human review.

Goals

Discover new Pashto-relevant resources from trusted public endpoints.
Keep a machine-readable canonical catalog.
Prevent unreviewed low-confidence resources from directly entering verified lists.

Covered source types

Kaggle datasets
Hugging Face datasets
Hugging Face models
Hugging Face Spaces (projects)
GitHub repositories (projects and code)
GitLab repositories (projects and code)
Zenodo records
Dataverse datasets
DataCite DOI records
Research-paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref)

Files involved

Canonical verified catalog: ../resources/catalog/resources.json
Candidate feed: ../resources/catalog/pending_candidates.json
Catalog schema: ../resources/schema/resource.schema.json
Search export: search/resources.json
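
As a rough illustration of what a catalog check can look like, the sketch below scans a list of entries for missing fields and duplicate identifiers. It is not the repository's validator: the field names `id`, `title`, and `url` and the assumption that the catalog's top level is a JSON array are hypothetical, and the authoritative rules live in resource.schema.json.

```python
import json
from pathlib import Path

# Hypothetical required fields; the real list is defined in resource.schema.json.
REQUIRED_FIELDS = ("id", "title", "url", "status")


def find_problems(entries):
    """Return human-readable problems found in a list of catalog entries."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
        entry_id = entry.get("id")
        if entry_id is not None:
            if entry_id in seen_ids:
                problems.append(f"entry {index}: duplicate id {entry_id!r}")
            seen_ids.add(entry_id)
    return problems


def load_entries(path):
    """Read a catalog file, assuming its top level is a JSON array."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `find_problems(load_entries("resources/catalog/resources.json"))` from the repository root would then return an empty list for a clean catalog, under the assumptions above.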

Scripts

Validate catalog: python scripts/validate_resource_catalog.py
Generate markdown and search index: python scripts/generate_resource_views.py
Sync new candidates: python scripts/sync_resources.py --limit 20
Full run wrapper: python scripts/run_resource_cycle.py --limit 25
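
A full-run wrapper can be pictured as the three scripts above run in sequence. The sketch below is only a guess at that shape: the step order (sync, validate, generate) and the helper names `build_cycle_commands` and `run_cycle` are assumptions, not the contents of scripts/run_resource_cycle.py.

```python
import subprocess
import sys


def build_cycle_commands(limit):
    """Return the commands for one cycle, in the assumed order."""
    python = sys.executable
    return [
        [python, "scripts/sync_resources.py", "--limit", str(limit)],
        [python, "scripts/validate_resource_catalog.py"],
        [python, "scripts/generate_resource_views.py"],
    ]


def run_cycle(limit=25, runner=subprocess.run):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in build_cycle_commands(limit):
        runner(command, check=True)
```

Passing `runner` as a parameter keeps the sequencing testable without actually launching the scripts.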

GitHub Actions

CI (.github/workflows/ci.yml) enforces:
- catalog validation
- generated file consistency
- markdown link checks
- tests
Resource Sync (.github/workflows/resource_sync.yml) runs daily and opens a PR with candidate updates.
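
The "generated file consistency" check can be understood as: regenerate the views in memory and fail if any committed file differs. The following is a minimal sketch of that comparison, assuming the generator can produce a mapping from output path to expected text. The function name `stale_files` is hypothetical, and the actual CI job may implement the check differently.

```python
from pathlib import Path


def stale_files(expected):
    """Return paths whose on-disk text differs from freshly generated text.

    `expected` maps a file path to the text the generator would write now.
    A missing file counts as stale.
    """
    stale = []
    for path, text in expected.items():
        target = Path(path)
        if not target.exists() or target.read_text(encoding="utf-8") != text:
            stale.append(str(path))
    return sorted(stale)
```

A CI step built on this idea would exit with a non-zero status whenever the returned list is non-empty.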

Review flow

1. Inspect candidate entries in resources/catalog/pending_candidates.json.
2. Select useful items and move them into resources/catalog/resources.json.
3. Set status to verified only after checking evidence and license.
4. Run:
   - python scripts/validate_resource_catalog.py
   - python scripts/generate_resource_views.py
5. Commit and open a PR.
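
The promotion step in the review flow can be sketched as a small pure function. This is an illustration only: the `id` field, the placeholder status value `"candidate"`, and the two boolean review flags are assumptions made for the example. Only the `verified` status comes from this guide.

```python
def promote(candidate, catalog, evidence_checked, license_checked):
    """Return a new catalog list with the candidate appended.

    The entry is marked "verified" only when a reviewer has checked both
    evidence and license; otherwise it gets a non-verified placeholder.
    """
    if any(entry.get("id") == candidate.get("id") for entry in catalog):
        raise ValueError(f"already in catalog: {candidate.get('id')!r}")
    entry = dict(candidate)  # copy so the pending feed is left untouched
    reviewed = evidence_checked and license_checked
    # "candidate" is a hypothetical status value, not taken from the schema.
    entry["status"] = "verified" if reviewed else "candidate"
    return catalog + [entry]
```

Returning a new list instead of mutating `catalog` makes it easy to validate the result before writing it back to resources.json.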

Runbook

Reusable process guide: resource_cycle_runbook.md