Resource Automation
This repository uses a semi-automated process to keep Pashto resources current while preserving human review.
Goals
- Discover new Pashto-relevant resources from trusted public endpoints.
- Keep a machine-readable canonical catalog.
- Prevent unreviewed low-confidence resources from directly entering verified lists.
Covered source types
- Kaggle datasets
- Hugging Face datasets
- Hugging Face models
- Hugging Face Spaces (projects)
- GitHub repositories (projects and code)
- GitLab repositories (projects and code)
- Zenodo records
- Dataverse datasets
- DataCite DOI records
- Research-paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref)
Files involved
- Canonical verified catalog: ../resources/catalog/resources.json
- Candidate feed: ../resources/catalog/pending_candidates.json
- Catalog schema: ../resources/schema/resource.schema.json
- Search export: search/resources.json
Scripts
- Validate catalog:
  python scripts/validate_resource_catalog.py
- Generate markdown and search index:
  python scripts/generate_resource_views.py
- Sync new candidates:
  python scripts/sync_resources.py --limit 20
- Full run wrapper:
  python scripts/run_resource_cycle.py --limit 25
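As a rough illustration of what a full-run wrapper can look like, the sketch below chains the three individual scripts with `subprocess`. It is not the contents of `run_resource_cycle.py`: the step order (sync, then validate, then generate), the forwarding of `--limit` to the sync step, and the function names `build_cycle_commands` and `run_cycle` are all assumptions made for this example.

```python
import subprocess
import sys


def build_cycle_commands(limit: int) -> list[list[str]]:
    """Return the commands for one cycle, in an assumed order."""
    python = sys.executable  # reuse the interpreter running this sketch
    return [
        # Assumption: the wrapper forwards --limit to the sync step only.
        [python, "scripts/sync_resources.py", "--limit", str(limit)],
        [python, "scripts/validate_resource_catalog.py"],
        [python, "scripts/generate_resource_views.py"],
    ]


def run_cycle(limit: int) -> None:
    """Run each step in turn and stop at the first failure."""
    for command in build_cycle_commands(limit):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a failed earlier step.
        subprocess.run(command, check=True)
```

Calling `run_cycle(25)` from the repository root would then run the three scripts in sequence. Building the command list separately from running it keeps the order easy to inspect and test.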
GitHub Actions
- CI (.github/workflows/ci.yml) enforces:
  - catalog validation
  - generated file consistency
  - markdown link checks
  - tests
- Resource Sync (.github/workflows/resource_sync.yml) runs daily and opens a PR with candidate updates.
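One common way to implement a "generated file consistency" check is to regenerate the output in memory and compare it with what is committed. The sketch below shows that idea only; how the CI workflow here actually performs the check is not documented in this file, and `find_stale_files` and its `expected` mapping are hypothetical names.

```python
from pathlib import Path


def find_stale_files(expected: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose on-disk content differs from `expected`.

    `expected` maps a relative path to the content a generator would
    produce for it (hypothetical structure for this sketch).
    """
    stale = []
    for relative_path, content in expected.items():
        target = root / relative_path
        # A missing file counts as stale, the same as outdated content.
        if not target.is_file() or target.read_text(encoding="utf-8") != content:
            stale.append(relative_path)
    return sorted(stale)
```

A CI step built on this would fail the job whenever the returned list is non-empty, which signals that the generator was not re-run after the catalog changed.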
Review flow
- Inspect candidate entries in resources/catalog/pending_candidates.json.
- Select useful items and move them into resources/catalog/resources.json.
- Set status to verified only after checking evidence and license.
- Run:
  python scripts/validate_resource_catalog.py
  python scripts/generate_resource_views.py
- Commit and open PR.
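The promotion step in the review flow can be sketched as a small helper that works on already-loaded entries. This is an illustration under stated assumptions, not a script shipped in this repository: the function name `promote_candidate` and the `id` key are hypothetical, and the real field names are defined by the catalog schema. Only the `status` field and its `verified` value come from the steps above.

```python
import copy


def promote_candidate(candidates, catalog, candidate_id, *,
                      evidence_checked, license_checked):
    """Move one reviewed candidate into the catalog and mark it verified.

    Returns new (candidates, catalog) lists; the inputs are not modified.
    """
    # Refuse to verify anything the reviewer has not actually checked.
    if not (evidence_checked and license_checked):
        raise ValueError("check evidence and license before verifying")

    # Assumption: entries carry a unique "id" key (hypothetical name).
    matches = [e for e in candidates if e.get("id") == candidate_id]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one candidate with id {candidate_id!r}")
    if any(e.get("id") == candidate_id for e in catalog):
        raise ValueError(f"{candidate_id!r} is already in the catalog")

    promoted = copy.deepcopy(matches[0])
    promoted["status"] = "verified"
    remaining = [e for e in candidates if e.get("id") != candidate_id]
    return remaining, catalog + [promoted]
```

After writing the two returned lists back to their JSON files, the validation and generation scripts still need to be run as described in the steps above.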
Runbook
- Reusable process guide: resource_cycle_runbook.md