Resource Automation

This repository uses a semi-automated process to keep Pashto resources current while preserving human review.

Goals

  • Discover new Pashto-relevant resources from trusted public endpoints.
  • Keep a machine-readable canonical catalog.
  • Keep unreviewed or low-confidence resources out of the verified lists until a human has approved them.

Covered source types

  • Kaggle datasets
  • Hugging Face datasets
  • Hugging Face models
  • Hugging Face Spaces (projects)
  • GitHub repositories (projects and code)
  • GitLab repositories (projects and code)
  • Zenodo records
  • Dataverse datasets
  • DataCite DOI records
  • Research-paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref)

Files involved

Scripts

  • Validate catalog: python scripts/validate_resource_catalog.py
  • Generate markdown and search index: python scripts/generate_resource_views.py
  • Sync new candidates: python scripts/sync_resources.py --limit 20
  • Full run wrapper: python scripts/run_resource_cycle.py --limit 25
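
The commands above can also be chained from a single Python process. The following is a minimal sketch, not the contents of scripts/run_resource_cycle.py: the step order (sync, then validate, then generate) and the helper names build_cycle_commands and run_cycle are assumptions made for illustration. Only the script paths and the --limit flag come from the list above.

```python
import subprocess
import sys


def build_cycle_commands(limit: int = 25) -> list[list[str]]:
    """Return the argv lists for one cycle.

    The order (sync, validate, generate) is an assumption, not taken
    from the repository's wrapper script.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    python = sys.executable  # reuse the interpreter running this code
    return [
        [python, "scripts/sync_resources.py", "--limit", str(limit)],
        [python, "scripts/validate_resource_catalog.py"],
        [python, "scripts/generate_resource_views.py"],
    ]


def run_cycle(limit: int = 25) -> None:
    """Run each step in turn, stopping at the first failure."""
    for argv in build_cycle_commands(limit):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)
```

Calling run_cycle(limit=20) from the repository root would run the three scripts in sequence. Keeping the command list in its own function makes the sequence easy to inspect without executing anything.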

GitHub Actions

  • CI (.github/workflows/ci.yml) enforces:
    • catalog validation
    • generated file consistency
    • markdown link checks
    • tests
  • Resource Sync (.github/workflows/resource_sync.yml) runs daily and opens a PR with candidate updates.
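
A "generated file consistency" check usually means re-rendering the generated files and failing if the committed copies differ. The sketch below shows that idea only and is not taken from ci.yml. The function name find_stale_files and the shape of its input (a mapping from relative path to freshly generated text) are hypothetical.

```python
from pathlib import Path


def find_stale_files(expected: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose on-disk content differs from fresh output.

    `expected` maps a relative path to the text the generator would
    write now (hypothetical input shape). A missing file counts as stale.
    """
    stale = []
    for rel_path, content in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file() or target.read_text(encoding="utf-8") != content:
            stale.append(rel_path)
    return stale
```

A CI step built on this would exit with a non-zero status whenever the returned list is non-empty, prompting the contributor to re-run python scripts/generate_resource_views.py and commit the result.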

Review flow

  1. Inspect candidate entries in resources/catalog/pending_candidates.json.
  2. Select useful items and move them into resources/catalog/resources.json.
  3. Set status to verified only after checking evidence and license.
  4. Run:
    • python scripts/validate_resource_catalog.py
    • python scripts/generate_resource_views.py
  5. Commit the changes and open a PR.
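
Steps 2 and 3 above can be sketched as a small helper. This is an illustration under stated assumptions, not the repository's tooling: it assumes both JSON files hold a list of objects with an "id" field, and it uses a hypothetical "candidate" status for entries that have not passed both checks. Only the status field and the verified value come from the steps above.

```python
import copy


def promote_candidate(pending, catalog, candidate_id, *,
                      evidence_checked, license_checked):
    """Move one candidate from the pending list into the catalog list.

    Returns new (pending, catalog) lists; the inputs are left unchanged.
    The "id" field and the "candidate" status are assumptions.
    """
    matches = [c for c in pending if c.get("id") == candidate_id]
    if not matches:
        raise KeyError(f"no pending candidate with id {candidate_id!r}")
    if any(r.get("id") == candidate_id for r in catalog):
        raise ValueError(f"{candidate_id!r} is already in the catalog")

    entry = copy.deepcopy(matches[0])
    # Mark as verified only when both evidence and license were checked.
    both_checked = evidence_checked and license_checked
    entry["status"] = "verified" if both_checked else "candidate"

    remaining = [c for c in pending if c.get("id") != candidate_id]
    return remaining, catalog + [entry]
```

After loading both files with json.load, calling promote_candidate and writing the results back with json.dump would leave the catalog ready for the validation and generation commands in step 4.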

Runbook