# Resource Automation
This repository uses a semi-automated process to keep Pashto resources current while preserving human review.
## Goals
- Discover new Pashto-relevant resources from trusted public endpoints.
- Keep a machine-readable canonical catalog.
- Prevent unreviewed low-confidence resources from directly entering verified lists.
## Covered source types
- Kaggle datasets
- Hugging Face datasets
- Hugging Face models
- Hugging Face Spaces (projects)
- GitHub repositories (projects and code)
- GitLab repositories (projects and code)
- Zenodo records
- Dataverse datasets
- DataCite DOI records
- Research-paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref)
## Files involved
- Canonical verified catalog: [../resources/catalog/resources.json](../resources/catalog/resources.json)
- Candidate feed: [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json)
- Catalog schema: [../resources/schema/resource.schema.json](../resources/schema/resource.schema.json)
- Search export: [search/resources.json](search/resources.json)
## Scripts
- Validate catalog: `python scripts/validate_resource_catalog.py`
- Generate markdown and search index: `python scripts/generate_resource_views.py`
- Sync new candidates: `python scripts/sync_resources.py --limit 20`
- Full run wrapper: `python scripts/run_resource_cycle.py --limit 25`
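
As a rough illustration of how these scripts could be chained, here is a minimal sketch. The order (sync, then validate, then generate) and the stop-on-first-failure behavior are assumptions; the actual `scripts/run_resource_cycle.py` may work differently. The names `build_cycle_commands` and `run_cycle` are hypothetical and do not come from this repository.

```python
import subprocess
import sys


def build_cycle_commands(limit: int = 20) -> list[list[str]]:
    """Return the documented script commands in an ASSUMED order."""
    py = sys.executable  # use the interpreter that is running this sketch
    return [
        [py, "scripts/sync_resources.py", "--limit", str(limit)],
        [py, "scripts/validate_resource_catalog.py"],
        [py, "scripts/generate_resource_views.py"],
    ]


def run_cycle(limit: int = 20, runner=subprocess.run) -> int:
    """Run each command in turn and return the first non-zero exit code.

    `runner` is injectable so the flow can be exercised without
    actually launching the scripts.
    """
    for cmd in build_cycle_commands(limit):
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode  # stop early: later steps depend on earlier ones
    return 0
```

Calling `run_cycle(limit=25)` from the repository root would run the three scripts in sequence under these assumptions.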
## GitHub Actions
- CI (`.github/workflows/ci.yml`) enforces:
  - catalog validation
  - generated file consistency
  - markdown link checks
  - tests
- Resource Sync (`.github/workflows/resource_sync.yml`) runs daily and opens a PR with candidate updates.
## Review flow
1. Inspect candidate entries in `resources/catalog/pending_candidates.json`.
2. Select useful items and move them into `resources/catalog/resources.json`.
3. Set `status` to `verified` only after checking evidence and license.
4. Run:
   - `python scripts/validate_resource_catalog.py`
   - `python scripts/generate_resource_views.py`
5. Commit the changes and open a PR.
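
Steps 2 and 3 of the review flow can be sketched as a small helper. This is only an illustration: it assumes both JSON files hold a list of objects and that each entry carries a unique `id` field, neither of which is confirmed by this document (only the `status` field is). The function name `promote` is hypothetical. It should be called only for entries whose evidence and license a reviewer has already checked.

```python
def promote(candidates, catalog, selected_ids):
    """Move reviewed candidates into the catalog and mark them verified.

    ASSUMPTION: entries are dicts with a unique "id" key.
    Returns (remaining_candidates, new_catalog) without mutating the inputs.
    """
    selected = set(selected_ids)
    existing = {entry["id"] for entry in catalog}
    new_catalog = list(catalog)
    remaining = []
    for cand in candidates:
        if cand["id"] not in selected:
            remaining.append(cand)  # not reviewed yet: stays pending
        elif cand["id"] not in existing:
            # copy the entry so the pending record is left untouched
            new_catalog.append({**cand, "status": "verified"})
        # selected entries already present in the catalog are simply dropped
    return remaining, new_catalog
```

After writing the two lists back to their files, the validation and generation scripts from step 4 would still need to be run.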
## Runbook
- Reusable process guide: [resource_cycle_runbook.md](resource_cycle_runbook.md)