# Task 1 Dev Leaderboard

Local tooling for a Task 1 style development leaderboard.

Important policy note:

- The FinMMEval Hugging Face collection is the official public data release for Task 1. Participants may use those released datasets for training and may reorganize or re-split them as needed.
- The dev sets served by this leaderboard are separate organizer-held evaluation data. They are intended for validation only and should not be added back into training.
- The remaining hidden test sets are reserved for final evaluation.

The goal is to make it easy to:

- register one or more MCQ datasets
- register one or more prediction files from local model runs
- compute overall and subgroup accuracy
- export a lightweight local leaderboard
- deploy the Task 1 portal either locally or on Hugging Face Spaces

## Current Scope

This scaffold is deliberately conservative:

- it assumes Task 1 style multiple-choice data
- it does not require confidence scores
- it uses only the Python standard library

The included starter config uses a local Arabic MCQ set from `SAHM_private`:

- dataset: `SAHM_private/data/islamic_finance/fatwa/fatwa_mcq.jsonl`
- runs: `SAHM_private/runs/*_mcq/per_item_mcq.csv`

The loader also supports local parquet MCQ datasets, which is useful for private Hugging Face data drops such as `Raniahossam33/accounting_CLEF`.

There is also a ready-to-edit template at `task1_dev_leaderboard/configs/accounting_clef_template.json`.

## Quick Start

From the repo root:

```bash
python task1_dev_leaderboard/build_leaderboard.py \
  --config task1_dev_leaderboard/configs/sahm_local.json
```

Outputs go to:

```bash
task1_dev_leaderboard/outputs/sahm_local/
```

Main files:

- `leaderboard_overall.csv`
- `leaderboard_by_dataset.csv`
- `leaderboard_by_category.csv`
- `per_item_results.csv`
- `README.md`

## Released Dev Sets

At present, the English, Chinese, Arabic, and Hindi Task 1 dev leaderboards are public.

Public files:

- `task1_dev_leaderboard/dev_sets/english_task1_final_public.jsonl`
- `task1_dev_leaderboard/dev_sets/english_task1_final_submission_template.json`
- `task1_dev_leaderboard/dev_sets/chinese_task1_final_public.jsonl`
- `task1_dev_leaderboard/dev_sets/chinese_task1_final_submission_template.json`
- `task1_dev_leaderboard/dev_sets/arabic_task1_final_public.jsonl`
- `task1_dev_leaderboard/dev_sets/arabic_task1_final_submission_template.json`
- `task1_dev_leaderboard/dev_sets/hindi_mcq_100_public.jsonl`
- `task1_dev_leaderboard/dev_sets/hindi_mcq_100_submission_template.json`

Organizer-private files:

- `task1_dev_leaderboard/private/english_task1_final_gold.jsonl`
- `task1_dev_leaderboard/private/english_task1_final_hidden_test.jsonl`
- `task1_dev_leaderboard/private/chinese_task1_final_gold.jsonl`
- `task1_dev_leaderboard/private/chinese_task1_final_hidden_test.jsonl`
- `task1_dev_leaderboard/private/arabic_task1_final_gold.jsonl`
- `task1_dev_leaderboard/private/arabic_task1_final_hidden_test.jsonl`
- `task1_dev_leaderboard/private/hindi_mcq_100_gold.jsonl`
- `task1_dev_leaderboard/private/hindi_mcq_hidden_test.jsonl`

Sampling policy:

- deterministic
- seed = `2026`
- English: 70 public dev examples and 200 hidden test examples
- Chinese: 71 public dev examples and 200 hidden test examples
- Arabic: 97 public dev examples and 200 hidden test examples after conservative question-level de-duplication
- Hindi: 100 public dev examples and 200 hidden test examples

### Participant Submission Format

Participants can submit either `.json` or `.jsonl`.

For public deployments, organizers may require a private team submission code. When enabled, the code determines the canonical team slug and display name used on the leaderboard.

Minimal per-item schema:

```json
{
  "id": "ar-task1-final-dev-001",
  "prediction": "B"
}
```

Accepted prediction field names:

- `prediction`
- `pred_letter`
- `answer`
- `label`

For `.json`, the evaluator accepts:

- a list of objects
- an object with a `predictions` list
- a flat mapping from `id -> prediction`

### Organizer Evaluation Workflow

Put participant files under:

```bash
task1_dev_leaderboard/submissions/
```

Run:

```bash
python task1_dev_leaderboard/evaluate_submissions.py \
  --gold-file task1_dev_leaderboard/private/arabic_task1_final_gold.jsonl \
  --submissions-dir task1_dev_leaderboard/submissions \
  --out-dir task1_dev_leaderboard/outputs/arabic_task1_final_dev
```

This writes:

- `leaderboard_overall.csv`
- `per_item_results.csv`
- one `*__validation.json` per submission
- a rendered `README.md`

`leaderboard_by_source.csv` is only created for older gold files that include a non-empty `source` field.

### Auto-Refresh Mode

If you want organizer-side near-real-time leaderboard refresh, run the watcher:

```bash
python task1_dev_leaderboard/watch_submissions.py \
  --gold-file task1_dev_leaderboard/private/arabic_task1_final_gold.jsonl \
  --submissions-dir task1_dev_leaderboard/submissions \
  --out-dir task1_dev_leaderboard/outputs/arabic_task1_final_dev \
  --run-on-start
```

Behavior:

- watches `submissions/` for new or updated `.json` / `.jsonl` files
- waits briefly for files to finish writing
- reruns evaluation automatically
- refreshes leaderboard outputs in place
- writes `watch_status.json` with watcher state and last run result

This gives you organizer-side automatic ranking updates. If you later want a public live page, you can point a static site, Hugging Face Space, or dashboard script at the files in `task1_dev_leaderboard/outputs/arabic_task1_final_dev/`.

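
To make the watcher behavior more concrete, here is a minimal polling sketch using only the standard library. It is not the code in `watch_submissions.py`. The function name `find_settled_changes`, the `settle_seconds` parameter, and the modification-time comparison are all assumptions made for illustration.

```python
import time
from pathlib import Path

# File types the watcher is described as picking up.
SUBMISSION_SUFFIXES = {".json", ".jsonl"}


def find_settled_changes(submissions_dir, seen_mtimes, settle_seconds=2.0, now=None):
    """Return submission files that are new or updated and have stopped changing.

    `seen_mtimes` maps each path to the modification time that was last
    processed; it is updated in place. All names here are hypothetical.
    """
    now = time.time() if now is None else now
    changed = []
    for path in sorted(Path(submissions_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUBMISSION_SUFFIXES:
            continue
        mtime = path.stat().st_mtime
        if now - mtime < settle_seconds:
            # Modified very recently: possibly still being written, so
            # leave it for the next poll.
            continue
        if seen_mtimes.get(path) != mtime:
            seen_mtimes[path] = mtime
            changed.append(path)
    return changed
```

A polling loop would call this function every few seconds and rerun the evaluation whenever the returned list is non-empty. Comparing modification times is one simple way to detect updates; a content hash would be more robust against clock changes at the cost of reading every file.
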
## Web Portal

There is also a lightweight web app with:

- participant upload page
- leaderboard page
- upload API that immediately reruns evaluation

The portal now supports two storage modes:

- `local`: keep submissions, gold, and leaderboard files on the local filesystem
- `hf_dataset`: keep organizer-private state in a private Hugging Face dataset repo while serving the public web app from a Hugging Face Space

Start it from the repo root:

```bash
uvicorn task1_dev_leaderboard.web_app:app --host 0.0.0.0 --port 8091
```

Pages:

- `/task1/dev`
- `/task1/dev/submit`
- `/task1/dev/leaderboard`

Key API routes:

- `POST /api/task1/submissions`
- `GET /api/task1/leaderboard`
- `GET /api/task1/devset/download`
- `GET /api/task1/template/download`

Current behavior:

- each team uploads one `.json` or `.jsonl` file
- re-submitting from the same team replaces the previous file
- the leaderboard is refreshed immediately after a successful upload
- the leaderboard page polls the latest results automatically

### Hugging Face Storage Mode

Set these environment variables before starting the app:

```bash
export TASK1_STORAGE_BACKEND=hf_dataset
export TASK1_HF_REPO_ID=/
export HF_TOKEN=
```

Recommended layout:

- public Hugging Face Space for the web app
- private Hugging Face dataset repo for:
  - organizer gold file
  - participant submissions
  - leaderboard outputs

The public dev set and submission template remain in the app repo. The private gold file does not.

### Bootstrap The Private HF Dataset Repo

Create and initialize the organizer-private dataset repo:

```bash
python task1_dev_leaderboard/bootstrap_hf_backend.py \
  --repo-id /
```

By default the dataset repo is created as private. This is the recommended setting.

### Prepare A Standalone HF Space Repo

Generate a clean Space folder that contains only the files needed by the portal:

```bash
python task1_dev_leaderboard/prepare_hf_space.py
```

This creates:

```bash
task1_dev_leaderboard_hf_space/
```

That folder contains:

- a Docker Space `README.md`
- a `Dockerfile`
- a minimal `requirements.txt`
- the public `task1_dev_leaderboard` app package

Push that generated folder to a new Hugging Face Space repo.

### Hugging Face Space Runtime Notes

- use a `Docker Space`
- set the Space secret `HF_TOKEN`
- set the Space variables:
  - `TASK1_STORAGE_BACKEND=hf_dataset`
  - `TASK1_HF_REPO_ID=/`
- the app runs fine on free CPU hardware
- the private dataset repo should remain private because it stores gold labels and submissions

## Supported Input Formats

### Dataset format: `mcq_jsonl`

Expected JSONL fields by default:

```json
{
  "id": 0,
  "question": "...",
  "option_A": "...",
  "option_B": "...",
  "option_C": "...",
  "option_D": "...",
  "answer": "A",
  "category": "..."
}
```

You can override field names in the config.

### Dataset format: `mcq_parquet`

Useful when the dataset comes as a parquet file. Example schema:

```json
{
  "id": 1,
  "question": "...",
  "choice_a": "...",
  "choice_b": "...",
  "choice_c": "...",
  "choice_d": "...",
  "choice_e": "...",
  "correct_answer": "c"
}
```

Map the fields in config, for example:

```json
{
  "key": "arabic_accounting_clef",
  "format": "mcq_parquet",
  "path": "../data/accounting_CLEF/data/train-00000-of-00001.parquet",
  "answer_field": "correct_answer",
  "option_fields": {
    "A": "choice_a",
    "B": "choice_b",
    "C": "choice_c",
    "D": "choice_d",
    "E": "choice_e"
  }
}
```

### Prediction format: `per_item_mcq_csv`

Expected CSV fields by default:

```text
id,category,gold_letter,pred_letter,is_correct,raw_output
```

Only `id` and `pred_letter` are required for scoring.

### Prediction format: `task1_jsonl`

Useful for future Task 1 runs.

Each line should contain:

```json
{
  "id": "...",
  "prediction": "B"
}
```

Accepted prediction field candidates by default:

- `prediction`
- `pred_letter`
- `answer`
- `answer_label`
- `label`

## Config Shape

```json
{
  "datasets": [
    {
      "key": "dataset_key",
      "name": "Human-readable name",
      "format": "mcq_jsonl",
      "path": "relative/or/absolute/path.jsonl",
      "language": "en"
    }
  ],
  "runs": [
    {
      "model_name": "model_x",
      "format": "per_item_mcq_csv",
      "path": "relative/or/absolute/path.csv",
      "datasets": ["dataset_key"]
    }
  ]
}
```

Paths are resolved relative to the config file.

## Recommended Next Step

As more internal Task 1-style datasets become available, add them as new dataset entries instead of hardcoding logic into the script. That keeps the leaderboard reproducible and makes it easier to separate:

- official-like dev sets
- auxiliary private sets
- language-specific stress tests
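
As a rough illustration of what a new dataset entry feeds into, the sketch below computes overall and per-category accuracy from gold rows in the default `mcq_jsonl` shape and prediction rows that use any of the candidate field names listed for `task1_jsonl`. It is not the repository's scoring code. The function names, the case-insensitive letter comparison, the `uncategorized` fallback, and the choice to count missing predictions as incorrect are assumptions of this sketch.

```python
from collections import defaultdict

# Candidate prediction field names, as listed for the `task1_jsonl` format.
PREDICTION_FIELDS = ("prediction", "pred_letter", "answer", "answer_label", "label")


def extract_prediction(row):
    """Return the first non-empty candidate field as an upper-case string, else None."""
    for field in PREDICTION_FIELDS:
        value = row.get(field)
        if value is not None and str(value).strip():
            return str(value).strip().upper()
    return None


def score(gold_rows, prediction_rows):
    """Compute overall and per-category accuracy (hypothetical helper).

    Assumptions: ids are compared as strings, letters are compared
    case-insensitively, and items without a prediction count as incorrect.
    """
    predictions = {str(r["id"]): extract_prediction(r) for r in prediction_rows}
    hits = 0
    total = 0
    by_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in gold_rows:
        gold = str(item["answer"]).strip().upper()
        hit = predictions.get(str(item["id"])) == gold
        total += 1
        hits += int(hit)
        category = item.get("category") or "uncategorized"
        by_category[category][0] += int(hit)
        by_category[category][1] += 1
    return {
        "overall": hits / total if total else 0.0,
        "by_category": {c: k / n for c, (k, n) in by_category.items()},
    }
```

Keeping the field-name handling in one small function means a new dataset or run format only needs a mapping in the config, not a change to the scoring loop.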