Spaces: ellamind/base-eval

maxidl committed on Mar 25

Commit 2727392 · verified · 1 parent: 6a4eb89

Upload README.md with huggingface_hub

Files changed (1)

README.md +51 -29

@@ -1,60 +1,82 @@
 ---
-title: Eval Suite Visualization
 emoji: 📊
 colorFrom: blue
 colorTo: indigo
 sdk: static
 pinned: false
 ---
-# Eval Suite Visualization
-A static web app for visualizing LLM evaluation scores. Data is loaded directly from a HuggingFace dataset ([ellamind/eval-scores](https://huggingface.co/datasets/ellamind/eval-scores)) using DuckDB-WASM — no preprocessing or backend required.
 ## Features
-- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
-- **Multiple metrics**: `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
-- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
-- **Auto chart type**: line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
-- **Multi-panel layout**: add multiple independent panels side by side
-- **Smoothing**: configurable moving average for line charts
-- **Export**: download charts as PNG or SVG
-## Quick Start
-Serve the app with any static file server:
-```bash
-python3 -m http.server 8080
-```
-Then open `http://localhost:8080`. The app fetches the parquet data directly from HuggingFace on load.
-## Project Structure
-```
-index.html    # Single-file web app (HTML + CSS + JS)
-config.yaml   # Model color overrides
-README.md     # HF Spaces metadata + docs
-```
 ## Configuration
-Model colors can be customized in `config.yaml`:
 ```yaml
 model_colors:
-  "D01": "#4361ee"
-  "Qwen3 1.7B": "#6F53D1"
 ```
-Exact matches are checked first, then prefix matches. Models without a configured color get assigned one from a default palette.
-## Deployment
-This app is deployed as a [Static HTML Space](https://huggingface.co/docs/hub/spaces-sdks-static) on Hugging Face. To deploy:
 ```bash
-huggingface-cli upload ellamind/eval-suite-visualization . . --repo-type space
 ```

 ---
+title: ellamind base-eval
 emoji: 📊
 colorFrom: blue
 colorTo: indigo
 sdk: static
 pinned: false
+hf_oauth: true
+hf_oauth_scopes:
+  - read-repos
 ---
+# ellamind base-eval
+Interactive visualization of LLM evaluation scores during pre-training. Data is loaded from Hugging Face datasets via DuckDB-WASM; no backend is required.
+- **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
+- **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)
 ## Features
+- **Hierarchical task selection** — eval suite → task group → individual benchmark, with aggregate views
+- **Multiple metrics** — `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
+- **Model comparison** — toggle models on/off; separate checkpoint runs from baselines
+- **Auto chart type** — line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
+- **Multi-panel layout** — add multiple independent panels
+- **Merge datasets** — append rows from additional HF datasets (including private ones via OAuth)
+- **Smoothing** — configurable moving average for line charts
+- **Benchmark goodness metrics** — per-task quality indicators below line charts
+- **Export** — download charts as PNG or SVG
+## Merge Datasets
+You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.
+For **private datasets**, sign in with your Hugging Face account using the OAuth button. The access token is then sent automatically when fetching the dataset.
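
As an illustration of how a dataset path of the two accepted forms might be turned into a download request, here is a minimal Python sketch. It is not taken from the app, which runs in the browser, and its actual resolution logic is not shown in this README. The helper names `resolve_parquet_url` and `build_request`, the `main` revision, and the default file name `scores.parquet` are assumptions made for the example. Only the Hub's `/datasets/<repo>/resolve/<revision>/<file>` URL layout and the `Authorization: Bearer` header are standard.

```python
from urllib.request import Request

HUB = "https://huggingface.co"
# Hypothetical default: the README does not say which file is used
# when only `org/dataset-name` is given.
DEFAULT_FILE = "scores.parquet"


def resolve_parquet_url(path, default_file=DEFAULT_FILE, revision="main"):
    """Map `org/name` or `org/name/file.parquet` to a Hub download URL."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected 'org/dataset-name[/file]', got {path!r}")
    repo = "/".join(parts[:2])                       # org/dataset-name
    filename = "/".join(parts[2:]) or default_file   # optional explicit file
    return f"{HUB}/datasets/{repo}/resolve/{revision}/{filename}"


def build_request(url, token=None):
    """Attach the OAuth access token, if any, as a Bearer header."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return Request(url, headers=headers)
```

A public dataset would call `build_request(url)` without a token. A private one would pass the token obtained after sign-in.
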
+## Benchmark Goodness Metrics
+Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).
+| Metric | What it measures | Green | Yellow | Red |
+|---|---|---|---|---|
+| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
+| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
+| **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | — | — | — |
+| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
+| **Discrimination** | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
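
To make the table concrete, here is a rough Python sketch of four of the metrics and of the color banding. It is an assumption-laden illustration, not the app's client-side code. In particular, the exact formula for Signal Strength (relative change from the first to the last score), the use of the population standard deviation for Discrimination, and all function names are guesses. MAD here means the median absolute deviation. Ordering (Kendall's Tau) is omitted for brevity.

```python
from statistics import median, pstdev


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def monotonicity(steps, scores):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(steps), average_ranks(scores))


def signal_strength(scores):
    """Assumed definition: relative change from first to last score."""
    first, last = scores[0], scores[-1]
    return (last - first) / abs(first) if first else float("nan")


def noise(scores):
    """Median absolute deviation of consecutive score differences."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    center = median(diffs)
    return median([abs(d - center) for d in diffs])


def discrimination(last_scores):
    """Spread of the models' scores at the last checkpoint (population std)."""
    return pstdev(last_scores)


def band(value, green, yellow):
    """Map a value to the table's color bands (higher is better)."""
    if value >= green:
        return "green"
    return "yellow" if value >= yellow else "red"
```

For example, `band(monotonicity(steps, scores), 0.7, 0.4)` applies the Monotonicity thresholds from the table, and `band(discrimination(last_scores), 0.03, 0.01)` applies the Discrimination thresholds. Noise has no thresholds in the table, so it is reported as a raw value.
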
 ## Configuration
+Model colors are set in `config.yaml`:
 ```yaml
 model_colors:
+  "Qwen3 1.7B": "#9575CD"
+  "Gemma 3 4B": "#00B0FF"
 ```
+## Local Development
+```bash
+python3 -m http.server 8080
+```
+OAuth sign-in is available only when the app is deployed as an HF Space. When the app is served locally, the sign-in button is hidden.
+## Deployment
 ```bash
+pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
+pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
+pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
+```
+## Project Structure
+```
+index.html    # Single-file web app (HTML + CSS + JS)
+config.yaml   # Model color overrides
+README.md     # HF Spaces metadata + docs
 ```