maxidl committed on
Commit
2727392
·
verified ·
1 Parent(s): 6a4eb89

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +51 -29
README.md CHANGED
@@ -1,60 +1,82 @@
1
  ---
2
- title: Eval Suite Visualization
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
6
  sdk: static
7
  pinned: false
 
 
 
8
  ---
9
 
10
- # Eval Suite Visualization
11
 
12
- A static web app for visualizing LLM evaluation scores. Data is loaded directly from a HuggingFace dataset ([ellamind/eval-scores](https://huggingface.co/datasets/ellamind/eval-scores)) using DuckDB-WASM, so no preprocessing or backend is required.
 
 
 
13
 
14
  ## Features
15
 
16
- - **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
17
- - **Multiple metrics**: `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
18
- - **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
19
- - **Auto chart type**: line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
20
- - **Multi-panel layout**: add multiple independent panels side by side
21
- - **Smoothing**: configurable moving average for line charts
22
- - **Export**: download charts as PNG or SVG
 
 
23
 
24
- ## Quick Start
25
 
26
- Serve the app with any static file server:
27
 
28
- ```bash
29
- python3 -m http.server 8080
30
- ```
31
 
32
- Then open `http://localhost:8080`. The app fetches the parquet data directly from HuggingFace on load.
33
 
34
- ## Project Structure
35
 
36
- ```
37
- index.html # Single-file web app (HTML + CSS + JS)
38
- config.yaml # Model color overrides
39
- README.md # HF Spaces metadata + docs
40
- ```
 
 
41
 
42
  ## Configuration
43
 
44
- Model colors can be customized in `config.yaml`:
45
 
46
  ```yaml
47
  model_colors:
48
- "D01": "#4361ee"
49
- "Qwen3 1.7B": "#6F53D1"
50
  ```
51
 
52
- Exact matches are checked first, then prefix matches. Models without a configured color get assigned one from a default palette.
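
The lookup order described above (exact match, then prefix match, then default palette) could be sketched as follows. This is a minimal illustration, not the app's code: the palette values, the choice that the longest matching key wins, and the reading of "prefix match" as "a configured key is a prefix of the model name" are all assumptions.

```python
from itertools import cycle

# Hypothetical fallback palette; the app's real default palette is not shown in this README.
DEFAULT_PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def make_color_resolver(model_colors, palette=DEFAULT_PALETTE):
    """Return a function that maps a model name to a hex color."""
    fallback = cycle(palette)
    assigned = {}  # remembers fallback colors so a model keeps the same color

    def resolve(model_name):
        # 1. An exact match wins.
        if model_name in model_colors:
            return model_colors[model_name]
        # 2. Prefix match (assumption: a configured key is a prefix of the
        #    model name, and the longest such key wins).
        prefixes = [key for key in model_colors if model_name.startswith(key)]
        if prefixes:
            return model_colors[max(prefixes, key=len)]
        # 3. Otherwise take the next palette color, cycling when exhausted.
        if model_name not in assigned:
            assigned[model_name] = next(fallback)
        return assigned[model_name]

    return resolve


# Colors taken from the config example above.
resolve = make_color_resolver({"D01": "#4361ee", "Qwen3 1.7B": "#6F53D1"})
```

With this sketch, a checkpoint name such as `D01-step-1000` (a made-up name) would inherit the `D01` color, while an unconfigured model would receive the first palette entry.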
53
 
54
- ## Deployment
 
 
 
 
55
 
56
- This app is deployed as a [Static HTML Space](https://huggingface.co/docs/hub/spaces-sdks-static) on Hugging Face. To deploy:
57
 
58
  ```bash
59
- huggingface-cli upload ellamind/eval-suite-visualization . . --repo-type space
 
 
 
 
 
 
 
 
 
 
60
  ```
 
1
  ---
2
+ title: ellamind base-eval
3
  emoji: 📊
4
  colorFrom: blue
5
  colorTo: indigo
6
  sdk: static
7
  pinned: false
8
+ hf_oauth: true
9
+ hf_oauth_scopes:
10
+ - read-repos
11
  ---
12
 
13
+ # ellamind base-eval
14
 
15
+ Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM, with no backend required.
16
+
17
+ - **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
18
+ - **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)
19
 
20
  ## Features
21
 
22
+ - **Hierarchical task selection** - eval suite → task group → individual benchmark, with aggregate views
23
+ - **Multiple metrics** - `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
24
+ - **Model comparison** - toggle models on/off; separate checkpoint runs from baselines
25
+ - **Auto chart type** - line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
26
+ - **Multi-panel layout** - add multiple independent panels
27
+ - **Merge datasets** - append rows from additional HF datasets (including private ones via OAuth)
28
+ - **Smoothing** - configurable moving average for line charts
29
+ - **Benchmark goodness metrics** - per-task quality indicators below line charts
30
+ - **Export** - download charts as PNG or SVG
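
To illustrate the smoothing feature listed above, here is a minimal sketch of a moving average with a configurable window. Whether the app uses a trailing or a centered window, and how it treats the first few points, is not stated in this README; the trailing window and the shortened average at the start are assumptions.

```python
def moving_average(scores, window):
    """Trailing moving average over `window` points.

    Early points, where fewer than `window` values exist, are averaged
    over the values available so far (an assumption for this sketch).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(scores):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= scores[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A window of 1 leaves the series unchanged, which is a convenient "smoothing off" setting.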
31
 
32
+ ## Merge Datasets
33
 
34
+ You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.
35
 
36
+ For **private datasets**, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.
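
The two accepted path forms and the row-append behavior could be sketched like this. It is an illustration under assumptions, not the app's implementation: `DEFAULT_FILE` is a hypothetical file name (the README does not say which file is used when none is given), and the helper names are invented. The `resolve/<revision>/<file>` URL layout and the `Authorization: Bearer` header are the standard Hugging Face Hub conventions.

```python
HF_BASE = "https://huggingface.co/datasets"
DEFAULT_FILE = "scores.parquet"  # hypothetical default; not stated in the README


def resolve_dataset_url(path, revision="main"):
    """Turn 'org/name' or 'org/name/file.parquet' into a download URL."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError("expected 'org/dataset-name[/file.parquet]'")
    repo = "/".join(parts[:2])
    filename = "/".join(parts[2:]) or DEFAULT_FILE
    return f"{HF_BASE}/{repo}/resolve/{revision}/{filename}"


def auth_headers(token=None):
    """Private datasets need a bearer token; public ones need no header."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def merge_rows(base_rows, extra_rows):
    """Row-append: the extra rows simply follow the base rows."""
    return list(base_rows) + list(extra_rows)
```

Because merging is a plain append, the additional dataset presumably needs the same columns as the base dataset for the combined table to be usable.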
 
 
37
 
38
+ ## Benchmark Goodness Metrics
39
 
40
+ Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).
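
As a rough sketch of the three stages, the checkpoints of one run can be split in half after sorting by step. The README does not say whether "half" means half of the checkpoints or half of the training steps, nor where a middle checkpoint goes; splitting by checkpoint count, with the middle point going to the late stage, is an assumption.

```python
def split_stages(steps, scores):
    """Split one run into overall, early and late (step, score) points."""
    points = sorted(zip(steps, scores))  # order checkpoints by step
    mid = len(points) // 2               # with an odd count, late gets the extra point
    return {
        "overall": points,
        "early": points[:mid],
        "late": points[mid:],
    }
```

Each metric would then be computed three times, once per stage.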
41
 
42
+ | Metric | What it measures | Green | Yellow | Red |
43
+ |---|---|---|---|---|
44
+ | **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
45
+ | **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
46
+ | **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | n/a | n/a | n/a |
47
+ | **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
48
+ | **Discrimination** | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
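
The table's definitions can be made concrete with a standard-library sketch. These are plausible readings of the one-line descriptions, not the app's client-side code. The assumptions are: signal strength is `(last - first) / |first|`, MAD means median absolute deviation, Kendall's Tau is the tau-a variant applied to the scores of the same models at two checkpoints, and Std is the population standard deviation.

```python
from statistics import mean, median, pstdev


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def monotonicity(steps, scores):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(steps), ranks(scores))


def signal_strength(scores):
    """Relative improvement of the last score over the first (first must be non-zero)."""
    return (scores[-1] - scores[0]) / abs(scores[0])


def noise(scores):
    """Median absolute deviation of consecutive score differences."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    center = median(diffs)
    return median(abs(d - center) for d in diffs)


def ordering(scores_a, scores_b):
    """Kendall's tau-a between model scores at two checkpoints."""
    n = len(scores_a)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
            total += (sign > 0) - (sign < 0)  # +1 concordant, -1 discordant
    return total / (n * (n - 1) / 2)


def discrimination(final_scores):
    """Population standard deviation of model scores at the last checkpoint."""
    return pstdev(final_scores)
```

Using the median rather than the mean for noise is what keeps a single large jump, such as a data-mix change, from dominating the estimate.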
49
 
50
  ## Configuration
51
 
52
+ Model colors in `config.yaml`:
53
 
54
  ```yaml
55
  model_colors:
56
+ "Qwen3 1.7B": "#9575CD"
57
+ "Gemma 3 4B": "#00B0FF"
58
  ```
59
 
60
+ ## Local Development
61
 
62
+ ```bash
63
+ python3 -m http.server 8080
64
+ ```
65
+
66
+ OAuth sign-in is only available when deployed as an HF Space. Locally, it is hidden.
67
 
68
+ ## Deployment
69
 
70
  ```bash
71
+ pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
72
+ pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
73
+ pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
74
+ ```
75
+
76
+ ## Project Structure
77
+
78
+ ```
79
+ index.html # Single-file web app (HTML + CSS + JS)
80
+ config.yaml # Model color overrides
81
+ README.md # HF Spaces metadata + docs
82
  ```