lordx64 committed on
Commit ea97e8f · verified · 1 Parent(s): 44aabe8

Card: full benchmark table with status column (all 🚧 in progress) + methodology notes

Files changed (1)
  1. README.md +23 -13
README.md CHANGED
@@ -135,19 +135,29 @@ Content domain: web/game development, Three.js scenes, multiplayer FPS prototype
 
  ## Evaluation
 
- > **Pending the first eval batch.** Numbers will appear here when the lm-eval-harness + agent-coding eval suites complete.
-
- Planned benchmarks:
-
- | Benchmark | Why | Expected signal |
- |---|---|---|
- | GSM8K-CoT | Verify reasoning prior preserved through second SFT | Should match or beat the Opus 4.7 distill (~84% flexible) |
- | MMLU-Pro | General knowledge reasoning | Same as above (~75%) |
- | HumanEval / MBPP | Pure code completion (non-agentic) | Baseline check |
- | SWE-bench Lite (or BCB-Hard) | Agentic coding ability | This is where Fable-5 SFT should show gains over the Opus 4.7 base |
- | `qwen3-6-distill-eval` Space | Side-by-side qualitative comparison vs base + Claude/Kimi distills | Live at [`lordx64/qwen3-6-distill-eval`](https://huggingface.co/spaces/lordx64/qwen3-6-distill-eval) |
-
- Following our project convention: numbers stay blank until verified. If a benchmark hits a known extraction bug (e.g. AIME strict-match returning 0 because the model outputs `**Answer: N**` instead of `\boxed{N}`), we omit it rather than publish a misleading score.
+ > 🚧 **Evals are in progress.** This table will fill in as each suite completes; nothing here is published until verified.
+
+ | Benchmark | Setup | Tests | Score | Status |
+ |---|---|---|---:|---|
+ | **GSM8K-CoT** | 8-shot, multi-turn, limit 300 | Grade-school math; verify reasoning prior preserved through the second SFT round | _pending_ | 🚧 in progress |
+ | **MMLU-Pro** | 5-shot, multi-turn, limit 500 | Hard multi-subject knowledge reasoning | _pending_ | 🚧 in progress |
+ | **MMLU-Pro** (per-subject) | Same as above | Biology / Math / Psychology / etc. breakdown | _pending_ | 🚧 in progress |
+ | **GPQA Diamond** | 0-shot CoT | Graduate-level STEM | _pending_ | 🚧 in progress |
+ | **MATH-500** | 0-shot, `math_verify` metric | Competition math; tests reasoning depth | _pending_ | 🚧 in progress |
+ | **AIME 2024 / 2025** | 0-shot CoT | Olympiad-level math; sensitivity to answer-extraction | _pending_ | 🚧 in progress |
+ | **HumanEval / MBPP** | pass@1 / pass@10 | Pure code completion (non-agentic baseline) | _pending_ | 🚧 in progress |
+ | **IFEval** | 0-shot | Instruction-following adherence | _pending_ | 🚧 in progress |
+ | **SWE-bench Lite** (or BCB-Hard) | with agent harness + tool registry | **The key test**: agentic coding ability vs Opus 4.7 base | _pending_ | 🚧 in progress |
+ | **`qwen3-6-distill-eval` Space** | 17 head-to-head prompts (12 design + 5 agentic) | Side-by-side qualitative comparison vs Qwen3.6 base + Opus 4.7 + Kimi K2.6 distills, with human-readable HTML output | _pending_ | 🚧 in progress |
+
+ Methodology used (same as the Opus 4.7 / Kimi K2.6 evals on this project):
+ - vLLM serving at 64k context so reasoning chains never truncate before answering
+ - `<think>…</think>` stripped before regex extractors run (otherwise extractors grab letters/numbers from inside the reasoning, not the final answer)
+ - Per-task `num_fewshot` (lm-eval's single global value can't handle GSM8K-8shot + GPQA-0shot together)
+ - `fewshot_as_multiturn=True` for chat-template fidelity
+ - `math_verify` metric for `MATH-500` and `AIME` (catches semantic equivalence; raw `strict-match` against `\boxed{N}` returns 0% even on correct answers because the model says `**Answer: N**`)
+
+ Standing rule on this project: **numbers stay blank until verified**. If a benchmark hits a known extraction bug we couldn't cleanly fix, the row says so and we omit the score rather than publish a misleading one.
 
  ## Usage
 
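
The methodology notes in this diff describe stripping `<think>…</think>` before any regex extractor runs, and accepting `**Answer: N**` as well as `\boxed{N}`. A minimal Python sketch of that idea follows. It is not the project's actual extractor: the names `strip_think` and `extract_final_answer` and both regexes are hypothetical, chosen only to illustrate why the order of operations matters. It also does not attempt the semantic-equivalence checking the notes attribute to the `math_verify` metric.

```python
import re

# Assumption: reasoning is wrapped in a closed <think>...</think> pair.
# A truncated (unclosed) block is NOT removed by this pattern.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

# Two hypothetical final-answer formats, tried in this order.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")
BOLD_ANSWER = re.compile(r"\*\*Answer:\s*(.+?)\*\*")


def strip_think(text: str) -> str:
    """Drop closed reasoning blocks so extractors only see the visible reply."""
    return THINK_BLOCK.sub("", text).strip()


def extract_final_answer(completion: str):
    """Return the last answer found outside the reasoning, or None."""
    visible = strip_think(completion)
    for pattern in (BOXED, BOLD_ANSWER):
        matches = pattern.findall(visible)
        if matches:
            return matches[-1].strip()
    return None


if __name__ == "__main__":
    sample = (
        "<think>First guess \\boxed{12}? No, 6 * 7 = 42.</think>\n"
        "**Answer: 42**"
    )
    # Without stripping, the \boxed{12} inside the reasoning would be matched first.
    print(extract_final_answer(sample))
```

Because the sketch leaves unclosed `<think>` blocks in place, a truncated generation could still leak reasoning into the extractor, which is presumably one reason the notes pair stripping with a 64k serving context.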