sramjee committed · Commit 563450e · verified · 1 Parent(s): f023ec8

RedlineBench results card

Files changed (2)
  1. .eval_results/redlinebench.yaml +10 -0
  2. README.md +13 -0
.eval_results/redlinebench.yaml ADDED
@@ -0,0 +1,10 @@
+ - dataset:
+     id: crosbylegal/RedlineBench
+     task_id: redline_overall
+   value: 44.4
+   date: "2026-06-17"
+   source:
+     url: https://intelligence.crosby.ai/benchmark/
+     name: RedlineBench report
+     user: crosbylegal
+   notes: "agent=claude-code; 3-LLM judge panel (majority vote); turn-weighted weighted pass rate (0-100); published report figure"
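
The `notes` field mentions a 3-judge majority vote and a turn-weighted pass rate on a 0-100 scale, but the exact formula is not given in this commit. As a rough illustration only, the sketch below shows one plausible reading: each task passes if most judges say so, and tasks are weighted by their number of turns. The function names, the input shape, and the weighting rule are all assumptions, not the report's method.

```python
def majority_vote(verdicts):
    """Return True when strictly more than half of the judge verdicts are True."""
    return sum(bool(v) for v in verdicts) * 2 > len(verdicts)


def turn_weighted_pass_rate(tasks):
    """Compute a 0-100 pass rate where each task counts in proportion to its turns.

    `tasks` is a list of (num_turns, judge_verdicts) pairs. This input shape is
    hypothetical and chosen only for the illustration.
    """
    total_turns = sum(turns for turns, _ in tasks)
    if total_turns == 0:
        raise ValueError("at least one task with a positive turn count is required")
    passed_turns = sum(turns for turns, verdicts in tasks if majority_vote(verdicts))
    return 100.0 * passed_turns / total_turns


# Toy data: a 1-turn task that passes 2-1 and a 3-turn task that fails 1-2.
example_tasks = [
    (1, [True, True, False]),
    (3, [False, False, True]),
]
example_score = turn_weighted_pass_rate(example_tasks)
```

Under this reading, a long task that fails pulls the score down more than a short one, which is why a turn-weighted figure can differ from a plain per-task pass rate.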
README.md ADDED
@@ -0,0 +1,13 @@
+ # Claude Opus 4.8 — RedlineBench results card
+
+ This is a **results-tracking model card** for the API model **Claude Opus 4.8**. It
+ holds no weights. Its purpose is to host evaluation results for
+ [RedlineBench](https://huggingface.co/datasets/crosbylegal/RedlineBench) on the
+ Hugging Face Hub, since Claude Opus 4.8 has no public model repository.
+
+ - **Benchmark:** [crosbylegal/RedlineBench](https://huggingface.co/datasets/crosbylegal/RedlineBench)
+ - **Report:** https://intelligence.crosby.ai/benchmark/
+ - **RedlineBench `redline_overall`:** 44.4
+
+ Results live in [`.eval_results/`](./.eval_results). Scores are attributed to the
+ published report (community/source — not HF-`verified`, which is inspect-ai-only).
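
Because the same score appears in both the YAML entry and the README, a small sanity check can catch a mistyped value or date before committing. The sketch below mirrors the entry as a Python dict and runs a few basic checks. `validate_entry` and its rules are hypothetical helpers for illustration, not part of any Hugging Face tooling or schema.

```python
from datetime import date

# Python mirror of the entry added in .eval_results/redlinebench.yaml.
ENTRY = {
    "dataset": {"id": "crosbylegal/RedlineBench", "task_id": "redline_overall"},
    "value": 44.4,
    "date": "2026-06-17",
    "source": {
        "url": "https://intelligence.crosby.ai/benchmark/",
        "name": "RedlineBench report",
        "user": "crosbylegal",
    },
}


def validate_entry(entry):
    """Return a list of problems found; an empty list means the entry looks sane."""
    problems = []
    dataset = entry.get("dataset", {})
    for key in ("id", "task_id"):
        if not dataset.get(key):
            problems.append(f"dataset.{key} is missing")
    value = entry.get("value")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    # The notes describe the metric as a 0-100 pass rate, so enforce that range.
    if not is_number or not 0 <= value <= 100:
        problems.append("value must be a number in [0, 100]")
    try:
        date.fromisoformat(str(entry.get("date")))
    except ValueError:
        problems.append("date must be ISO formatted (YYYY-MM-DD)")
    if not str(entry.get("source", {}).get("url", "")).startswith("https://"):
        problems.append("source.url should be an https URL")
    return problems


problems = validate_entry(ENTRY)
```

The checks are deliberately loose: they only guard against obvious slips and make no claim about what the Hub itself accepts.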