crosbylegal/claude-opus-4-8

sramjee committed 14 days ago

Commit 563450e · verified · 1 Parent(s): f023ec8

RedlineBench results card

Files changed (2)

.eval_results/redlinebench.yaml +10 -0
README.md +13 -0

.eval_results/redlinebench.yaml ADDED

	@@ -0,0 +1,10 @@

+- dataset:
+    id: crosbylegal/RedlineBench
+    task_id: redline_overall
+  value: 44.4
+  date: "2026-06-17"
+  source:
+    url: https://intelligence.crosby.ai/benchmark/
+    name: RedlineBench report
+    user: crosbylegal
+  notes: "agent=claude-code; 3-LLM judge panel (majority vote); turn-weighted weighted pass rate (0-100); published report figure"

README.md ADDED

	@@ -0,0 +1,13 @@

+# Claude Opus 4.8 — RedlineBench results card
+
+This is a **results-tracking model card** for the API model **Claude Opus 4.8**. It
+holds no weights. Its purpose is to host evaluation results for
+[RedlineBench](https://huggingface.co/datasets/crosbylegal/RedlineBench) on the
+Hugging Face Hub, since Claude Opus 4.8 has no public model repository.
+
+- **Benchmark:** [crosbylegal/RedlineBench](https://huggingface.co/datasets/crosbylegal/RedlineBench)
+- **Report:** https://intelligence.crosby.ai/benchmark/
+- **RedlineBench `redline_overall`:** 44.4
+
+Results live in [`.eval_results/`](./.eval_results). Scores are attributed to the
+published report (community/source — not HF-`verified`, which is inspect-ai-only).
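
The `notes` field in `.eval_results/redlinebench.yaml` describes the score as a majority vote over a three-judge panel, aggregated into a turn-weighted pass rate on a 0-100 scale. The commit does not define the weighting, so the following is only a sketch of one plausible reading: it assumes each task's weight is its number of turns and that a task passes when more than half of the judges pass it. The function names and sample data are hypothetical and are not taken from the RedlineBench report.

```python
def majority_pass(votes):
    """Return True when more than half of the judge votes are passes."""
    return sum(votes) > len(votes) / 2


def turn_weighted_pass_rate(tasks):
    """Aggregate (turns, votes) pairs into a 0-100 score.

    ASSUMPTION: a task's weight is its turn count; the source does not
    specify the actual weighting scheme.
    """
    total_weight = sum(turns for turns, _ in tasks)
    if total_weight == 0:
        raise ValueError("no turns to weight by")
    passed_weight = sum(turns for turns, votes in tasks if majority_pass(votes))
    return 100.0 * passed_weight / total_weight


# Hypothetical sample: (number of turns, votes from three judges).
tasks = [
    (2, [True, True, False]),   # passes 2 of 3
    (5, [False, False, True]),  # fails 1 of 3
    (3, [True, True, True]),    # passes 3 of 3
]
score = turn_weighted_pass_rate(tasks)
print(f"{score:.1f}")
```

Under this reading, a long task that fails pulls the score down more than a short one, which is why the weighted rate can differ noticeably from a plain fraction of tasks passed.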