📈 Scaling Analysis
Explore how model performance scales with interaction turns, token usage, and inference cost.
Datasets: MIMIC, 10-K, GLOBEM
🏆 Ranking Comparison
Compares each model's novelty ranking (Bradley-Terry) with its accuracy ranking. ● = Novelty, ◇ = Accuracy. Purple = proprietary, green = open-source.
Datasets: MIMIC, 10-K, GLOBEM
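For readers unfamiliar with Bradley-Terry rankings, the sketch below shows one common way to fit them: the standard minorization-maximization (MM) update applied to a matrix of pairwise win counts. The page does not say how the novelty comparisons were collected or fitted, so the function name `bradley_terry`, the `wins` matrix layout, and the example counts are all hypothetical.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] = number of comparisons in which item i was preferred to item j.
    Returns strengths normalized to sum to 1 (higher = preferred more often).
    Assumes every item has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical win counts for three models (illustrative only, not real results).
example_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(example_wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Sorting models by fitted strength gives the novelty ranking, which can then be placed next to a ranking sorted by accuracy.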
🔄 Turn Distribution
Analyze the distribution of interaction turns across different models and datasets.
Datasets: MIMIC, 10-K, GLOBEM
🔬 Entropy Analysis
Scatter plot of Access Entropy vs. Coverage for each model. Marker opacity encodes accuracy. Higher entropy means access is spread more uniformly across fields; higher coverage means more fields were explored.
Models: GPT-5.2, Claude-4.5-Sonnet, Gemini-3-Flash, GLM-4.6, Qwen3-Next-80B-A3B, DeepSeek-V3.2
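The two axes can be illustrated with a minimal sketch. It assumes Access Entropy is the Shannon entropy of how often each field was accessed, and Coverage is the fraction of available fields accessed at least once. The page does not give the exact definitions (for example, whether entropy is normalized), so both functions and the field names are hypothetical.

```python
import math
from collections import Counter


def access_entropy(accesses):
    """Shannon entropy (in bits) of the field-access distribution."""
    counts = Counter(accesses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        prob = c / total
        entropy -= prob * math.log2(prob)
    return entropy


def coverage(accesses, all_fields):
    """Fraction of the available fields that were accessed at least once."""
    available = set(all_fields)
    if not available:
        return 0.0
    return len(set(accesses) & available) / len(available)


# Hypothetical access log for one agent run (field names are made up).
fields = ["age", "labs", "notes", "vitals"]
log = ["age", "labs", "labs", "labs"]
run_entropy = access_entropy(log)
run_coverage = coverage(log, fields)
```

Under these assumptions, an agent that hits one field repeatedly scores low on both axes, while one that samples every field equally reaches the maximum entropy of log2(number of fields) and a coverage of 1.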
⚠️ Error Analysis
Breakdown of error types encountered during agent interactions, grouped by main categories.
🔍 Probing Results
Analyze the average log probability of FINISH messages across conversation turns and progress.
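As a rough illustration of the per-turn aggregation, the sketch below averages FINISH log probabilities by turn index. It assumes the probe produces one `(turn, logprob)` pair per turn of each conversation. That record layout, the function name, and the example values are hypothetical, since the page does not describe the probing setup.

```python
from collections import defaultdict


def mean_logprob_by_turn(records):
    """Average the FINISH log probability over all conversations, per turn.

    records: iterable of (turn_index, finish_logprob) pairs.
    Returns a dict mapping turn_index -> mean log probability, sorted by turn.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for turn, logprob in records:
        sums[turn] += logprob
        counts[turn] += 1
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}


# Hypothetical probe output from two conversations (values are made up).
probe_records = [(1, -4.0), (2, -2.5), (1, -2.0), (2, -1.5), (3, -0.5)]
per_turn = mean_logprob_by_turn(probe_records)
```

The same grouping could be keyed on a normalized position (turn divided by conversation length) to plot against progress instead of the raw turn index.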