Scaling Analysis
Explore how model performance scales with interaction turns, token usage, and inference cost across all datasets.
Datasets: MIMIC, 10-K, GLOBEM
Novelty vs Accuracy Ranking
Compare the model ranking produced by Bradley-Terry pairwise comparisons with the ranking produced by traditional accuracy.
Datasets: MIMIC, 10-K, GLOBEM
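As a rough illustration of how a Bradley-Terry ranking can be derived from pairwise outcomes and set beside an accuracy ranking, here is a minimal sketch. It assumes a win-count matrix is available and fits strengths with the standard minorization-maximization (MM) update. The function names, the win counts, and the accuracy values are all hypothetical and are not taken from this page's data or pipeline.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1. Assumes every model has at
    least one win (otherwise its strength collapses to 0).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (comparisons between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def rank_order(scores):
    """Return item indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical data: three models, ten pairwise comparisons per pair.
example_wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
example_accuracy = [0.71, 0.78, 0.64]  # made-up accuracy values

strengths = bradley_terry(example_wins)
pairwise_ranking = rank_order(strengths)
accuracy_ranking = rank_order(example_accuracy)
```

With these made-up inputs the two rankings disagree on the top model, which is the kind of divergence this comparison is meant to surface.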
Turn Count Distribution
Analyze the distribution of interaction turns across different models and datasets.
Datasets: MIMIC, 10-K, GLOBEM
Error Type Analysis
Breakdown of error types encountered during agent interactions, grouped by main categories.
FINISH Token Probing
Analyze the average log probability of FINISH messages by conversation turn and by progress through the conversation.
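One way such an average could be computed is sketched below. It assumes that a FINISH message is scored at every turn and that per-token log probabilities are available for it. The data layout and function names are hypothetical, since the page does not describe the actual probing setup.

```python
from collections import defaultdict


def mean_logprob(token_logprobs):
    """Average log probability over the tokens of one FINISH message."""
    return sum(token_logprobs) / len(token_logprobs)


def average_by_turn(conversations):
    """Average the per-message mean log probability at each turn index.

    conversations: list of conversations; each conversation is a list of
    turns; each turn is the list of token log probabilities for the FINISH
    message scored at that turn (an assumed layout).
    """
    buckets = defaultdict(list)
    for conversation in conversations:
        for turn, token_logprobs in enumerate(conversation, start=1):
            buckets[turn].append(mean_logprob(token_logprobs))
    return {turn: sum(v) / len(v) for turn, v in sorted(buckets.items())}


# Hypothetical log probabilities: two conversations of different lengths.
example = [
    [[-4.0, -2.0], [-1.0, -1.0]],
    [[-5.0, -3.0]],
]
per_turn = average_by_turn(example)
```

Grouping by progress instead of turn would only change the bucket key, for example to the turn index divided by the conversation length.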