📈 Scaling Analysis
Explore how model performance scales with interaction turns, token usage, and inference cost.
Datasets: MIMIC, 10-K, GLOBEM
🏆 Ranking Comparison
Compare the model ranking derived from Bradley-Terry pairwise comparisons against the ranking by accuracy.
Datasets: MIMIC, 10-K, GLOBEM
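As a rough illustration of how a Bradley-Terry ranking can be obtained from pairwise outcomes, the sketch below fits strengths with the standard minorization-maximization update. The win matrix, the function name `bradley_terry`, and the tie-free setup are assumptions made for the example; the page does not say how its rankings are actually computed.

```python
def bradley_terry(wins, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from a square win-count matrix.

    wins[i][j] is the number of times model i beat model j (hypothetical data).
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                # Skip self-pairs and pairs that were never compared.
                if j != i and games > 0:
                    denom += games / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        if total == 0:
            return p
        new_p = [x / total for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical head-to-head counts for three models, 10 comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(wins)
# Model indices ordered from strongest to weakest.
bt_ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

The resulting order can then be set side by side with an ordering of the same models by accuracy to see where the two rankings disagree.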
🔄 Turn Distribution
Analyze the distribution of interaction turns across different models and datasets.
Datasets: MIMIC, 10-K, GLOBEM
⚠️ Error Analysis
Break down the error types encountered during agent interactions, grouped by main category.
🔍 Probing Results
Analyze the average log probability of FINISH messages across conversation turns and progress.
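A minimal sketch of the per-turn aggregation this implies, assuming each probe yields a (turn index, log probability of the FINISH message) pair. The record layout and the name `mean_logprob_by_turn` are hypothetical, and the values are made up for illustration.

```python
from collections import defaultdict


def mean_logprob_by_turn(records):
    """Average FINISH log probability per turn.

    records: iterable of (turn, logprob) pairs (assumed layout).
    Returns a dict mapping each turn to its mean log probability,
    with turns in ascending order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for turn, logprob in records:
        sums[turn] += logprob
        counts[turn] += 1
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}


# Made-up probe results: two conversations probed at turn 1, one at turn 2.
records = [(1, -4.0), (1, -2.0), (2, -1.0)]
per_turn = mean_logprob_by_turn(records)
```

Averaging over progress instead of raw turn index would follow the same pattern, with the turn replaced by a normalized position such as turn divided by conversation length.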