DDR-Bench | Deep Data Research Benchmark

📈 Scaling Analysis

Explore how model performance scales with interaction turns, token usage, and inference cost.

Compare the model ranking produced by Bradley-Terry pairwise comparisons against the ranking by accuracy.
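As a rough illustration of the comparison described above, the sketch below fits Bradley-Terry strengths from a matrix of pairwise win counts using the standard minorization-maximization update, then ranks models by strength and by accuracy. The win counts, the accuracy values, and the function name `bradley_terry` are all hypothetical placeholders; the benchmark's own fitting procedure may differ.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from wins[i][j] = times i beat j.

    Uses the classic MM update; returned strengths are normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical data: three models, ten pairwise comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
accuracy = [0.62, 0.48, 0.51]  # hypothetical accuracy per model

strengths = bradley_terry(wins)
bt_rank = sorted(range(len(wins)), key=lambda i: -strengths[i])
acc_rank = sorted(range(len(accuracy)), key=lambda i: -accuracy[i])
```

With these made-up numbers the two rankings disagree on the second and third models, which is the kind of divergence the comparison is meant to surface.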

Analyze the distribution of interaction turns across different models and datasets.

Break down the error types encountered during agent interactions, grouped by main category.

Analyze the average log probability of FINISH messages across conversation turns and progress.
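One plausible way to compute such a statistic is sketched below, assuming each FINISH message is recorded with its turn index and per-token log probabilities. The record layout, the function name `mean_finish_logprob_by_turn`, and the choice to average per message before averaging per turn are assumptions made for illustration, not the benchmark's documented method.

```python
from collections import defaultdict


def mean_finish_logprob_by_turn(records):
    """Average per-message mean log probability, grouped by turn.

    records: iterable of (turn, token_logprobs) pairs, one per FINISH message.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for turn, token_logprobs in records:
        if not token_logprobs:
            continue  # skip messages with no token-level scores
        message_mean = sum(token_logprobs) / len(token_logprobs)
        sums[turn] += message_mean
        counts[turn] += 1
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}


# Hypothetical records: (turn index, token log probabilities of the FINISH message)
records = [
    (1, [-2.0, -4.0]),
    (1, [-1.0]),
    (2, [-0.5, -1.5]),
]
by_turn = mean_finish_logprob_by_turn(records)
```

Averaging per message first keeps long FINISH messages from dominating a turn's mean; pooling all tokens in a turn would be an equally reasonable alternative.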