DDR-Bench | Deep Data Research Benchmark

Scaling Analysis

Explore how model performance scales with interaction turns, token usage, and inference cost across all datasets.

Compare the model ranking produced by Bradley-Terry pairwise comparisons against the traditional accuracy-based ranking.
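For readers unfamiliar with Bradley-Terry ranking, the following is a minimal sketch of how strengths can be fitted from pairwise win counts, using the standard minorization-maximization (MM) update. The function name `bradley_terry`, the model labels, and the win counts are hypothetical and for illustration only; the page does not state how DDR-Bench fits its ranking, so this should not be read as the benchmark's own procedure.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the MM update.

    wins[(a, b)] is the number of times model a beat model b.
    Returns a dict mapping each model to a strength; strengths sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)  # total wins per model
    games = defaultdict(float)       # comparisons per unordered pair, stored both ways
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[(a, b)] += w
        games[(b, a)] += w

    # Start from uniform strengths.
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games.get((i, j), 0.0) / (p[i] + p[j])
                for j in models
                if j != i and games.get((i, j), 0.0) > 0
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        # Normalize so the strengths sum to 1.
        norm = sum(new_p.values())
        new_p = {m: v / norm for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in models)
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical win counts among three placeholder models (10 comparisons per pair).
example_wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = bradley_terry(example_wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), so the ranking reflects who beat whom rather than only each model's standalone accuracy. That is why the two rankings can differ.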

Analyze the distribution of interaction turns across different models and datasets.

Break down the error types encountered during agent interactions, grouped by main category.

Analyze the average log probability of FINISH messages across conversation turns and progress.