🔬 Research Benchmark

DDR-Bench

Deep Data Research Agent Benchmark for Large Language Models

A comprehensive evaluation framework measuring AI agents' ability to conduct deep, iterative data exploration across medical records (MIMIC), financial filings (10-K), and behavioral data (GLOBEM).

22+ Models Evaluated
3 Diverse Datasets
5 Analysis Dimensions

Scaling Analysis

Explore how model performance scales with interaction turns, token usage, and inference cost across all datasets.

Charts available for: MIMIC, 10-K, GLOBEM

Novelty vs Accuracy Ranking

Compare each model's novelty ranking, derived from Bradley-Terry pairwise comparisons, against its traditional accuracy ranking.

Charts available for: MIMIC, 10-K, GLOBEM
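
As a rough illustration of how a Bradley-Terry ranking can be derived from pairwise outcomes, the sketch below fits model strengths with the standard minorization-maximization update. The page does not describe the benchmark's actual fitting procedure, so the function names (`bradley_terry`, `rank_models`), the `wins` input format, and the example counts are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(wins, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of comparisons in which model a beat model b.
    Hypothetical input format; assumes every model has at least one win.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)   # wins per model
    games = defaultdict(float)        # comparisons per unordered pair
    for (a, b), n in wins.items():
        total_wins[a] += n
        games[frozenset((a, b))] += n

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = games.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = total_wins[i] / denom if denom else strength[i]
        # Strengths are only defined up to scale, so normalize to mean 1.
        total = sum(new.values())
        new = {m: s * len(models) / total for m, s in new.items()}
        delta = max(abs(new[m] - strength[m]) for m in models)
        strength = new
        if delta < tol:
            break
    return strength


def rank_models(wins):
    """Return model names ordered from strongest to weakest."""
    strength = bradley_terry(wins)
    return sorted(strength, key=strength.get, reverse=True)


# Made-up pairwise novelty judgments between three placeholder models.
example_wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
print(rank_models(example_wins))
```

The resulting order can then be set side by side with an accuracy-sorted list to see where the two rankings disagree.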

Turn Count Distribution

Analyze the distribution of interaction turns across different models and datasets.

Charts available for: MIMIC, 10-K, GLOBEM

Error Type Analysis

A breakdown of the error types encountered during agent interactions, grouped by main category.

FINISH Token Probing

Analyze the average log probability of FINISH messages across conversation turns and progress.

Charts available for: MIMIC, GLOBEM, 10-K
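
One plausible way to compute an average FINISH log probability per turn is sketched below. It assumes that a candidate FINISH message is scored at every turn and that per-token log probabilities are available. The page does not specify the probing setup, so the input layout and the name `mean_finish_logprob_by_turn` are hypothetical.

```python
from collections import defaultdict


def mean_finish_logprob_by_turn(conversations):
    """Average the per-turn FINISH log probability across conversations.

    conversations: list of conversations; each conversation is a list of
    turns; each turn is a list of token log probabilities for the FINISH
    message scored at that turn (hypothetical layout). Empty turns are
    skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for conversation in conversations:
        for turn, token_logprobs in enumerate(conversation, start=1):
            if not token_logprobs:
                continue
            # Length-normalize so long and short messages are comparable.
            sums[turn] += sum(token_logprobs) / len(token_logprobs)
            counts[turn] += 1
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}


# Made-up log probabilities for two short conversations.
example = [
    [[-4.0, -2.0], [-1.0, -1.0]],
    [[-2.0], [], [-0.5]],
]
print(mean_finish_logprob_by_turn(example))
```

Plotting the returned values against turn index gives a curve of the kind this section describes; a progress-based view would bucket turns by their fraction of the conversation length instead.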