🔬 Research Benchmark

DDR-Bench

Deep Data Research Agent Benchmark for Large Language Models

A comprehensive evaluation framework measuring AI agents' ability to conduct deep, iterative data exploration across medical records (MIMIC), financial filings (10-K), and behavioral data (GLOBEM).

22+ Models Evaluated
3 Diverse Datasets
5 Analysis Dimensions

📈 Scaling Analysis

Explore how model performance scales with interaction turns, token usage, and inference cost.

Per-dataset views: MIMIC, 10-K, GLOBEM

🏆 Ranking Comparison

Compare each model's rank under Bradley-Terry pairwise ranking with its rank by accuracy.

Per-dataset views: MIMIC, 10-K, GLOBEM
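
For readers unfamiliar with Bradley-Terry ranking, the sketch below fits Bradley-Terry strengths from pairwise win counts using the standard minorization-maximization update, then compares the resulting order with an accuracy order. It is an illustration only, not the benchmark's own implementation: the model names, win counts, accuracy values, and the `bradley_terry` helper are all hypothetical.

```python
def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of comparisons in which a beat b.
    Returns a dict mapping each model to a strength; strengths sum to 1.
    Assumes every model has at least one win (otherwise its strength is 0).
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = {m: 0.0 for m in models}
    games = {}  # games[(a, b)]: total comparisons between a and b (symmetric)
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[(winner, loser)] = games.get((winner, loser), 0.0) + count
        games[(loser, winner)] = games.get((loser, winner), 0.0) + count

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games.get((i, j), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        updated = {m: s / norm for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical pairwise outcomes between three placeholder models.
wins = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
}
strength = bradley_terry(wins)
bt_ranking = sorted(strength, key=strength.get, reverse=True)

# Hypothetical accuracy scores, used only to show the comparison step.
accuracy = {"model_a": 0.41, "model_b": 0.47, "model_c": 0.30}
acc_ranking = sorted(accuracy, key=accuracy.get, reverse=True)

# Positive values: the model ranks higher under Bradley-Terry than by accuracy.
rank_shift = {m: acc_ranking.index(m) - bt_ranking.index(m) for m in bt_ranking}
```

In this toy data, `rank_shift` is nonzero for the two models whose order differs between the two rankings. That difference is the quantity a ranking comparison of this kind surfaces.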

🔄 Turn Distribution

Analyze the distribution of interaction turns across different models and datasets.

Per-dataset views: MIMIC, 10-K, GLOBEM

⚠️ Error Analysis

A breakdown of the error types encountered during agent interactions, grouped by main category.

🔍 Probing Results

Analyze the average log probability of FINISH messages across conversation turns and progress.

Per-dataset views: MIMIC, GLOBEM, 10-K
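
One way such a per-turn average could be computed is sketched below. It is a guess at the aggregation, not the benchmark's documented procedure. It assumes each record holds a turn index and the token log probabilities a model assigns to a FINISH message at that turn. The record layout, the sample values, and the `mean_finish_logprob_by_turn` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_finish_logprob_by_turn(records):
    """Average the per-token log probability of FINISH messages at each turn.

    records: iterable of (turn_index, token_logprobs) pairs, one per
    conversation and turn. Each message is first reduced to its mean token
    log probability, then messages are averaged within each turn.
    """
    by_turn = defaultdict(list)
    for turn, token_logprobs in records:
        by_turn[turn].append(mean(token_logprobs))
    return {turn: mean(values) for turn, values in sorted(by_turn.items())}


# Hypothetical probe records from two conversations, two turns each.
records = [
    (1, [-4.0, -2.0]),
    (1, [-5.0, -3.0]),
    (2, [-1.0, -1.0]),
    (2, [-2.0, -2.0]),
]
curve = mean_finish_logprob_by_turn(records)
```

Averaging per token before averaging across conversations keeps longer FINISH messages from dominating the curve. Whether the benchmark normalizes this way is an assumption.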