🔬 Research Benchmark

DDR-Bench

Deep Data Research Agent Benchmark for Large Language Models

A comprehensive evaluation framework measuring AI agents' ability to conduct deep, iterative data exploration across medical records (MIMIC), financial filings (10-K), and behavioral data (GLOBEM).

22+ Models Evaluated
3 Diverse Datasets
5 Analysis Dimensions

📈 Scaling Analysis

Explore how model performance scales with interaction turns, token usage, and inference cost.

Datasets: MIMIC, 10-K, GLOBEM

🏆 Ranking Comparison

Compares each model's Novelty ranking (Bradley-Terry) with its Accuracy ranking. ● = Novelty, ◇ = Accuracy. Purple = proprietary, green = open-source.

Datasets: MIMIC, 10-K, GLOBEM
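As a rough illustration of how a Bradley-Terry ranking can be derived from pairwise comparisons, the sketch below fits strengths with the standard iterative minorization-maximization update. How DDR-Bench collects its pairwise novelty judgments is not stated here, so the input format (`wins[(winner, loser)] = count`), the function name `bradley_terry`, and the model names are all assumptions made for illustration.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins maps (winner, loser) -> number of comparisons won (assumed format).
    Returns a dict of strengths normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    won = defaultdict(float)      # total wins per model
    played = defaultdict(float)   # total comparisons per unordered pair
    for (winner, loser), count in wins.items():
        won[winner] += count
        played[frozenset((winner, loser))] += count

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # Sum over opponents that i was actually compared against.
            denom = sum(
                played[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in played
            )
            updated[i] = won[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {m: v / total for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical comparison counts between three placeholder models.
example_wins = {
    ("model_a", "model_b"): 8, ("model_b", "model_a"): 2,
    ("model_b", "model_c"): 7, ("model_c", "model_b"): 3,
    ("model_a", "model_c"): 9, ("model_c", "model_a"): 1,
}
scores = bradley_terry(example_wins)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting models by fitted strength gives the ranking; only the ordering and the ratios between strengths are meaningful, since the scale is fixed by the normalization.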

🔄 Turn Distribution

Analyze the distribution of interaction turns across different models and datasets.

Datasets: MIMIC, 10-K, GLOBEM

🔬 Entropy Analysis

Scatter plot of Access Entropy versus Coverage for each model. Opacity represents accuracy. Higher entropy means more uniform access; higher coverage means more fields explored.

Models: GPT-5.2, Claude-4.5-Sonnet, Gemini-3-Flash, GLM-4.6, Qwen3-Next-80B-A3B, DeepSeek-V3.2
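One plausible way to compute the two axes is sketched below. It assumes Access Entropy is the Shannon entropy of how often each field is accessed, normalized by the log of the number of available fields, and that Coverage is the fraction of available fields touched at least once. These definitions, the function names, and the field names are assumptions for illustration; the benchmark's exact formulas are not given here.

```python
import math
from collections import Counter


def access_entropy(accesses, all_fields):
    """Normalized Shannon entropy of field accesses, in [0, 1] (assumed definition)."""
    counts = Counter(a for a in accesses if a in all_fields)
    total = sum(counts.values())
    if total == 0 or len(all_fields) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(all_fields))


def coverage(accesses, all_fields):
    """Fraction of available fields accessed at least once (assumed definition)."""
    if not all_fields:
        return 0.0
    return len(set(accesses) & set(all_fields)) / len(all_fields)


# Hypothetical field names and access logs.
fields = {"age", "diagnosis", "labs", "medications"}
uniform_log = ["age", "diagnosis", "labs", "medications"]
focused_log = ["labs", "labs", "labs", "age"]

uniform_point = (access_entropy(uniform_log, fields), coverage(uniform_log, fields))
focused_point = (access_entropy(focused_log, fields), coverage(focused_log, fields))
```

Under these assumptions, an agent that spreads its queries evenly over every field lands in the top-right corner of the plot, while one that repeatedly queries a few fields lands toward the bottom-left.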

⚠️ Error Analysis

Breakdown of error types encountered during agent interactions, grouped by main categories.

🔍 Probing Results

Analyze the average log probability of FINISH messages across conversation turns and progress.

Datasets: MIMIC, GLOBEM, 10-K
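The aggregation behind such a plot can be sketched as follows, assuming that per-token log probabilities of a FINISH message are already available for every turn of every conversation. How the benchmark actually probes the model is not described here, so the input layout, the function names, and the choice of equal-width progress bins are all hypothetical.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


def average_by_turn(conversations):
    """Average FINISH log probability per absolute turn index.

    conversations: list of conversations; each conversation is a list of turns;
    each turn is a list of token log probabilities for FINISH (assumed layout).
    """
    buckets = defaultdict(list)
    for turns in conversations:
        for turn_index, token_logprobs in enumerate(turns):
            if token_logprobs:
                buckets[turn_index].append(_mean(token_logprobs))
    return {t: _mean(v) for t, v in sorted(buckets.items())}


def average_by_progress(conversations, n_bins=4):
    """Average FINISH log probability per relative-progress bin.

    A turn at position i in a conversation of length n falls into bin
    i * n_bins // n, so conversations of different lengths are comparable.
    """
    buckets = defaultdict(list)
    for turns in conversations:
        for turn_index, token_logprobs in enumerate(turns):
            if token_logprobs:
                bin_index = turn_index * n_bins // len(turns)
                buckets[bin_index].append(_mean(token_logprobs))
    return {b: _mean(v) for b, v in sorted(buckets.items())}


# Hypothetical log probabilities for two short conversations.
example = [
    [[-4.0, -2.0], [-1.0, -1.0]],
    [[-2.0], [-0.5, -0.5]],
]
by_turn = average_by_turn(example)
by_progress = average_by_progress(example, n_bins=2)
```

Binning by relative progress rather than absolute turn index is one way to compare conversations of different lengths on a shared axis.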