Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark Paper • 2606.18648 • Published 8 days ago • 14
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks Paper • 2606.12871 • Published 14 days ago • 10
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 4 days ago • 85