- PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
  Paper • 2606.22388 • Published • 85
- DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
  Paper • 2606.12871 • Published • 10
- Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
  Paper • 2606.18648 • Published • 14
Octavi Grau
octavigrau
AI & ML interests
None yet
Recent Activity
updated a collection 1 day ago
awesome-agentic-benchmarks
updated a collection 1 day ago
awesome-agentic-benchmarks