- PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
  Paper • 2606.22388 • Published • 78
- DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
  Paper • 2606.12871 • Published • 8
- Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
  Paper • 2606.18648 • Published • 12
Octavi Grau
octavigrau
AI & ML interests
None yet
Recent Activity
- updated a collection about 13 hours ago: awesome-agentic-benchmarks
- upvoted a paper about 13 hours ago: Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
- updated a collection about 13 hours ago: awesome-agentic-benchmarks