The Amazing Agent Race: Strong Tool Users, Weak Navigators

📰 ArXiv cs.AI

arXiv:2604.10261v1. Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows that 55% to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execut
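
To make the contrast between a linear chain and a fork-merge DAG leg concrete, here is a minimal Python sketch. It is not the AAR data format or the authors' code: the step names, the dependency-dictionary representation, and the helpers `is_linear_chain` and `execution_order` are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical legs: each step maps to the set of steps it depends on.
linear_leg = {"a": set(), "b": {"a"}, "c": {"b"}}
fork_merge_leg = {
    "start": set(),
    "left": {"start"},    # fork: two branches depend on the same step
    "right": {"start"},
    "merge": {"left", "right"},  # merge: one step needs both branches
}

def is_linear_chain(deps):
    """Return True if an acyclic leg is a single chain.

    Assumes `deps` is acyclic and every prerequisite is also a key.
    A chain has one root, and every step has at most one prerequisite
    and at most one dependent.
    """
    dependents = {step: 0 for step in deps}
    roots = 0
    for prereqs in deps.values():
        if len(prereqs) > 1:
            return False  # a merge point
        if not prereqs:
            roots += 1
        for p in prereqs:
            dependents[p] += 1
    # More than one dependent on any step means a fork.
    return roots == 1 and all(n <= 1 for n in dependents.values())

def execution_order(deps):
    """Return one valid order in which the steps could be executed."""
    return list(TopologicalSorter(deps).static_order())
```

In this sketch, `is_linear_chain(linear_leg)` holds while `is_linear_chain(fork_merge_leg)` does not, and any order returned for the fork-merge leg places `start` first and `merge` last, with the two branches in between in either order.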

Published 14 Apr 2026
