AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

📰 ArXiv cs.AI

arXiv:2601.11044v3 Announce Type: replace Abstract: Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this

Published 14 Apr 2026

Read full paper → ← Back to Reads