An Executable Benchmarking Suite for Tool-Using Agents

arXiv cs.AI

arXiv:2605.11030v1 (announce type: cross)

Abstract: Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verificat

Published 13 May 2026
