General Agent Evaluation

📰 ArXiv cs.AI

arXiv:2602.22953v2. Abstract: General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harne

Published 12 May 2026