AgentMeter: Evaluating Model-CLI Matching for CLI-Based Local Task-Solving Agents
📰 ArXiv cs.AI
arXiv:2606.21140v1 Announce Type: cross
Abstract: LLM agents increasingly solve local tasks through command-line and CLI-based harness interfaces, including code editing, repository inspection, data analysis, and file workflows. Existing evaluations often emphasize task success, but deployed local agents are not models alone: the CLI mediates prompts, context replay, tool outputs, file access, terminal observations, and stopping behavior. As a result, the same model can produce different success