BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
📰 ArXiv cs.AI
arXiv:2604.24955v1 Announce Type: cross Abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all; they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented LLM agent benchmarks.
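To make the auditing idea concrete, here is a minimal sketch of an LLM-as-auditor loop for a single benchmark task. It is not the paper's actual interface; the prompt wording, the `AuditFinding` structure, and the `llm` callable are illustrative assumptions.

```python
# Hypothetical sketch of LLM-based benchmark auditing (not BenchGuard's real API).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AuditFinding:
    category: str   # e.g. "broken_spec", "implicit_assumption", "rigid_evaluator"
    detail: str


AUDIT_PROMPT = """You are auditing an agent benchmark task, not solving it.
Task specification:
{spec}

Evaluation script:
{evaluator}

List concrete problems that could penalize a valid agent solution:
broken or ambiguous specifications, implicit assumptions, and rigid
checks that reject acceptable alternative answers. One finding per line,
formatted as `category: detail`."""


def audit_task(spec: str, evaluator_source: str,
               llm: Callable[[str], str]) -> List[AuditFinding]:
    """Ask a frontier LLM to flag benchmark defects for a single task."""
    raw = llm(AUDIT_PROMPT.format(spec=spec, evaluator=evaluator_source))
    findings = []
    for line in raw.splitlines():
        if ":" in line:
            category, detail = line.split(":", 1)
            findings.append(AuditFinding(category.strip(), detail.strip()))
    return findings
```

Running `audit_task` over every task in a benchmark, with any chat-completion backend wrapped as the `llm` callable, would yield a per-task list of suspected specification and evaluator defects for human review.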