Interactive Evaluation Requires a Design Science

📰 ArXiv cs.AI

arXiv:2605.17829v1

Abstract: AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resultin

Published 19 May 2026
