CodeClash: Benchmarking Goal-Oriented Software Engineering

📰 ArXiv cs.AI

arXiv:2511.00839v2

Abstract: Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better

Published 14 May 2026
