ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

📰 ArXiv cs.AI

arXiv:2510.12047v4

Abstract: Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while evaluation filters out inputs that violate them. As a result, generated code may achieve high pass@k scores while failing to enforce the preconditions that the task actually requires. To address this gap, we introd

Published 21 Apr 2026
