ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
📰 ArXiv cs.AI
arXiv:2510.12047v4 (announce type: replace)

Abstract: Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while evaluation filters out inputs that violate them. As a result, generated code may achieve high pass@k scores while failing to enforce the preconditions that the task actually requires. To address this gap, we introd…
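The gap the abstract describes can be illustrated with a small sketch. This is not the benchmark's harness or data: the task (averaging a list), the function names `mean_unchecked`, `mean_with_contract`, and `rejects_with_assertion`, and the choice of `AssertionError` as the signal for a contract violation are all assumptions made for illustration. Both functions behave identically on well-formed inputs, so a test suite that filters out invalid inputs cannot tell them apart; only ill-formed inputs reveal which one enforces the implicit precondition.

```python
def mean_unchecked(values):
    # Hypothetical generated solution: correct on well-formed input,
    # but it never checks the implicit precondition.
    return sum(values) / len(values)


def mean_with_contract(values):
    # Same logic, with the implicit precondition made explicit:
    # a non-empty list of real numbers (bools excluded).
    assert isinstance(values, list), "values must be a list"
    assert len(values) > 0, "values must be non-empty"
    assert all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ), "values must contain only numbers"
    return sum(values) / len(values)


def rejects_with_assertion(func, bad_input):
    """Return True only if func rejects bad_input via AssertionError.

    Any other exception (or a normal return) counts as not enforcing
    the contract. This scoring rule is an assumption of the sketch.
    """
    try:
        func(bad_input)
    except AssertionError:
        return True
    except Exception:
        return False
    return False


if __name__ == "__main__":
    well_formed = [1, 2, 3]
    ill_formed = [[], [True, False]]

    # Indistinguishable on well-formed input.
    print(mean_unchecked(well_formed), mean_with_contract(well_formed))

    # Distinguishable only on contract-violating input.
    for bad in ill_formed:
        print(
            bad,
            rejects_with_assertion(mean_unchecked, bad),
            rejects_with_assertion(mean_with_contract, bad),
        )
```

Note that `mean_unchecked([])` does fail, but with an incidental `ZeroDivisionError` rather than a deliberate check, and `mean_unchecked([True, False])` silently returns a value. The sketch treats neither as enforcing the precondition.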
DeepCamp AI