An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning

📰 ArXiv cs.AI

arXiv:2605.11006v1 (announce type: cross)

Abstract: Evaluating whether large language models (LLMs) can recover execution-relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test-passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE-Bench); this is useful, but offers limited diagnostic signal about which program semantics a model can recover from source. We i

Published 13 May 2026
