How Robustly do LLMs Understand Execution Semantics?

📰 ArXiv cs.AI

arXiv:2604.16320v1. Abstract: LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67

Published 21 Apr 2026