Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
📰 ArXiv cs.AI
arXiv:2603.07084v2. Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates