Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
📰 ArXiv cs.AI
arXiv:2505.04842v2 Announce Type: replace-cross Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. Yet if parallel test-time compute is already part of the deployment plan, training should be designed to support it. In this work, we propose RL$^V$ that a…
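The verification-based test-time scaling the abstract refers to is commonly realized as best-of-N selection: sample several candidate solutions in parallel and keep the one a verifier scores highest. The sketch below illustrates that selection loop only; the `toy_generate` and `toy_verify` stand-ins are hypothetical placeholders, not the paper's method — in RL$^V$ the reasoner and verifier would be a single trained model.

```python
import random

def best_of_n(generate, verify, prompt, n=8, seed=0):
    """Parallel test-time scaling: sample n candidates for `prompt`
    and return the one the verifier scores highest (best-of-N)."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=verify)

# Hypothetical stand-ins: a real setup would call an LLM reasoner for
# `generate` and a learned value function / verifier for `verify`.
def toy_generate(prompt, rng):
    # Pretend each sample is a candidate numeric answer.
    return rng.randint(0, 100)

def toy_verify(candidate):
    # Pretend the verifier's score peaks at the correct answer, 42.
    return -abs(candidate - 42)

best = best_of_n(toy_generate, toy_verify, "What is 2 * 21?", n=32)
```

More parallel samples raise the chance that at least one candidate is correct; the verifier's job is to pick it out, which is why discarding the value function during training forfeits this scaling axis.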