EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

📰 ArXiv cs.AI

arXiv:2604.19485v1 Announce Type: cross

Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that
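The design choice the abstract describes can be made concrete by contrasting the two advantage baselines. The sketch below is illustrative only: the function names `grpo_advantages` and `critic_advantages` are assumptions for exposition, and EVPO's own adaptive mechanism is not reproduced here.

```python
# Sketch of the two baseline choices for policy-gradient advantages.
# Names are illustrative, not the paper's API.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Critic-free (GRPO-style): normalize rewards within a group of
    rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def critic_advantages(rewards, value_estimates):
    """Critic-based (PPO-style, single-step view): subtract a learned
    value baseline. Any error in the critic's estimates is carried
    directly into every advantage, which is the noise-injection risk
    the abstract points to in sparse-reward settings."""
    return [r - v for r, v in zip(rewards, value_estimates)]

# A sparse-reward group: one success among four rollouts.
rewards = [0.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
# A miscalibrated critic (here a uniform +0.3 bias) shifts all advantages:
print(critic_advantages(rewards, [0.3, 0.3, 0.3, 0.3]))
```

With a well-calibrated critic the second estimator has lower variance, which is the classical argument for PPO; the abstract's point is that when the critic itself is noisy, the critic-free group baseline can be preferable.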

Published 22 Apr 2026