Reinforcement Learning via Value Gradient Flow
📰 ArXiv cs.AI
arXiv:2604.14265v1 Announce Type: cross Abstract: We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly con