Reinforcement Learning via Value Gradient Flow

📰 ArXiv cs.AI

arXiv:2604.14265v1 Announce Type: cross

Abstract: We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly con
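To make the rejection-sampling baseline mentioned in the abstract concrete, here is a minimal sketch of best-of-N selection, one common form of what is called rejection sampling in this setting. Whether the paper means exactly this variant is an assumption, and this is not the paper's own method. The names `best_of_n`, `sample_reference`, and `value_fn`, along with the toy Gaussian reference and quadratic value function, are hypothetical and chosen only for illustration.

```python
import random


def best_of_n(sample_reference, value_fn, n, rng):
    """Draw n candidates from the reference distribution and keep the one
    with the highest estimated value. Every returned action is a sample the
    reference could have produced, so the result stays in-distribution."""
    candidates = [sample_reference(rng) for _ in range(n)]
    return max(candidates, key=value_fn)


# Toy setup (assumed, not from the paper): a 1-D action space.
def sample_reference(rng):
    # Reference policy: standard normal over actions.
    return rng.gauss(0.0, 1.0)


def value_fn(action):
    # Hypothetical value estimate that peaks at action = 2.0.
    return -(action - 2.0) ** 2


rng = random.Random(0)
num_trials = 1000

# Average value of actions picked by best-of-8 versus plain reference samples.
picked = [best_of_n(sample_reference, value_fn, 8, rng) for _ in range(num_trials)]
baseline = [sample_reference(rng) for _ in range(num_trials)]

mean_picked_value = sum(value_fn(a) for a in picked) / num_trials
mean_baseline_value = sum(value_fn(a) for a in baseline) / num_trials
```

In this toy setup the selected actions score higher on average than raw reference samples, but they can never leave the region the reference policy actually covers, which is the sense in which sample-and-select schemes stay anchored to the reference distribution.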

Published 17 Apr 2026
