Gradient Extrapolation-Based Policy Optimization

📰 ArXiv cs.AI

arXiv:2605.06755v1 (cross-listed)

Abstract: Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-upd

Published 11 May 2026
