Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
📰 ArXiv cs.AI
arXiv:2605.20555v1 Announce Type: cross Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve a Kullback Leibler (KL) regularization or critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to
DeepCamp AI