How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

📰 ArXiv cs.AI

arXiv:2604.25907v1 Announce Type: cross

Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log marginal likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). […]
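For orientation, here is a minimal sketch of the interpolation, assuming the standard Tsallis $q$-logarithm is applied to the model's success probability; the paper's exact $J_q$ may differ in scaling or normalization:

$$
\ln_q(x) \;=\; \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad \lim_{q \to 1} \ln_q(x) \;=\; \ln x.
$$

At $q{=}0$ this gives $\ln_0(p) = p - 1$, which is linear in the success probability $p$ and hence, up to a constant, an RLVR-style expected-reward objective; at $q{=}1$ it recovers $\ln p$, the log marginal likelihood, so intermediate values of $q$ trade off the two poles.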

Published 29 Apr 2026