Training-Trajectory-Aware Token Selection

📰 ArXiv cs.AI

arXiv:2601.10348v2

Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gr

Published 23 May 2026