TIP: Token Importance in On-Policy Distillation

📰 ArXiv cs.AI

arXiv:2604.14084v1 Announce Type: cross

Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high t…
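As context for the abstract, here is a minimal sketch of the general OPD setup it describes: the student is scored token-by-token against a frozen teacher on the student's own rollouts, with per-token student entropy available as an importance signal. The entropy threshold and the reverse-KL objective are illustrative assumptions, not the paper's exact method (the abstract is truncated).

```python
import torch
import torch.nn.functional as F

def opd_token_stats(student_logits, teacher_logits):
    """Per-token reverse-KL distillation loss and student entropy.

    Both logit tensors are (batch, seq_len, vocab), computed over the
    same student-generated rollout tokens.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_p = s_logp.exp()
    # Reverse KL(student || teacher), a common token-level OPD objective.
    kl = (s_p * (s_logp - t_logp)).sum(dim=-1)      # (batch, seq_len)
    # Student entropy per position, one candidate importance signal.
    entropy = -(s_p * s_logp).sum(dim=-1)           # (batch, seq_len)
    return kl, entropy

def weighted_opd_loss(student_logits, teacher_logits, ent_threshold=2.0):
    """Toy token-importance weighting: keep only high-student-entropy
    positions. The threshold is a hypothetical placeholder."""
    kl, entropy = opd_token_stats(student_logits, teacher_logits)
    mask = (entropy > ent_threshold).float()
    return (mask * kl).sum() / mask.sum().clamp(min=1.0)
```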

Published 16 Apr 2026