PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
arXiv cs.AI
arXiv:2603.11178v3 Announce Type: replace

Abstract: Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$, where $p$ is the student's pass rate on the problem.
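A minimal sketch of the weighting scheme the abstract describes: each problem's contribution to the distillation loss is scaled by $w(p) = p(1-p)$, which peaks at $p = 0.5$ and vanishes at both extremes. The function and variable names here are hypothetical; the abstract does not specify how pass rates are estimated or how weights are normalized, so this assumes pass rates come from sampling the student and uses a simple weighted mean.

```python
import torch

def paced_weights(pass_rates: torch.Tensor) -> torch.Tensor:
    """Weight each problem by w(p) = p * (1 - p).

    Problems the student always solves (p ~ 1) or never solves (p ~ 0)
    get near-zero weight, matching the bell-shaped SNR curve over
    student pass rate that the abstract reports.
    """
    return pass_rates * (1.0 - pass_rates)

# Hypothetical usage: weight per-problem distillation losses by the
# student's empirical pass rate on each problem (e.g., estimated by
# sampling the student several times per problem).
pass_rates = torch.tensor([0.0, 0.1, 0.5, 0.9, 1.0])
per_problem_loss = torch.tensor([2.3, 1.8, 1.1, 0.4, 0.2])  # placeholder values

weights = paced_weights(pass_rates)
# Normalized weighted mean; clamp guards against all-zero weights.
weighted_loss = (weights * per_problem_loss).sum() / weights.sum().clamp_min(1e-8)
```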