PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

📰 ArXiv cs.AI

arXiv:2603.11178v3 Announce Type: replace

Abstract: Standard LLM distillation treats all training problems equally, wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$, where $p$ is the student's pass rate.
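The weighting rule in the abstract is simple enough to sketch directly. The snippet below (an illustration, not the authors' code) shows how $w(p) = p(1{-}p)$ zeroes out mastered and unsolvable problems while concentrating weight at the frontier of student competence; the function name `paced_weight` is our own.

```python
def paced_weight(pass_rate: float) -> float:
    """w(p) = p * (1 - p): peaks at p = 0.5, vanishes at p = 0 and p = 1."""
    return pass_rate * (1.0 - pass_rate)

# Mastered (p = 1) and unsolvable (p = 0) problems get zero weight;
# problems near the frontier of competence (p ~ 0.5) get the most.
pass_rates = [0.0, 0.1, 0.5, 0.9, 1.0]
weights = [paced_weight(p) for p in pass_rates]
```

These weights could then be normalized and used to sample or rescale per-problem losses during distillation.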

Published 13 Apr 2026