Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

📰 arXiv cs.AI

On-policy distillation for large language models can be fragile in long-horizon settings, but simple fixes can improve its reliability

Advanced · Published 27 Mar 2026
Action Steps
  1. Identify the failure modes of on-policy distillation in long-horizon settings
  2. Analyze how distribution matching degenerates into a one-token signal, and what that does to the training feedback
  3. Explore simple fixes that improve the reliability of on-policy distillation, such as modifying the teacher feedback mechanism (a sketch of the objective follows this list)
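
To make the mechanism behind these steps concrete, here is a minimal sketch of a dense on-policy distillation objective: a per-token reverse KL between student and teacher next-token distributions, computed over a student-sampled rollout. All function and tensor names, shapes, and the reverse-KL choice are illustrative assumptions for exposition, not the paper's actual formulation.

```python
# Minimal sketch of a dense on-policy distillation (OPD) objective.
# Illustrative only: names, shapes, and the reverse-KL choice are
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def opd_reverse_kl_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher) on a student-sampled rollout.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len], 1.0 on generated tokens, 0.0 on prompt/padding.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Dense signal: every vocabulary entry at every position contributes.
    kl_per_token = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    # Average only over the tokens the student actually generated.
    return (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

The design point worth noticing is that the teacher grades the student's entire next-token distribution at every position; the Key Insight below describes what goes wrong when this dense signal degrades.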
Who Needs to Know This

ML researchers and engineers working on large language models benefit from understanding the limitations of on-policy distillation and how to address them, since these failure modes can directly degrade the performance of distilled models

Key Insight

💡 On-policy distillation can become unreliable in long-horizon settings when its distribution-matching objective collapses to a one-token signal, leaving the student with sparse teacher feedback over long trajectories
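
One way to read the "one-token signal" claim, sketched below under that assumption: if the teacher's feedback is reduced to scoring only the token the student actually sampled at each step, the dense distribution match above degenerates into a single log-probability per position. This is an illustrative reconstruction, not the paper's analysis, and the function name is hypothetical.

```python
# Hedged illustration of the degenerate "one-token" signal: the teacher
# scores only the token the student actually sampled, so the dense
# distribution match collapses to a single log-probability per step.
# Function and tensor names are hypothetical.
import torch
import torch.nn.functional as F

def one_token_signal_loss(teacher_logits: torch.Tensor,
                          sampled_ids: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """teacher_logits: [batch, seq_len, vocab];
    sampled_ids, mask: [batch, seq_len]."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Keep only the teacher log-prob of the sampled token at each step.
    tok_logp = log_p_t.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return -(tok_logp * mask).sum() / mask.sum().clamp(min=1.0)
```

Compared with the dense objective, each position now contributes a single scalar, which is plausibly why long-horizon settings, where credit must propagate across many steps, amplify the fragility the summary describes.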

Share This
💡 On-policy distillation for LLMs can be fragile, but simple fixes can help #LLMs #OPD