Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

📰 arXiv cs.AI

On-policy distillation for large language models can be fragile in long-horizon settings, but simple fixes can improve its reliability

Advanced · Published 27 Mar 2026
Action Steps
  1. Identify the failure modes of on-policy distillation in long-horizon settings
  2. Analyze how distribution matching degenerates into a one-token signal, and what that does to the training feedback
  3. Explore simple fixes that improve the reliability of on-policy distillation, such as modifying the teacher feedback mechanism (a sketch of the objective follows this list)
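
To make the mechanism behind these steps concrete, here is a minimal sketch of a dense on-policy distillation objective: a per-token reverse KL between student and teacher next-token distributions, computed over a student-sampled rollout. All function and tensor names, shapes, and the reverse-KL choice are illustrative assumptions for exposition, not the paper's actual formulation.

```python
# Minimal sketch of a dense on-policy distillation (OPD) objective.
# Illustrative only: names, shapes, and the reverse-KL choice are
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def opd_reverse_kl_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher) on a student-sampled rollout.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len], 1.0 on generated tokens, 0.0 on prompt/padding.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Dense signal: every vocabulary entry at every position contributes.
    kl_per_token = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    # Average only over the tokens the student actually generated.
    return (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

The design point worth noticing is that the teacher grades the student's entire next-token distribution at every position; the Key Insight below describes what goes wrong when this dense signal degrades.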
Who Needs to Know This

ML researchers and engineers working on large language models benefit from understanding the limitations of on-policy distillation and how to address them, since these failure modes can directly degrade the performance of distilled models

Key Insight

💡 On-policy distillation can become unreliable in long-horizon settings when its distribution-matching objective collapses to a one-token signal, leaving the student with sparse teacher feedback over long trajectories
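
One way to read the "one-token signal" claim, sketched below under that assumption: if the teacher's feedback is reduced to scoring only the token the student actually sampled at each step, the dense distribution match above degenerates into a single log-probability per position. This is an illustrative reconstruction, not the paper's analysis, and the function name is hypothetical.

```python
# Hedged illustration of the degenerate "one-token" signal: the teacher
# scores only the token the student actually sampled, so the dense
# distribution match collapses to a single log-probability per step.
# Function and tensor names are hypothetical.
import torch
import torch.nn.functional as F

def one_token_signal_loss(teacher_logits: torch.Tensor,
                          sampled_ids: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """teacher_logits: [batch, seq_len, vocab];
    sampled_ids, mask: [batch, seq_len]."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Keep only the teacher log-prob of the sampled token at each step.
    tok_logp = log_p_t.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return -(tok_logp * mask).sum() / mask.sum().clamp(min=1.0)
```

Compared with the dense objective, each position now contributes a single scalar, which is plausibly why long-horizon settings, where credit must propagate across many steps, amplify the fragility the summary describes.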

Share This
💡 On-policy distillation for LLMs can be fragile, but simple fixes can help #LLMs #OPD