Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
📰 ArXiv cs.AI
On-policy distillation for large language models can be fragile in long-horizon settings, but simple fixes can improve its reliability
Action Steps
- Identify the failure modes of on-policy distillation in long-horizon settings
- Analyze how collapsing the distribution-matching objective to a one-token signal degrades learning
- Explore simple fixes to improve the reliability of on-policy distillation, such as modifying the teacher feedback mechanism
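The one-token collapse above can be made concrete with a toy sketch (all numbers and function names here are illustrative, not taken from the paper): full reverse-KL feedback scores the student's entire next-token distribution against the teacher's, while a collapsed signal only scores the single token the student happened to sample.

```python
import math
import random

def reverse_kl(student_probs, teacher_probs):
    # Dense distribution-matching signal:
    # sum_v p_s(v) * log(p_s(v) / p_t(v)) over the whole vocabulary
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def one_token_signal(teacher_probs, sampled_token):
    # Collapsed feedback: teacher log-prob of only the sampled token
    return -math.log(teacher_probs[sampled_token])

# Hypothetical next-token distributions over a 3-token toy vocabulary
student = [0.7, 0.2, 0.1]
teacher = [0.4, 0.4, 0.2]

rng = random.Random(0)
token = rng.choices(range(3), weights=student)[0]  # on-policy: student samples

print(round(reverse_kl(student, teacher), 4))   # dense signal, whole vocab
print(round(one_token_signal(teacher, token), 4))  # sparse, one token only
```

The dense signal keeps pushing the full student distribution toward the teacher's at every position, whereas the one-token signal carries no information about the unsampled tokens, which is one plausible reason the feedback becomes fragile over long horizons.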
Who Needs to Know This
ML researchers and engineers working on large language models, since these failure modes of on-policy distillation can directly degrade model performance and the fixes are simple to apply
Key Insight
💡 On-policy distillation can be unreliable in long-horizon settings because the distribution-matching objective collapses to a one-token signal at each step
Share This
💡 On-policy distillation for LLMs can be fragile, but simple fixes can help #LLMs #OPD
DeepCamp AI