TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
📰 ArXiv cs.AI
arXiv:2604.24005v2 Announce Type: cross

Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together wi…
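For context, vanilla OPD is commonly framed as minimizing a per-token KL divergence between the student and teacher distributions along trajectories sampled from the student. The sketch below is an illustrative toy, not the paper's method: the function names, the use of reverse KL, and the random logits are all assumptions for demonstration.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    # KL(student || teacher) at each token position of a
    # student-sampled trajectory (shape: [T, V] -> [T]).
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

rng = np.random.default_rng(0)
T, V = 8, 16  # toy trajectory length and vocabulary size
student = rng.normal(size=(T, V))
teacher = rng.normal(size=(T, V))

kl = reverse_kl_per_token(student, teacher)
print(kl.shape)  # one KL value per token in the trajectory
```

Tracking how this per-token KL evolves across the turns of a multi-turn rollout is one way to observe the kind of trajectory-level drift the abstract alludes to.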