How Transformers Learn to Plan via Multi-Token Prediction
📰 ArXiv cs.AI
arXiv:2604.11912v1 Announce Type: cross Abstract: While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic …
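To make the contrast between the two objectives concrete, here is a minimal sketch (not from the paper) of NTP versus MTP losses over toy logits: NTP scores only the immediately next token, while an MTP head scores k future tokens jointly. The head layout, vocabulary size, and averaging over heads are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single position.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def ntp_loss(head_logits, targets):
    # Next-token prediction: only the first future token contributes.
    return cross_entropy(head_logits[0], targets[0])

def mtp_loss(head_logits, targets):
    # Multi-token prediction: average the loss over k future tokens,
    # each predicted by its own output head (illustrative convention).
    k = len(targets)
    return sum(cross_entropy(head_logits[i], targets[i]) for i in range(k)) / k

# Toy setup: vocabulary of 4 tokens, 3 future positions to predict.
rng = np.random.default_rng(0)
head_logits = rng.standard_normal((3, 4))  # one row of logits per future position
targets = [2, 0, 3]

print(ntp_loss(head_logits, targets))
print(mtp_loss(head_logits, targets))
```

The point of the sketch is that MTP's gradient carries signal about tokens several steps ahead, which is one intuition for why it could help on planning-style tasks where the correct next step depends on the global continuation.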