Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
📰 ArXiv cs.AI
arXiv:2604.09967v1 Announce Type: cross Abstract: Muon has emerged as a promising optimizer for large-scale foundation-model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduce non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning.
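The abstract describes two ingredients: Newton--Schulz orthogonalization of the gradient matrix (as in Muon) and an Adam-style second-moment estimate. A minimal sketch of how these could compose is below; the quintic coefficients are those from the public Muon reference implementation, and the `muon2_step` function, including where the preconditioning is applied relative to orthogonalization, is a hypothetical illustration, not the paper's actual algorithm.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via Newton--Schulz iteration,
    driving its singular values toward 1."""
    # Quintic coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so all singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon2_step(W, G, v, lr=0.02, beta2=0.999, eps=1e-8, t=1):
    """Hypothetical sketch of one Muon^2-style update: an Adam-style
    elementwise second-moment estimate preconditions the gradient,
    which is then orthogonalized. The ordering here is an assumption."""
    v = beta2 * v + (1 - beta2) * G**2          # second-moment EMA
    v_hat = v / (1 - beta2**t)                  # bias correction
    O = newton_schulz(G / (np.sqrt(v_hat) + eps))
    return W - lr * O, v
```

Run on a random matrix, `newton_schulz` returns a matrix whose singular values cluster near 1, which is the property Muon exploits to equalize update magnitudes across directions.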