Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
📰 ArXiv cs.AI
arXiv:2604.09967v1 Announce Type: cross Abstract: Muon has emerged as a promising optimizer for large-scale foundation-model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduce non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning.
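The abstract describes two ingredients: Newton--Schulz orthogonalization of the gradient matrix (as in Muon) and an Adam-style second-moment estimate. A minimal sketch of how these could compose is below; the quintic coefficients are those from the public Muon reference implementation, and the `muon2_step` function, including where the preconditioning is applied relative to orthogonalization, is a hypothetical illustration, not the paper's actual algorithm.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via Newton--Schulz iteration,
    driving its singular values toward 1."""
    # Quintic coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so all singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon2_step(W, G, v, lr=0.02, beta2=0.999, eps=1e-8, t=1):
    """Hypothetical sketch of one Muon^2-style update: an Adam-style
    elementwise second-moment estimate preconditions the gradient,
    which is then orthogonalized. The ordering here is an assumption."""
    v = beta2 * v + (1 - beta2) * G**2          # second-moment EMA
    v_hat = v / (1 - beta2**t)                  # bias correction
    O = newton_schulz(G / (np.sqrt(v_hat) + eps))
    return W - lr * O, v
```

Run on a random matrix, `newton_schulz` returns a matrix whose singular values cluster near 1, which is the property Muon exploits to equalize update magnitudes across directions.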