This Simple Optimizer Is Revolutionizing How We Train AI [Muon]
The Muon optimizer has shown strong results in speeding up machine learning model training, often outperforming the widely used AdamW optimizer. In this video, we cover the basic idea behind Muon and discuss recent improvements that let it scale to large-scale LLM training.
Chapters (10)
0:00 Why Muon?
0:36 Reviewing Adam
2:13 Linear layer
4:24 Solving orthogonalization with SVD
6:28 Newton-Schulz iteration - Odd polynomial matrix
8:11 Newton-Schulz iteration - Example
10:35 The Muon optimizer
11:49 The exploding attention logit crisis
15:13 MuonClip: Extending QK-clip to Multi-head Latent Attention (MLA)
17:24 Results of MuonClip
DeepCamp AI