This Simple Optimizer Is Revolutionizing How We Train AI [Muon]

Jia-Bin Huang · Beginner · 🧠 Large Language Models · 5mo ago
The Muon optimizer has demonstrated remarkable performance in accelerating machine learning model training, often outperforming the widely used AdamW optimizer. In this video, we cover the basic concept of how Muon works and discuss recent improvements that make it scalable to large-scale LLM training.
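At a high level, Muon differs from AdamW by treating each linear layer's update as a matrix and orthogonalizing it (replacing its singular values with 1) before applying it, as discussed in the SVD chapter below. The sketch here is illustrative only: the function names, hyperparameters, and plain-momentum form are assumptions, not the reference implementation.

```python
import numpy as np

def orthogonalize_svd(G):
    # Nearest "orthogonal" matrix to G: if G = U S V^T, return U V^T,
    # i.e. G with all singular values replaced by 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # Hypothetical minimal Muon-style update for one weight matrix:
    # accumulate momentum, orthogonalize the momentum buffer, step.
    buf = momentum * buf + grad
    W = W - lr * orthogonalize_svd(buf)
    return W, buf
```

In practice a full SVD per step is too expensive for large layers, which is why the video then turns to the Newton-Schulz iteration as a cheap approximation.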

Chapters (10)

0:00 Why Muon?
0:36 Reviewing Adam
2:13 Linear layer
4:24 Solving orthogonalization with SVD
6:28 Newton-Schulz iteration - Odd polynomial matrix
8:11 Newton-Schulz iteration - Example
10:35 The Muon optimizer
11:49 The exploding attention logit crisis
15:13 MuonClip: Extending QK-clip to Multi-head Latent Attention (MLA)
17:24 Results of MuonClip
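The Newton-Schulz chapters above cover how Muon avoids a full SVD: repeatedly apply an odd matrix polynomial that pushes every singular value toward 1. A rough NumPy sketch of that idea follows; the quintic coefficients come from the open-source Muon implementation, but treat the exact constants, step count, and tolerances as assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Odd-polynomial coefficients p(x) = a*x + b*x^3 + c*x^5,
    # as used in the open-source Muon implementation (assumption).
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so all singular values are at most 1 (Frobenius norm).
    X = G / (np.linalg.norm(G) + 1e-7)
    # Work with the wide orientation so X @ X.T is the smaller product.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        # Equivalent to applying p(x) to each singular value of X.
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a few iterations the singular values land near 1 (not exactly 1, which is reportedly good enough for Muon), at the cost of a handful of matrix multiplies instead of an SVD.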