Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

📰 ArXiv cs.AI

The Adam optimizer can beat SGD because its second-moment normalization yields sharper tail bounds on the convergence error, an advantage the classical bounded-variance analysis cannot explain

Advanced · Published 27 Mar 2026
Action Steps
  1. Understand the classical bounded-variance noise model and why it falls short of explaining Adam's empirical advantage over SGD (the two noise assumptions are sketched after this list)
  2. Develop a stopping-time/martingale analysis that separates Adam's high-probability guarantees from SGD's (a minimal sketch of the device also follows the list)
  3. Recognize how second-moment normalization drives Adam's faster convergence (see the update-rule sketch below)
  4. Apply these insights when choosing an optimizer and tuning its hyperparameters
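A minimal sketch of the two noise assumptions in standard stochastic-optimization notation; the symbols g_t, sigma, and p are illustrative and not taken from the paper:

```latex
% Classical bounded-variance model: the stochastic gradient g_t is
% unbiased and its second central moment is uniformly bounded.
\mathbb{E}[g_t \mid x_t] = \nabla f(x_t), \qquad
\mathbb{E}\bigl[\|g_t - \nabla f(x_t)\|^2 \mid x_t\bigr] \le \sigma^2 .

% Heavy-tailed relaxation: only a p-th moment is bounded, with p \in (1, 2].
% For p < 2 the variance may be infinite, and the classical SGD analysis
% no longer applies.
\mathbb{E}\bigl[\|g_t - \nabla f(x_t)\|^p \mid x_t\bigr] \le \sigma^p .
```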
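A hedged sketch of the stopping-time device; the stopping time \tau and threshold G below are illustrative, not the paper's exact construction:

```latex
% Stop the first time the iterates leave a well-behaved region.
\tau = \min\{\, t \ge 0 : \|\nabla f(x_t)\| > G \,\}.

% Up to \tau, the accumulated noise
%   M_t = \sum_{s \le t \wedge \tau} \langle \nabla f(x_s),\, g_s - \nabla f(x_s) \rangle
% is a martingale with controlled increments, so Freedman-type concentration
% yields exponential (sharp) tail bounds on the stopped process, rather than
% only an in-expectation bound.
```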
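For the role of normalization, here is a minimal NumPy sketch contrasting the two update rules; the hyperparameter defaults are the usual illustrative choices, not values from the paper:

```python
import numpy as np

def sgd_step(x, grad, lr=0.01):
    # Plain SGD: the step scales linearly with the raw gradient,
    # so a heavy-tailed outlier gradient produces a huge step.
    return x - lr * grad

def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Second-moment normalization: each coordinate's step is roughly
    # capped near lr, which acts like adaptive clipping and tames
    # heavy-tailed gradient noise.
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

Because the normalized step m_hat / (sqrt(v_hat) + eps) is roughly bounded per coordinate, a single heavy-tailed gradient outlier moves the iterate by at most about lr; this clipping-like effect is one intuition for the sharper tail bounds.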
Who Needs to Know This

Machine learning researchers and engineers: understanding why Adam's normalization gives it a theoretical edge over SGD supports better-informed optimizer choices and more reliable model training

Key Insight

💡 Second-moment normalization in Adam yields sharper tail bounds on the convergence error, which helps explain its faster empirical convergence compared with SGD
