BIG Mistake in Adam | Adam vs AdamW
In this video we clearly explain the difference between the Adam and AdamW optimizers used in deep learning and machine learning.
Many people use Adam without understanding how weight decay and L2 regularization behave inside adaptive optimizers. This video explains:
• Why momentum uses mean of gradients
• Why RMSProp uses squared gradients
• What weight decay actually means
• How L2 regularization changes the gradient
• Why Adam mixes weight decay incorrectly
• How AdamW fixes the problem with decoupled weight decay (a short code sketch follows this list)
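The contrast the video walks through can be summarized in two update rules. Below is a minimal Python/NumPy sketch (not taken from the video; function names and hyperparameter values are illustrative) showing Adam with L2 regularization folded into the gradient versus AdamW with decoupled weight decay:

# Minimal sketch: one parameter-update step for Adam with L2 regularization
# folded into the gradient, versus AdamW where weight decay is applied
# directly to the weights (decoupled). Hyperparameters lr, beta1, beta2,
# eps, weight_decay are the standard ones; values here are only examples.
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    # Adam with L2: the decay term enters the gradient, so it also gets
    # rescaled by the adaptive (squared-gradient) denominator.
    g = grad + weight_decay * w             # L2 regularization changes the gradient
    m = beta1 * m + (1 - beta1) * g         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2      # RMSProp-style running mean of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # AdamW: weight decay is decoupled from the gradient and applied
    # directly to the weights, so it is not scaled by the adaptive term.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

The only difference is where weight_decay appears: inside the gradient (and therefore divided by the adaptive sqrt(v_hat) term) for Adam with L2, versus applied directly to the weights for AdamW.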
This topic is important for anyone working in:
Deep Learning
Machine Learning
DeepCamp AI