Optimizers - EXPLAINED!
Key Takeaways
The video explains various optimizers used in machine learning, including Gradient Descent and Adam, and provides resources for further learning.
Full Transcript
you got this believe in yourself Kevin easy easy almost there talk about optimizers optimizers define how neural networks learn they find the values of parameters such that a loss function is at its lowest keep in mind that these optimizers don't know the terrain of the loss so they need to find the bottom of a canyon when line folded essentially let's start with the one the only gradient descent hop hippity hop hop wait too far huh too far again oh come on the original optimizer gradient descent involves taking small steps iteratively until we reach the correct weights theta the problem here is the weight is only updated once after seeing the entire data set so this gradient is typically large theta can only make larger jumps and it may just hover over its optimal value without actually being able to reach it the solution to this update the parameters more frequently like in the case of stochastic gradient descent stochastic gradient descent updates the weights after seeing each data point instead of the entire data set but there's a problem here too you see wait that example was weird no okay easy wait wait easy nope nope no no no oh hell no this may make very noisy jumps that go away from the optimal values it's influenced by every single sample because of this we use mini-batch gradient descent as a compromise updating the parameters only after a few samples huh another way to decrease the noise of stochastic gradient descent is to add the concept of momentum the parameters of a model may have a tendency to change in one direction typically if examples follow a similar pattern with this momentum the model can learn faster by paying little attention to the few examples that throw it off time to time but you might see a problem here bigger bigger bigger do they choosing to blindly ignore samples simply because it isn't typical it may be a costly mistake reflecting in our laws adding an acceleration term though helps your model is training gaining momentum the weights are becoming larger it finds an odd sample because of momentum it thinks very little of it though but discarding it leads to a loss decrease that wasn't as drastic as you thought this is where we decelerate our weight updates the weight updates become smaller again allowing future samples to fine-tune the current model we go big or we go home way will meet the lawsuit decrease as much they thought it would slow down haha not too shabby but this is the loss function for a single predictor using multiple predictors that the learning rate is fixed for every parameter autograph allows an adaptive learning rate for every parameter I'm on a 3d surface plot iran octomorg cool site to plot out equations this is a plot of Z is equal to X square minus y square Z is the value of the loss and this loss has a minimum value of y tending towards negative or positive infinity if I were to start somewhere up here on the saddle point my optimizer would go down in one direction of the y axis like how my cursor is moving with an adaptive loss I have more degrees of freedom to increase my learning rate in the Y direction and decrease it along the x direction in fact this is what we see here adaptive learning rate optimizers are able to learn more along one direction than another hence they can traverse this kind of terrain in the optimizer update the capital gtii is the sum of squares of the gradients with respect to theta i parameter until that point the problem with this is that the G term is monotonically increasing over iterations so the learning rate will decay to a point where the parameter will no longer update and there's no learning we can actually see this effect here for the outer grad point as the iterations go on it learns slower and slower even though the optimal trajectory is quite clear add a delta to the rescue it reduces the influence of past squared gradients by introducing a gamma weight to all of those gradients this reduces their effect by an exponential factor so the denominator doesn't explode and this prevents the learning rate from tanking to zero cool so we actually have learning rate updates for every single parameter well if this is the case why not just go even further and have momentum updates for every parameter and this is what Adam does the only change you need to make from out of Delta to Adam is just add the expected value of past gradients what does it mean it means that we are slow initially but pick up speed over time and this is intuitively similar to momentum as you build up momentum over time in this way Adam can take different size steps for different parameters and with momentum for every parameter it can also lead to faster convergence because of its speed and accuracy I think you can see why Adam can be used as a de-facto optimizer for many projects of course we can go even further introducing acceleration in Adam natum and I could go on it might seem like a ton of optimizers are out there and there are but we've literally just added a term to each algorithm gradually making them capable of more things but with all of these optimizers which is the best one well that depends on the kind of problem that you're trying to solve instant segmentation semantic analysis machine translation image generation so many problems out there with different types of losses the best algorithm is the one that can traverse the loss for that problem pretty well it's more empirical than mathematical I hope this video helps you better understand the role of these optimizers and clear some things up too if you liked the video hit that like click subscribe and also watch some of my other videos on the channel you won't regret it take care
Original Description
From Gradient Descent to Adam. Here are some optimizers you should know. And an easy way to remember them.
SUBSCRIBE to my channel for more good stuff!
REFERENCES
[1] Have fun plotting equations : https://academo.org/demos/3d-surface-plotter
[2] Original paper on the Adam optimizer: https://arxiv.org/pdf/1412.6980.pdf
[3] Blog on types of optimizers: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
[4] Blog on optimizing gradient descent: https://ruder.io/optimizing-gradient-descent/index.html#adagrad
[5] Github gist of code for rending animation of a math function: https://gist.github.com/ajhalthor/33533b4673ad6955e08a4005850b512f
[6] Another Blog to quench your thirst for knowledge on optimizers cuz the other links weren't good enough: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from CodeEmporium · CodeEmporium · 40 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
▶
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Linear Regression and Multiple Regression
CodeEmporium
Logistic Regression - THE MATH YOU SHOULD KNOW!
CodeEmporium
Generative Adversarial Networks - FUTURISTIC & FUN AI !
CodeEmporium
Deep Learning on the Cloud - GPU TO LEARN FASTER
CodeEmporium
Deep Mind's AlphaGo Zero - EXPLAINED
CodeEmporium
Mask Region based Convolution Neural Networks - EXPLAINED!
CodeEmporium
Attention in Neural Networks
CodeEmporium
Depthwise Separable Convolution - A FASTER CONVOLUTION!
CodeEmporium
One Neural network learns EVERYTHING ?!
CodeEmporium
Neural Voice Cloning
CodeEmporium
AI creates Image Classifiers…by DRAWING?
CodeEmporium
Unpaired Image-Image Translation using CycleGANs
CodeEmporium
K-Means Clustering - EXPLAINED!
CodeEmporium
Random Forest Classification
CodeEmporium
Data Science in Finance
CodeEmporium
Hypothesis testing with Applications in Data Science
CodeEmporium
A/B Testing - Simply Explained
CodeEmporium
The Kernel Trick - THE MATH YOU SHOULD KNOW!
CodeEmporium
Support Vector Machines - THE MATH YOU SHOULD KNOW
CodeEmporium
Principal Component Analysis (PCA) - THE MATH YOU SHOULD KNOW!
CodeEmporium
History of Calculus - Animated
CodeEmporium
Curiosity in AI
CodeEmporium
DropBlock - A BETTER DROPOUT for Neural Networks
CodeEmporium
Autoencoders - EXPLAINED
CodeEmporium
Recurrent Neural Networks - EXPLAINED!
CodeEmporium
LSTM Networks - EXPLAINED!
CodeEmporium
Building an Image Captioner with Neural Networks
CodeEmporium
10 Machine Learning Questions - ANSWERED!
CodeEmporium
How do neural networks work?
CodeEmporium
Evolution of Face Generation | Evolution of GANs
CodeEmporium
How does Google Translate's AI work?
CodeEmporium
How to keep up with AI research?
CodeEmporium
How does YouTube recommend videos? - AI EXPLAINED!
CodeEmporium
Variational Autoencoders - EXPLAINED!
CodeEmporium
Logistic Regression - VISUALIZED!
CodeEmporium
Gradient Descent - THE MATH YOU SHOULD KNOW
CodeEmporium
Boosting - EXPLAINED!
CodeEmporium
Transformer Neural Networks - EXPLAINED! (Attention is all you need)
CodeEmporium
Loss Functions - EXPLAINED!
CodeEmporium
Optimizers - EXPLAINED!
CodeEmporium
NLP with Neural Networks & Transformers
CodeEmporium
Batch Normalization - EXPLAINED!
CodeEmporium
Activation Functions - EXPLAINED!
CodeEmporium
Data Scientist Answers Interview Questions
CodeEmporium
Why use GPU with Neural Networks?
CodeEmporium
How do GPUs speed up Neural Network training?
CodeEmporium
BERT Neural Network - EXPLAINED!
CodeEmporium
ConvNets Scaled Efficiently
CodeEmporium
Transformer Neural Net makes music! (JukeboxAI)
CodeEmporium
What do filters of Convolution Neural Network learn?
CodeEmporium
We're hosting a Machine Learning Conference!
CodeEmporium
MLconfEU 2020: Machine Learning Conference for Software Engineers
CodeEmporium
Are Neural Networks Intelligent?
CodeEmporium
Time Series Forecasting with Machine Learning
CodeEmporium
Few Shot Learning - EXPLAINED!
CodeEmporium
How does a Data Scientist Fight FRAUD?
CodeEmporium
How would a Data Scientist analyze Customer Churn?
CodeEmporium
Expectations with Machine Learning
CodeEmporium
Why Logistic Regression DOESN'T return probabilities?!
CodeEmporium
How you SHOULD code Machine Learning
CodeEmporium
More on: ML Maths Basics
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way
Medium · AI
ICMI 2026 Reviews [D]
Reddit r/MachineLearning
Workshop submission for main conference paper under review [D]
Reddit r/MachineLearning
Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]
Reddit r/MachineLearning
🎓
Tutor Explanation
DeepCamp AI