Optimizers - EXPLAINED!

CodeEmporium · Beginner · 📄 Research Papers Explained · 6y ago

Key Takeaways

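The video walks through the optimizers used to train neural networks, from gradient descent, stochastic and mini-batch gradient descent, and momentum to the adaptive methods Adagrad, Adadelta, Adam, and Nadam, and links to resources for further learning.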

Full Transcript

You got this. Believe in yourself, Kevin. Easy, easy... almost there.

Let's talk about optimizers. Optimizers define how neural networks learn. They find the values of parameters such that a loss function is at its lowest. Keep in mind that these optimizers don't know the terrain of the loss, so they need to find the bottom of a canyon while blindfolded, essentially.

Let's start with the one, the only: gradient descent. Hop, hippity hop, hop. Wait, too far. Huh, too far again. Oh, come on! The original optimizer, gradient descent, involves taking small steps iteratively until we reach the correct weights, theta. The problem here is that the weights are only updated once after seeing the entire data set, so the gradient is typically large. Theta can only make larger jumps, and it may just hover over its optimal value without actually being able to reach it. The solution to this? Update the parameters more frequently, like in the case of stochastic gradient descent.

Stochastic gradient descent updates the weights after seeing each data point instead of the entire data set. But there's a problem here too. You see... wait, that example was weird. No. Okay, easy. Wait, wait, easy. Nope, nope, no, no, no. Oh, hell no! This may make very noisy jumps that go away from the optimal values, since it's influenced by every single sample. Because of this, we use mini-batch gradient descent as a compromise, updating the parameters only after a few samples.

Another way to decrease the noise of stochastic gradient descent is to add the concept of momentum. The parameters of a model may have a tendency to change in one direction, typically if examples follow a similar pattern. With this momentum, the model can learn faster by paying little attention to the few examples that throw it off from time to time. But you might see a problem here. Bigger, bigger, bigger... Choosing to blindly ignore samples simply because they aren't typical may be a costly mistake, reflected in our loss.

Adding an acceleration term, though, helps. Your model is training and gaining momentum; the weight updates are becoming larger. It finds an odd sample. Because of momentum, it thinks very little of it, but discarding it leads to a loss decrease that wasn't as drastic as you thought. This is where we decelerate our weight updates. The weight updates become smaller again, allowing future samples to fine-tune the current model. We go big or we go home. Wait, the loss didn't decrease as much as I thought it would. Slow down. Haha, not too shabby.

But this is the loss function for a single predictor. With multiple predictors, the learning rate so far is fixed for every parameter. Adagrad allows an adaptive learning rate for every parameter.

I'm on the 3D surface plotter at academo.org, a cool site to plot out equations. This is a plot of z = x² - y². z is the value of the loss, and this loss reaches its lowest values as y tends towards negative or positive infinity. If I were to start somewhere up here on the saddle point, my optimizer would go down in one direction of the y axis, like how my cursor is moving. With an adaptive learning rate, I have more degrees of freedom to increase my learning rate in the y direction and decrease it along the x direction. In fact, this is what we see here: adaptive learning rate optimizers are able to learn more along one direction than another, hence they can traverse this kind of terrain.

In the optimizer update, the capital G_{t,ii} is the sum of squares of the gradients with respect to the parameter theta_i up to that point. The problem with this is that the G term is monotonically increasing over iterations, so the learning rate will decay to a point where the parameter will no longer update and there's no learning. We can actually see this effect here for the Adagrad point: as the iterations go on, it learns slower and slower even though the optimal trajectory is quite clear.

Adadelta to the rescue. It reduces the influence of past squared gradients by introducing a gamma weight to all of those gradients. This reduces their effect by an exponential factor, so the denominator doesn't explode, and this prevents the learning rate from tanking to zero.

Cool, so we actually have learning rate updates for every single parameter. Well, if this is the case, why not go even further and have momentum updates for every parameter? This is what Adam does. The only change you need to make from Adadelta to Adam is to add the expected value of past gradients. What does it mean? It means that we are slow initially but pick up speed over time, and this is intuitively similar to momentum, as you build up momentum over time. In this way, Adam can take different size steps for different parameters, and with momentum for every parameter, it can also lead to faster convergence. Because of its speed and accuracy, I think you can see why Adam can be used as a de facto optimizer for many projects.

Of course, we can go even further, introducing acceleration in Adam: Nadam. And I could go on. It might seem like a ton of optimizers are out there, and there are, but we've literally just added a term to each algorithm, gradually making them capable of more things.

But with all of these optimizers, which is the best one? Well, that depends on the kind of problem that you're trying to solve: instance segmentation, semantic analysis, machine translation, image generation. There are so many problems out there with different types of losses. The best algorithm is the one that can traverse the loss for that problem pretty well. It's more empirical than mathematical.

I hope this video helps you better understand the role of these optimizers and clears some things up too. If you liked the video, hit that like, click subscribe, and also watch some of my other videos on the channel. You won't regret it. Take care.
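The update rules described in the transcript can be sketched in plain Python. This is a minimal illustration rather than code from the video: the toy loss (an elongated bowl, not the saddle shown on screen), the hyperparameter values, and every function name are assumptions made for the example. The `decaying_average` function implements only the exponentially weighted denominator that the transcript attributes to Adadelta; the full Adadelta method additionally replaces the fixed learning rate with a running RMS of past updates. The Adam sketch includes the bias-correction terms from the Adam paper.

```python
import math

# Toy loss, assumed for illustration only: a bowl that is steep in y, shallow in x.
def loss(theta):
    x, y = theta
    return x * x + 10.0 * y * y

def grad(theta):
    x, y = theta
    return [2.0 * x, 20.0 * y]

def gradient_descent(theta, lr=0.01, steps=500):
    """Plain gradient descent: step against the gradient with one fixed rate."""
    theta = list(theta)
    for _ in range(steps):
        theta = [p - lr * g for p, g in zip(theta, grad(theta))]
    return theta

def momentum(theta, lr=0.01, gamma=0.9, steps=500):
    """Keep a velocity that accumulates past gradients, then step along it."""
    theta, v = list(theta), [0.0] * len(theta)
    for _ in range(steps):
        v = [gamma * vi + lr * g for vi, g in zip(v, grad(theta))]
        theta = [p - vi for p, vi in zip(theta, v)]
    return theta

def adagrad(theta, lr=0.1, eps=1e-8, steps=500):
    """Per-parameter rate lr / sqrt(G), where G sums all past squared gradients."""
    theta, G, rates = list(theta), [0.0] * len(theta), []
    for _ in range(steps):
        g = grad(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        rate = [lr / math.sqrt(Gi + eps) for Gi in G]
        rates.append(rate)  # recorded so the decay of the rate can be inspected
        theta = [p - r * gi for p, r, gi in zip(theta, rate, g)]
    return theta, rates

def decaying_average(theta, lr=0.01, gamma=0.9, eps=1e-8, steps=500):
    """Like adagrad, but G becomes a decaying average of squared gradients."""
    theta, E, rates = list(theta), [0.0] * len(theta), []
    for _ in range(steps):
        g = grad(theta)
        E = [gamma * Ei + (1.0 - gamma) * gi * gi for Ei, gi in zip(E, g)]
        rate = [lr / math.sqrt(Ei + eps) for Ei in E]
        rates.append(rate)
        theta = [p - r * gi for p, r, gi in zip(theta, rate, g)]
    return theta, rates

def adam(theta, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Decaying averages of gradients (m) and squared gradients (v), bias-corrected."""
    theta = list(theta)
    m, v = [0.0] * len(theta), [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [p - lr * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

if __name__ == "__main__":
    start = [1.0, 1.0]
    print("start loss:      ", loss(start))
    print("gradient descent:", loss(gradient_descent(start)))
    print("momentum:        ", loss(momentum(start)))
    print("adagrad:         ", loss(adagrad(start)[0]))
    print("decaying average:", loss(decaying_average(start)[0]))
    print("adam:            ", loss(adam(start)))
```

Inspecting the `rates` returned by `adagrad` shows the per-parameter learning rate only ever shrinking, which is the stalling effect described for the Adagrad point. The `rates` from `decaying_average` can grow again once recent gradients become small, because old squared gradients are gradually forgotten.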

Original Description

From Gradient Descent to Adam. Here are some optimizers you should know. And an easy way to remember them. SUBSCRIBE to my channel for more good stuff!

REFERENCES
[1] Have fun plotting equations: https://academo.org/demos/3d-surface-plotter
[2] Original paper on the Adam optimizer: https://arxiv.org/pdf/1412.6980.pdf
[3] Blog on types of optimizers: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
[4] Blog on optimizing gradient descent: https://ruder.io/optimizing-gradient-descent/index.html#adagrad
[5] GitHub gist of code for rendering an animation of a math function: https://gist.github.com/ajhalthor/33533b4673ad6955e08a4005850b512f
[6] Another blog to quench your thirst for knowledge on optimizers cuz the other links weren't good enough: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/


This video explains the basics of optimizers in machine learning, from Gradient Descent to Adam, and provides resources for further learning. It presents each optimizer as the previous one with a single term added, which gives an easy way to remember them.

Key Takeaways
  1. Learn about Gradient Descent
  2. Understand Adam Optimizer
  3. Read research papers on optimizers
  4. Implement optimizers in projects
  5. Use resources for further learning
💡 The Adam optimizer is a popular and widely-used optimizer in deep learning, and understanding its basics can help improve model performance.
