Optimizers - EXPLAINED!

CodeEmporium · Beginner · 📄 Research Papers Explained · 6y ago

Skills: ML Maths Basics 80% · Reading ML Papers 70%

Key Takeaways

The video explains various optimizers used in machine learning, including Gradient Descent and Adam, and provides resources for further learning.

Full Transcript

You got this. Believe in yourself, Kevin. Easy, easy... almost there.

Let's talk about optimizers. Optimizers define how neural networks learn. They find the values of parameters such that a loss function is at its lowest. Keep in mind that these optimizers don't know the terrain of the loss, so they essentially need to find the bottom of a canyon while blindfolded.

Let's start with the one, the only: gradient descent. Hop, hippity hop, hop. Wait, too far. Huh, too far again. Oh, come on!

The original optimizer, gradient descent, involves taking small steps iteratively until we reach the correct weights theta. The problem here is that the weights are only updated once after seeing the entire data set, so this gradient is typically large. Theta can only make larger jumps, and it may just hover over its optimal value without actually being able to reach it. The solution to this? Update the parameters more frequently, like in the case of stochastic gradient descent.

Stochastic gradient descent updates the weights after seeing each data point instead of the entire data set. But there's a problem here too. You see... wait, that example was weird. No. Okay, easy. Wait, wait, easy. Nope, nope, no, no, no. Oh, hell no! This may make very noisy jumps that go away from the optimal values. It's influenced by every single sample. Because of this, we use mini-batch gradient descent as a compromise, updating the parameters only after a few samples. Huh.

Another way to decrease the noise of stochastic gradient descent is to add the concept of momentum. The parameters of a model may have a tendency to change in one direction, typically if examples follow a similar pattern. With this momentum, the model can learn faster by paying little attention to the few examples that throw it off from time to time. But you might see a problem here. Bigger, bigger, bigger... [inaudible] Choosing to blindly ignore samples simply because they aren't typical may be a costly mistake, reflected in our loss.

Adding an acceleration term, though, helps. Your model is training, gaining momentum; the weight updates are becoming larger. It finds an odd sample. Because of momentum, it thinks very little of it. But discarding it leads to a loss decrease that wasn't as drastic as you thought. This is where we decelerate our weight updates. The weight updates become smaller again, allowing future samples to fine-tune the current model.

We go big or we go home! Wait, the loss didn't decrease as much as I thought it would. Slow down. Haha, not too shabby.

But this is the loss function for a single predictor. With multiple predictors, note that the learning rate is fixed for every parameter. Adagrad allows an adaptive learning rate for every parameter.

I'm on a 3D surface plotter on academo.org, a cool site to plot out equations. This is a plot of z is equal to x squared minus y squared. z is the value of the loss, and this loss reaches its minimum as y tends towards negative or positive infinity. If I were to start somewhere up here on the saddle point, my optimizer would go down in one direction of the y-axis, like how my cursor is moving. With an adaptive learning rate, I have more degrees of freedom to increase my learning rate in the y direction and decrease it along the x direction. In fact, this is what we see here: adaptive learning rate optimizers are able to learn more along one direction than another, hence they can traverse this kind of terrain.

In the optimizer update, the capital G term, G_{t,ii}, is the sum of squares of the gradients with respect to the theta_i parameter until that point. The problem with this is that the G term is monotonically increasing over iterations, so the learning rate will decay to a point where the parameter will no longer update and there's no learning. We can actually see this effect here for the Adagrad point: as the iterations go on, it learns slower and slower, even though the optimal trajectory is quite clear.

Adadelta to the rescue. It reduces the influence of past squared gradients by introducing a gamma weight to all of those gradients. This reduces their effect by an exponential factor, so the denominator doesn't explode, and this prevents the learning rate from tanking to zero.

Cool, so we actually have learning rate updates for every single parameter. Well, if this is the case, why not go even further and have momentum updates for every parameter? This is what Adam does. The only change you need to make from Adadelta to Adam is to add the expected value of past gradients. What does it mean? It means that we are slow initially but pick up speed over time, and this is intuitively similar to momentum, as you build up momentum over time. In this way, Adam can take different size steps for different parameters, and with momentum for every parameter, it can also lead to faster convergence. Because of its speed and accuracy, I think you can see why Adam can be used as a de facto optimizer for many projects.

Of course, we can go even further, introducing acceleration in Adam: Nadam. And I could go on. It might seem like a ton of optimizers are out there, and there are, but we've literally just added a term to each algorithm, gradually making them capable of more things.

But with all of these optimizers, which is the best one? Well, that depends on the kind of problem that you're trying to solve: instance segmentation, semantic analysis, machine translation, image generation. So many problems out there with different types of losses. The best algorithm is the one that can traverse the loss for that problem pretty well. It's more empirical than mathematical.

I hope this video helps you better understand the role of these optimizers and clear some things up too. If you liked the video, hit that like, click subscribe, and also watch some of my other videos on the channel. You won't regret it. Take care.
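The "just add a term to each algorithm" idea from the transcript can be made concrete with a small sketch. The code below is an illustration only, not the code used in the video: the function names and default hyperparameters are assumptions chosen for readability. Note that `decayed_step` implements only the decaying average of squared gradients that the transcript attributes to Adadelta; full Adadelta also rescales by a running average of past updates, and this simplified form is essentially what is usually called RMSprop. `adam_step` includes the bias correction from the Adam paper linked in the description.

```python
import math

def sgd_step(theta, grad, lr=0.1):
    """Plain gradient step: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]

def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """Add a velocity term so consistent gradients build up speed."""
    velocity = [beta * v + lr * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

def adagrad_step(theta, grad, sq_sum, lr=0.1, eps=1e-8):
    """Per-parameter learning rate: divide by the root of the running
    SUM of squared gradients (the G term). The sum never shrinks, so
    the effective step decays over time."""
    sq_sum = [s + g * g for s, g in zip(sq_sum, grad)]
    theta = [t - lr * g / (math.sqrt(s) + eps)
             for t, g, s in zip(theta, grad, sq_sum)]
    return theta, sq_sum

def decayed_step(theta, grad, sq_avg, lr=0.1, gamma=0.9, eps=1e-8):
    """Replace the sum with an exponentially decaying AVERAGE (gamma),
    so the denominator stays bounded. Simplified: RMSprop-style, not
    full Adadelta (assumption made for brevity)."""
    sq_avg = [gamma * s + (1 - gamma) * g * g for s, g in zip(sq_avg, grad)]
    theta = [t - lr * g / (math.sqrt(s) + eps)
             for t, g, s in zip(theta, grad, sq_avg)]
    return theta, sq_avg

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Decaying average of gradients (m, the momentum-like term) plus
    decaying average of squared gradients (v). t is the 1-based step
    count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

Each function takes the current parameters and gradient as plain lists and returns the updated parameters together with whatever state it carries forward. Feeding a constant gradient into `adagrad_step` and `decayed_step` for a few hundred steps shows the difference the transcript describes: the Adagrad step keeps shrinking, while the decayed version settles near `lr`.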

Original Description

From Gradient Descent to Adam. Here are some optimizers you should know. And an easy way to remember them. SUBSCRIBE to my channel for more good stuff!

REFERENCES
[1] Have fun plotting equations: https://academo.org/demos/3d-surface-plotter
[2] Original paper on the Adam optimizer: https://arxiv.org/pdf/1412.6980.pdf
[3] Blog on types of optimizers: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
[4] Blog on optimizing gradient descent: https://ruder.io/optimizing-gradient-descent/index.html#adagrad
[5] Github gist of code for rendering animation of a math function: https://gist.github.com/ajhalthor/33533b4673ad6955e08a4005850b512f
[6] Another Blog to quench your thirst for knowledge on optimizers cuz the other links weren't good enough: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/

This video explains the basics of optimizers in machine learning, including Gradient Descent and Adam, and provides resources for further learning. It helps viewers understand how to implement these optimizers in their own projects. The video also provides an easy way to remember different optimizers.

Key Takeaways

Learn about Gradient Descent
Understand Adam Optimizer
Read research papers on optimizers
Implement optimizers in projects
Use resources for further learning

💡 Adam is a widely used optimizer in deep learning, and understanding its basics can help improve model performance.
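To connect the takeaways above to something runnable, here is a minimal sketch of mini-batch training with Adam on a toy problem. Everything in it is an assumption made for illustration: the helper names `make_line_data` and `fit_line_with_adam`, the synthetic data (points on the line y = 2x + 1), and the hyperparameters. It is not taken from the video, and in a real project a library optimizer would normally be used instead.

```python
import math
import random

def make_line_data():
    """Synthetic, noise-free points on y = 2x + 1 (made up for this demo)."""
    xs = [i / 10 for i in range(-10, 11)]
    ys = [2.0 * x + 1.0 for x in xs]
    return xs, ys

def fit_line_with_adam(xs, ys, lr=0.05, epochs=300, batch_size=7,
                       beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error with
    mini-batch Adam. Returns the learned (w, b)."""
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    params = [0.0, 0.0]                # [w, b]
    m = [0.0, 0.0]                     # running average of gradients
    v = [0.0, 0.0]                     # running average of squared gradients
    step = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Update after every few samples (mini-batch), not the full set.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            w, b = params
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i] / len(batch)
                grad_b += 2.0 * err / len(batch)
            step += 1
            for k, g in enumerate((grad_w, grad_b)):
                m[k] = beta1 * m[k] + (1 - beta1) * g
                v[k] = beta2 * v[k] + (1 - beta2) * g * g
                m_hat = m[k] / (1 - beta1 ** step)   # bias correction
                v_hat = v[k] / (1 - beta2 ** step)
                params[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]
```

Calling `fit_line_with_adam(*make_line_data())` should return values close to the slope 2 and intercept 1 used to generate the data. Setting `batch_size` to 1 gives stochastic gradient descent style updates, and setting it to the data set size gives full-batch updates.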
