Gradient Descent With Momentum (C2W2L06)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Skills: ML Maths Basics 80% · Supervised Learning 60%

Key Takeaways

This video demonstrates the gradient descent with momentum algorithm, including its implementation and hyperparameter tuning, which speeds up convergence when optimizing cost functions.

Full Transcript

There's an algorithm called momentum, or gradient descent with momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this.

As an example, let's say that you're trying to optimize a cost function which has contours like this, so the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, either batch or mini-batch gradient descent, maybe you end up heading there. But now you're on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that. And then another step, another step, and so on. And you see that gradient descent will take a lot of steps, right? It just slowly oscillates towards the minimum. These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. And so the need to prevent the oscillations from getting too big forces you to use a learning rate that's not itself too large.

Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations, but on the horizontal axis you want faster learning, right? Because you want it to aggressively move from left to right, toward that minimum, toward that red dot.

So here's what you can do if you implement gradient descent with momentum. On each iteration, or more specifically, during iteration t, you would compute the usual derivatives dW, db. I'll omit the superscript square bracket l's, but you compute dW, db on the current mini-batch. And if you're using batch gradient descent, then the current mini-batch would be just your whole batch, and this works as well for batch gradient descent. So if your current mini-batch is your entire training set, this works fine as well.

And then what you do is you compute v_dW to be β v_dW + (1 − β) dW. So this is similar to when we were previously computing v_θ = β v_θ + (1 − β) θ_t, right? So it's computing a moving average of the derivatives for W you're getting. And then you similarly compute v_db = β v_db + (1 − β) db. And then you would update your weights using: W gets updated as W minus the learning rate times, instead of updating it with dW, with the derivative, you update it with v_dW. And similarly, b gets updated as b − α v_db. So what this does is smooth out the steps of gradient descent.

For example, let's say the last few derivatives you computed were this, this, this, this, this. If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives are pointing to the right of the horizontal direction, so the average in the horizontal direction will still be pretty big. So that's why with this algorithm, with a few iterations, you find that gradient descent with momentum ends up eventually just taking steps that have much smaller oscillations in the vertical direction, but are more directed to just moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in its path to the minimum.

One intuition for this momentum, which works for some people but not for everyone, is that if you're trying to minimize a bowl-shaped function, right? This is really the contours of a bowl; I guess I'm not very good at drawing. If you're trying to minimize this type of bowl-shaped function, then these derivative terms you can think of as providing acceleration to a ball that you're rolling downhill, and these momentum terms you can think of as representing the velocity. And so imagine that you have a bowl, and you take a ball, and the derivative imparts acceleration to this little ball as the little ball is rolling down this hill, right? And so it rolls faster and faster because of acceleration. And β, because this number is a little bit less than one, plays the role of friction, and it prevents your ball from speeding up without limit. So rather than gradient descent just taking every single step independently of all previous steps, now your little ball can roll downhill and gain momentum; it can accelerate down this bowl and therefore gain momentum. I find that this ball rolling down a bowl analogy seems to work for some people who enjoy physics intuitions, but it doesn't work for everyone. So if this analogy of a ball rolling down a bowl doesn't work for you, don't worry about it.

Finally, let's look at some details on how you implement this. Here's the algorithm, and so you now have two hyperparameters: the learning rate α, as well as this parameter β, which controls your exponentially weighted average. The most common value for β is 0.9. We were averaging over the last 10 days' temperature, so this is like averaging over the last 10 iterations' gradients. And in practice, β = 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value.

Well, and how about bias correction, right? So do you want to take v_dW and v_db and divide them by 1 − β^t? In practice, people don't usually do this, because after just 10 iterations your moving average will have warmed up and is no longer a biased estimate. So in practice I don't really see people bothering with bias correction when implementing gradient descent with momentum. And of course, this process initializes v_dW = 0. Note that this is a matrix of zeros with the same dimension as dW, which has the same dimension as W. And v_db is also initialized to a vector of zeros, so the same dimension as db, which in turn has the same dimension as b.

Finally, I just want to mention that if you read the literature on gradient descent with momentum, often you see it with this term omitted, with this 1 − β term omitted. So you end up with v_dW = β v_dW + dW. And the net effect of using this version in purple is that v_dW ends up being scaled by a factor of 1 − β, or really 1/(1 − β), and so when you're performing these gradient descent updates, α just needs to be rescaled to compensate for that factor of 1/(1 − β). In practice, both of these will work just fine; it just affects what's the best value of the learning rate α. But I find that this particular formulation is a little less intuitive, because one impact of this is that if you end up tuning the hyperparameter β, then this affects the scaling of v_dW and v_db as well, and so you end up needing to retune the learning rate α as well, maybe. So I personally prefer the formulation that I've written here on the left, rather than leaving out the 1 − β term. So I tend to use the formula on the left, the printed formula with the 1 − β term. But for both versions, having β = 0.9 is a common choice of hyperparameter. It's just that α, the learning rate, will need to be tuned differently for these two different versions.

So that's it for gradient descent with momentum. This will almost always work better than the straightforward gradient descent algorithm without momentum. But there are still other things we could do to speed up your learning algorithm. Let's continue talking about these in the next couple of videos.
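The update rule and the two formulations discussed in the transcript can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the function names (`init_velocity`, `momentum_step`, `toy_gradients`, `train`), the parameter shapes, and the toy quadratic cost that stands in for backpropagation are all assumptions made for the example. It initializes the velocities to zeros with the same shapes as W and b, and checks that dropping the 1 − β term gives the same trajectory once the learning rate is multiplied by 1 − β.

```python
import numpy as np

def init_velocity(params):
    # v_dW and v_db start as zeros with the same shapes as W and b.
    return {name: np.zeros_like(value) for name, value in params.items()}

def momentum_step(params, grads, velocity, alpha, beta=0.9, keep_one_minus_beta=True):
    # keep_one_minus_beta=True:  v = beta * v + (1 - beta) * grad  (form preferred in the lecture)
    # keep_one_minus_beta=False: v = beta * v + grad               (form with the term omitted)
    scale = (1.0 - beta) if keep_one_minus_beta else 1.0
    for name in params:
        velocity[name] = beta * velocity[name] + scale * grads[name]
        params[name] = params[name] - alpha * velocity[name]

def toy_gradients(params):
    # Hypothetical stand-in for backprop: gradient of 0.5 * ||W||^2 + 0.5 * ||b||^2.
    return {name: value.copy() for name, value in params.items()}

def train(alpha, keep_one_minus_beta, beta=0.9, steps=50):
    rng = np.random.default_rng(0)  # same seed, so both runs start from the same point
    params = {"W": rng.normal(size=(3, 2)), "b": rng.normal(size=(3, 1))}
    velocity = init_velocity(params)
    for _ in range(steps):
        momentum_step(params, toy_gradients(params), velocity,
                      alpha, beta, keep_one_minus_beta)
    return params

alpha, beta = 0.1, 0.9
with_term = train(alpha, keep_one_minus_beta=True, beta=beta)
# Without the (1 - beta) term the velocity is 1 / (1 - beta) times larger,
# so the learning rate is multiplied by (1 - beta) to compensate.
without_term = train(alpha * (1.0 - beta), keep_one_minus_beta=False, beta=beta)
print(np.allclose(with_term["W"], without_term["W"]))
```

The two runs are expected to agree up to floating-point rounding. This also illustrates why changing β in the second formulation changes the effective step size, which is the reason given in the transcript for preferring the version that keeps the 1 − β term.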

Original Description

Take the Deep Learning Specialization: http://bit.ly/2Tx5XGn
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai

Gradient descent with momentum is an optimization algorithm that speeds up convergence by computing an exponentially weighted average of the gradients and using that average to update the weights. It is most useful for cost functions with elongated contours, where plain gradient descent oscillates across the narrow direction and makes slow progress along the long one.

Key Takeaways

Compute the derivatives of the cost function
Compute the exponentially weighted average of the derivatives
Update the weights using the weighted average
Tune the hyperparameters, including the learning rate and beta
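
Written out as equations, the steps above correspond to the update below, applied on each iteration to the gradients dW and db of the current mini-batch. The notation follows the transcript; the velocities are assumed to start at zero, and the "about 10 iterations" reading of β = 0.9 uses the usual 1/(1 − β) rule of thumb for an exponentially weighted average.

```latex
\begin{aligned}
v_{dW} &\leftarrow \beta\, v_{dW} + (1-\beta)\, dW, &\qquad
v_{db} &\leftarrow \beta\, v_{db} + (1-\beta)\, db, \\
W &\leftarrow W - \alpha\, v_{dW}, &\qquad
b &\leftarrow b - \alpha\, v_{db}, \\
v_{dW} &= 0,\; v_{db} = 0 \text{ initially}, &\qquad
\beta &= 0.9 \;\Rightarrow\; \tfrac{1}{1-\beta} = 10 \text{ iterations averaged (approximately)}.
\end{aligned}
```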

💡 The momentum term in Gradient Descent with Momentum helps to smooth out the oscillations in the updates, allowing for a more straightforward path to the minimum.
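
A small experiment can make this concrete. The sketch below is an illustration under stated assumptions, not part of the lecture: the cost function (an elongated quadratic bowl with curvatures 1 and 100), the starting point, the step count, and the learning rates are all chosen for the example. It compares plain gradient descent at a learning rate small enough to be stable, plain gradient descent at a larger learning rate, and momentum at that same larger learning rate.

```python
import numpy as np

# Hypothetical elongated bowl: J(w) = 0.5 * (1 * w1^2 + 100 * w2^2).
# w1 is the shallow "horizontal" direction, w2 the steep "vertical" one.
CURVATURE = np.array([1.0, 100.0])

def grad(w):
    return CURVATURE * w

def plain_gd(w0, alpha, steps):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

def momentum_gd(w0, alpha, beta, steps):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)  # velocity initialized to zeros, same shape as w
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(w)
        w = w - alpha * v
    return w

start = [-10.0, 1.0]
# Distance from the minimum at the origin after 100 steps.
dist_small_lr = np.linalg.norm(plain_gd(start, alpha=0.019, steps=100))
dist_large_lr = np.linalg.norm(plain_gd(start, alpha=0.1, steps=100))
dist_momentum = np.linalg.norm(momentum_gd(start, alpha=0.1, beta=0.9, steps=100))

print("plain GD, alpha=0.019:", dist_small_lr)
print("plain GD, alpha=0.1:  ", dist_large_lr)
print("momentum, alpha=0.1:  ", dist_momentum)
```

On this toy problem, plain gradient descent with the larger learning rate overshoots in the steep direction and diverges, while momentum with the same learning rate stays stable and ends closer to the minimum than plain gradient descent with the small, safe learning rate. The numbers depend on the assumed curvatures and hyperparameters, so treat them as a demonstration of the effect rather than a general benchmark.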
