Gradient Descent With Momentum (C2W2L06)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

This video demonstrates the gradient descent with momentum algorithm, including its implementation and hyperparameter tuning, and explains how it speeds up convergence when optimizing a cost function.

Full Transcript

There's an algorithm called momentum, or gradient descent with momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that average to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this.

As an example, let's say that you're trying to optimize a cost function which has contours like this, where the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, either batch or mini-batch gradient descent, maybe you end up heading there. But now you're on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that. And then another step, another step, and so on. You see that gradient descent will take a lot of steps, slowly oscillating toward the minimum. These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. So the need to prevent the oscillations from getting too big forces you to use a learning rate that's not itself too large.

Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations, but on the horizontal axis you want faster learning, because you want to aggressively move from left to right, toward that minimum, toward that red dot.

So here's what you can do if you implement gradient descent with momentum. On each iteration, or more specifically on iteration t, you would compute the usual derivatives dW and db. I'll omit the superscript square bracket l's, but you compute dW and db on the current mini-batch. If you're using batch gradient descent, then the current mini-batch would be just your whole batch, and this works as well for batch gradient descent. So if your current mini-batch is your entire training set, this works fine as well. Then what you do is compute v_dW = β v_dW + (1 − β) dW. This is similar to when we were previously computing v_θ = β v_θ + (1 − β) θ_t, so it's computing a moving average of the derivatives for W that you're getting. Then you similarly compute v_db = β v_db + (1 − β) db. And then you would update your weights using W := W − α v_dW, so instead of updating with the derivative dW, you update with v_dW. Similarly, b gets updated as b := b − α v_db.

What this does is smooth out the steps of gradient descent. For example, let's say the last few derivatives you computed were this, this, this, this, this. If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something close to zero. So in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives are pointing to the right, so the average in the horizontal direction will still be pretty big. That's why with this algorithm, within a few iterations, you find that gradient descent with momentum ends up taking steps that have much smaller oscillations in the vertical direction but are more directed to just moving quickly in the horizontal direction. This allows your algorithm to take a more straightforward path, or to damp out the oscillations in its path to the minimum.

One intuition for this momentum, which works for some people but not everyone, is that if you're trying to minimize a bowl-shaped function (these are really the contours of a bowl, I'm just not very good at drawing), then these derivative terms you can think of as providing acceleration to a ball that you're rolling downhill, and these momentum terms you can think of as representing the velocity. So imagine that you have a bowl, and you take a ball, and the derivative imparts acceleration to this little ball as it rolls down the hill. So it rolls faster and faster because of the acceleration. And β, because it is a number a little bit less than one, plays the role of friction, and it prevents your ball from speeding up without limit. So rather than gradient descent taking every single step independently of all previous steps, now your little ball can roll downhill and gain momentum: it can accelerate down this bowl and therefore gain momentum. I find that this ball-rolling-down-a-bowl analogy seems to work for some people who enjoy physics intuitions, but it doesn't work for everyone. So if this analogy of a ball rolling down a bowl doesn't work for you, don't worry about it.

Finally, let's look at some details on how you implement this. Here's the algorithm, and you now have two hyperparameters: the learning rate α, as well as this parameter β, which controls your exponentially weighted average. The most common value for β is 0.9. Earlier, we were averaging over the last ten days' temperature, so this is like averaging over the last ten iterations' gradients. In practice, β = 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value.

And how about bias correction? Do you want to take v_dW and v_db and divide them by 1 − β^t? In practice, people don't usually do this, because after just ten iterations your moving average will have warmed up and is no longer a biased estimate. So in practice I don't really see people bothering with bias correction when implementing gradient descent with momentum. And of course, this process initializes v_dW to 0. Note that this is a matrix of zeros with the same dimension as dW, which has the same dimension as W. And v_db is also initialized to a vector of zeros, with the same dimension as db, which in turn has the same dimension as b.

Finally, I just want to mention that if you read the literature on gradient descent with momentum, you often see it with this term omitted, with the 1 − β term omitted. So you end up with v_dW = β v_dW + dW. The net effect of using this version in purple is that v_dW ends up being scaled by a factor of 1/(1 − β). So when you're performing these gradient descent updates, α just needs to change by a corresponding factor of 1/(1 − β). In practice, both of these will work just fine; it just affects what the best value of the learning rate α is. But I find that this particular formulation is a little less intuitive, because one impact of this is that if you end up tuning the hyperparameter β, then this affects the scaling of v_dW and v_db as well, and so you end up needing to retune the learning rate α as well, maybe. So I personally prefer the formulation that I've written here on the left rather than leaving out the 1 − β term. I tend to use the formula on the left, the printed formula with the 1 − β term. But for both versions, β = 0.9 is a common choice of hyperparameter. It's just that α, the learning rate, will need to be tuned differently for these two different versions.

So that's it for gradient descent with momentum. This will almost always work better than the straightforward gradient descent algorithm without momentum. But there are still other things we could do to speed up your learning algorithm. Let's continue talking about these in the next couple of videos.
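
The update described in the transcript can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the course: the function name `momentum_step`, the array shapes, and the example values of α and the gradients are assumptions chosen for the demonstration. It uses the form that keeps the 1 − β term and initializes the velocities to zeros with the same shapes as W and b.

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One update of gradient descent with momentum (form with the 1 - beta term).

    `momentum_step` is a hypothetical helper name used only for this sketch.
    """
    # Exponentially weighted averages of the gradients.
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Update the parameters with the averaged gradients instead of the raw ones.
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Illustrative (assumed) shapes and values for a single layer.
W = np.ones((2, 3))
b = np.zeros((2, 1))
v_dW = np.zeros_like(W)   # same dimension as dW, hence as W
v_db = np.zeros_like(b)   # same dimension as db, hence as b

# Stand-ins for the gradients computed on the current mini-batch.
dW = np.full_like(W, 2.0)
db = np.full_like(b, -1.0)

W, b, v_dW, v_db = momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.1, beta=0.9)
```

In a real training loop, `dW` and `db` would come from backpropagation on each mini-batch, and the returned velocities would be carried over to the next iteration.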

Original Description

Take the Deep Learning Specialization: http://bit.ly/2Tx5XGn
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai

Playlist

Uploads from DeepLearningAI · DeepLearningAI · 19 of 60

1 Forward and Backward Propagation (C1W4L06)
2 deeplearning.ai's Heroes of Deep Learning: Yuanqing Lin
3 deeplearning.ai's Heroes of Deep Learning: Ruslan Salakhutdinov
4 deeplearning.ai's Heroes of Deep Learning: Yoshua Bengio
5 deeplearning.ai's Heroes of Deep Learning: Pieter Abbeel
6 deeplearning.ai's Heroes of Deep Learning: Ian Goodfellow
7 deeplearning.ai's Heroes of Deep Learning: Andrej Karpathy
8 Using an Appropriate Scale (C2W3L02)
9 Gradient Checking (C2W1L13)
10 Gradient Checking Implementation Notes (C2W1L14)
11 Learning Rate Decay (C2W2L09)
12 Understanding Mini-Batch Gradient Dexcent (C2W2L02)
13 Mini Batch Gradient Descent (C2W2L01)
14 The Problem of Local Optima (C2W3L10)
15 Exponentially Weighted Averages (C2W2L03)
16 Tuning Process (C2W3L01)
17 Understanding Exponentially Weighted Averages (C2W2L04)
18 Bias Correction of Exponentially Weighted Averages (C2W2L05)
19 Gradient Descent With Momentum (C2W2L06)
20 Normalizing Activations in a Network (C2W3L04)
21 Hyperparameter Tuning in Practice (C2W3L03)
22 Adam Optimization Algorithm (C2W2L08)
23 RMSProp (C2W2L07)
24 Fitting Batch Norm Into Neural Networks (C2W3L05)
25 Why Does Batch Norm Work? (C2W3L06)
26 Batch Norm At Test Time (C2W3L07)
27 Softmax Regression (C2W3L08)
28 Deep Learning Frameworks (C2W3L10)
29 Neural Network Overview (C1W3L01)
30 Training Softmax Classifier (C2W3L09)
31 Why Deep Representations? (C1W4L04)
32 Gradient Descent For Neural Networks (C1W3L09)
33 Neural Network Representations (C1W3L02)
34 TensorFlow (C2W3L11)
35 Activation Functions (C1W3L06)
36 Explanation For Vectorized Implementation (C1W3L05)
37 Getting Matrix Dimensions Right (C1W4L03)
38 Understanding Dropout (C2W1L07)
39 Building Blocks of a Deep Neural Network (C1W4L05)
40 Why Non-linear Activation Functions (C1W3L07)
41 Computing Neural Network Output (C1W3L03)
42 Backpropagation Intuition (C1W3L10)
43 Train/Dev/Test Sets (C2W1L01)
44 Deep L-Layer Neural Network (C1W4L01)
45 Random Initialization (C1W3L11)
46 Other Regularization Methods (C2W1L08)
47 Normalizing Inputs (C2W1L09)
48 Derivatives Of Activation Functions (C1W3L08)
49 Parameters vs Hyperparameters (C1W4L07)
50 Vectorizing Across Multiple Examples (C1W3L04)
51 What does this have to do with the brain? (C1W4L08)
52 Dropout Regularization (C2W1L06)
53 Vanishing/Exploding Gradients (C2W1L10)
54 Basic Recipe for Machine Learning (C2W1L03)
55 Bias/Variance (C2W1L02)
56 Forward Propagation in a Deep Network (C1W4L02)
57 Weight Initialization in a Deep Network (C2W1L11)
58 Numerical Approximations of Gradients (C2W1L12)
59 Regularization (C2W1L04)
60 Why Regularization Reduces Overfitting (C2W1L05)

Gradient descent with momentum is an optimization algorithm that speeds up convergence by computing an exponentially weighted average of the gradients and using that average, rather than the raw gradient, to update the weights. It helps most when the cost function has elongated contours, where plain gradient descent oscillates across the narrow direction and makes slow progress along the long one.
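
Written out, the update from the video, with learning rate α and averaging parameter β, and with both velocities initialized to zero, is:

```latex
\begin{aligned}
v_{dW} &:= \beta\, v_{dW} + (1-\beta)\, dW, & W &:= W - \alpha\, v_{dW},\\
v_{db} &:= \beta\, v_{db} + (1-\beta)\, db, & b &:= b - \alpha\, v_{db}.
\end{aligned}
```

The video also mentions a variant that omits the 1 − β term. The tilde symbols below are notation introduced here only to tell the two versions apart. Since both velocities start at zero and see the same gradients, the variant's velocity is the original one scaled by 1/(1 − β), so the two versions take identical steps when the learning rate is rescaled to compensate:

```latex
\tilde{v}_{dW} := \beta\, \tilde{v}_{dW} + dW
\quad\Longrightarrow\quad
\tilde{v}_{dW} = \frac{v_{dW}}{1-\beta},
\qquad
\tilde{\alpha}\, \tilde{v}_{dW} = \alpha\, v_{dW}
\;\text{ when }\;
\tilde{\alpha} = (1-\beta)\,\alpha .
```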

Key Takeaways
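  1. Compute the derivatives dW and db of the cost function on the current mini-batch
  2. Compute the exponentially weighted averages v_dW and v_db of those derivatives
  3. Update the weights and biases using the weighted averages instead of the raw derivatives
  4. Tune the hyperparameters: the learning rate α and β (0.9 is a common, robust choice for β)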
💡 The momentum term in Gradient Descent with Momentum helps to smooth out the oscillations in the updates, allowing for a more straightforward path to the minimum.
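
The smoothing effect can be seen on a toy problem. The sketch below is an illustration under assumptions, not material from the video: the elongated quadratic cost, the starting point, the step count, and the learning rate are all made up for the demonstration. The learning rate is deliberately picked slightly too large for plain gradient descent along the steep axis, so plain gradient descent overshoots further on every step, while the momentum version with β = 0.9 damps the oscillation and converges.

```python
import numpy as np

# Hypothetical elongated bowl: J(w) = 0.5 * (1 * w[0]**2 + 50 * w[1]**2).
# The second coordinate is the steep "vertical" direction.
CURVATURE = np.array([1.0, 50.0])

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def run(alpha, beta, steps=200):
    """Run gradient descent with momentum; beta = 0 reduces to plain gradient descent."""
    w = np.array([10.0, 1.0])   # assumed starting point
    v = np.zeros_like(w)        # velocity initialized to zeros
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - alpha * v
    return loss(w)

# Same learning rate for both runs; only beta differs.
plain_loss = run(alpha=0.05, beta=0.0)
momentum_loss = run(alpha=0.05, beta=0.9)
print(f"plain gradient descent loss: {plain_loss:.3e}")
print(f"momentum loss:               {momentum_loss:.3e}")
```

This is a single hand-picked case, so it should be read as an illustration of the oscillation-damping idea rather than as evidence about performance in general.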
