Gradient Descent With Momentum (C2W2L06)
Key Takeaways
This video demonstrates gradient descent with momentum, including its implementation and hyperparameter tuning, as a way to speed up convergence when optimizing cost functions.
Full Transcript
There's an algorithm called momentum, or gradient descent with momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this.
As an example, let's say that you're trying to optimize a cost function which has contours like this, where the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, either batch or mini-batch gradient descent, maybe you end up heading there. But now you're on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that, and then another step, another step, and so on. You see that gradient descent will take a lot of steps and just slowly oscillate toward the minimum. These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. So the need to prevent the oscillations from getting too big forces you to use a learning rate that is not itself too large.
Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations, but on the horizontal axis you want faster learning, because you want to aggressively move from left to right toward that minimum, toward that red dot.
So here's what you can do if you implement gradient descent with momentum. On each iteration, or more specifically during iteration t, you would compute the usual derivatives dW, db (I'll omit the superscript square bracket l's) on the current mini-batch. If you're using batch gradient descent, then the current mini-batch would be just your whole batch, and this works as well for batch gradient descent. So if your current mini-batch is your entire training set, this works fine as well. Then what you do is you compute v_dW = β v_dW + (1 − β) dW. This is similar to when we were previously computing v_θ = β v_θ + (1 − β) θ_t, so it's computing a moving average of the derivatives for W you're getting. Then you similarly compute v_db = β v_db + (1 − β) db. And then you would update your weights using W := W − α v_dW, so instead of updating it with dW, the derivative, you update it with v_dW, and similarly b := b − α v_db.
So what this does is smooth out the steps of gradient descent. For example, let's say the last few derivatives you computed were this, this, this, this, this. If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives are pointing to the right, so the average in the horizontal direction will still be pretty big. So that's why with this algorithm, after a few iterations, you find that gradient descent with momentum ends up eventually just taking steps that have much smaller oscillations in the vertical direction but are more directed to just moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in its path to the minimum.
One intuition for this momentum, which works for some people but not everyone, is that if you're trying to minimize a bowl-shaped function (this is really the contours of a bowl; I'm not very good at drawing), then these derivative terms you can think of as providing acceleration to a ball that you're rolling downhill, and these momentum terms you can think of as representing the velocity. So imagine that you have a bowl and you take a ball, and the derivative imparts acceleration to this little ball as it rolls down this hill, so it rolls faster and faster because of the acceleration. And β, because this number is a little bit less than one, plays the role of friction and prevents your ball from speeding up without limit. So rather than gradient descent just taking every single step independently of all previous steps, now your little ball can roll downhill and gain momentum; it can accelerate down this bowl and therefore gain momentum. I find that this ball-rolling-down-a-bowl analogy seems to work for some people who enjoy physics intuitions, but it doesn't work for everyone, so if this analogy doesn't work for you, don't worry about it.
Finally, let's look at some details on how you implement this. Here's the algorithm, and you now have two hyperparameters: the learning rate α, as well as this parameter β, which controls your exponentially weighted average. The most common value for β is 0.9. Previously we were averaging over the last 10 days' temperature, so this is like averaging over the last 10 iterations' gradients. In practice, β = 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value.
And how about bias correction? Do you want to take v_dW and v_db and divide them by 1 − β^t? In practice, people don't usually do this, because after just 10 iterations your moving average will have warmed up and is no longer a biased estimate. So in practice I don't really see people bothering with bias correction when implementing gradient descent with momentum. And of course, this process initializes v_dW = 0. Note that this is a matrix of zeros of the same dimension as dW, which is the same dimension as W, and v_db is also initialized to a vector of zeros of the same dimension as db, which in turn is the same dimension as b.
Finally, I just want to mention that if you read the literature on gradient descent with momentum, often you see it with this term omitted, with the 1 − β term omitted, so you end up with v_dW = β v_dW + dW. The net effect of using this version in purple is that v_dW ends up being scaled by a factor of 1 − β, or really 1/(1 − β), and so when you're performing these gradient descent updates, α just needs to change by a corresponding value of 1/(1 − β). In practice, both of these will work just fine; it just affects what's the best value of the learning rate α. But I find that this particular formulation is a little less intuitive, because one impact of this is that if you end up tuning the hyperparameter β, then this affects the scaling of v_dW and v_db as well, and so you end up needing to retune the learning rate α as well, maybe. So I personally prefer the formulation that I've written here on the left, rather than leaving out the 1 − β term. So I tend to use the formula on the left, the printed formula with the 1 − β term. But for both versions, β = 0.9 is a common choice of hyperparameter; it's just that α, the learning rate, will need to be tuned differently for these two different versions.
So that's it for gradient descent with momentum. This will almost always work better than the straightforward gradient descent algorithm without momentum, but there are still other things we could do to speed up your learning algorithm. Let's continue talking about these in the next couple of videos.
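The update rule described in the transcript can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the lecture: the function names (`init_velocity`, `momentum_step`, `toy_grads`, `run`), the `use_one_minus_beta` flag, the dictionary layout for the parameters, and the toy bowl-shaped cost are all assumptions made for the example. It shows the zero initialization of the velocity terms, the update with the 1 − β term, and the variant without it. Because dropping the 1 − β term makes the velocity larger by a factor of 1/(1 − β), the sketch multiplies α by 1 − β in that variant so both runs follow the same path.

```python
import numpy as np


def init_velocity(params):
    """Start every velocity term at zero, with the same shape as its parameter."""
    return {name: np.zeros_like(value) for name, value in params.items()}


def momentum_step(params, grads, velocity, alpha, beta=0.9, use_one_minus_beta=True):
    """Apply one momentum update to the dicts and return (params, velocity).

    use_one_minus_beta=True : v = beta * v + (1 - beta) * grad  (form preferred in the lecture)
    use_one_minus_beta=False: v = beta * v + grad               (form often seen in the literature)
    """
    grad_scale = (1.0 - beta) if use_one_minus_beta else 1.0
    for name in params:
        velocity[name] = beta * velocity[name] + grad_scale * grads[name]
        params[name] = params[name] - alpha * velocity[name]
    return params, velocity


def toy_grads(params):
    """Gradients of a hypothetical elongated bowl:
    J = 0.5 * (W[0]**2 + 20 * W[1]**2) + 0.5 * b**2
    """
    curvature = np.array([1.0, 20.0])
    return {"W": curvature * params["W"], "b": params["b"]}


def run(alpha, beta, use_one_minus_beta, steps=200):
    """Run momentum on the toy cost from a fixed (arbitrary) starting point."""
    params = {"W": np.array([10.0, 1.0]), "b": np.array([1.0])}
    velocity = init_velocity(params)
    for _ in range(steps):
        params, velocity = momentum_step(
            params, toy_grads(params), velocity, alpha, beta, use_one_minus_beta
        )
    return params


beta = 0.9
with_term = run(alpha=0.05, beta=beta, use_one_minus_beta=True)
# Without (1 - beta), v is larger by 1 / (1 - beta), so alpha is scaled by (1 - beta) to match.
without_term = run(alpha=0.05 * (1.0 - beta), beta=beta, use_one_minus_beta=False)
print(with_term["W"], without_term["W"])
```

The two printed parameter vectors should agree up to floating-point error, which illustrates the point made above: both formulations work, and the choice only changes which value of α is appropriate. Bias correction is left out, in line with the transcript's remark that it is rarely used with momentum.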
Original Description
Take the Deep Learning Specialization: http://bit.ly/2Tx5XGn
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch