RMSProp (C2W2L07)
Skills:
ML Maths Basics: 80%
Key Takeaways
Explains the RMSProp optimization algorithm for speeding up gradient descent
Full Transcript
You've seen how using momentum can speed up gradient descent. There's another algorithm called RMSProp, which stands for root mean square prop, that can also speed up gradient descent. Let's see how it works.

Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction. To provide intuition for this example, let's say that the vertical axis is the parameter b and the horizontal axis is the parameter w. It really could be w1 and w2 or some other set of parameters; I'm just naming them b and w for the sake of intuition. So you want to slow down the learning in the b direction, the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction. This is what the RMSProp algorithm does to accomplish this.

On iteration t, it will compute as usual the derivatives dW and db on the current mini-batch. I'm going to keep this exponentially weighted average, but instead of V_dW I'm going to use the new notation S_dW. So S_dW is equal to beta times its previous value plus (1 - beta) times dW squared. Sometimes I write this as dW ** 2 to denote exponentiation, but I'll just write it as dW squared. For clarity, this squaring operation is an element-wise squaring operation. What this is doing is really keeping an exponentially weighted average of the squares of the derivatives. Similarly, we also have S_db equals beta times S_db plus (1 - beta) times db squared, and again the squaring is an element-wise operation.

RMSProp then updates the parameters as follows. w gets updated as w minus the learning rate times, whereas previously we had alpha times dW, now dW divided by the square root of S_dW. And b gets updated as b minus the learning rate times, instead of just the gradient, db divided by the square root of S_db.

Let's gain some intuition about how this works. Recall that in the horizontal direction, or in this example the w direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example the b direction, we want to slow down or damp out the oscillations. So with these terms S_dW and S_db, what we're hoping is that S_dW will be relatively small, so that here we're dividing by a relatively small number, whereas S_db will be relatively large, so that here we're dividing by a relatively large number in order to slow down the updates in the vertical direction. And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction. The slope is very large in the b direction. So with derivatives like this, you have a very large db and a relatively small dW, because the function is sloped much more steeply in the vertical direction, that is the b direction, than in the w direction, the horizontal direction. And so db squared will be relatively large, so S_db will be relatively large, whereas compared to that dW will be smaller, dW squared will be smaller, and so S_dW will be smaller.

The net effect of this is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number. So the net impact of using RMSProp is that your updates will end up looking more like this: your updates in the vertical direction get damped out, but in the horizontal direction you can keep going. One effect of this is also that you could therefore use a larger learning rate alpha and get faster learning without diverging in the vertical direction.

Now, just for the sake of clarity, I've been calling the vertical and horizontal directions b and w just to illustrate this. In practice you're in a very high-dimensional space of parameters, so maybe the vertical dimensions where you're trying to damp out oscillations are some set of parameters w1, w2, w17, and the horizontal dimensions might be w3, w4, and so on. So the separation into w and b is just an illustration. In practice dW is a very high-dimensional parameter vector, and db is also a very high-dimensional parameter vector, but the intuition is that in dimensions where you're getting these oscillations, you end up computing a larger sum, or weighted average, of these squares of derivatives, and so you end up damping out the directions in which there are these oscillations. So that's RMSProp, and it stands for root mean square prop because here you're squaring the derivatives and then you take the square root here at the end.

Finally, just a couple of last details on this algorithm before we move on. In the next video, we're actually going to combine RMSProp together with momentum. So rather than using the hyperparameter beta, which we had used for momentum, I'm going to call this hyperparameter beta 2, just so we don't use the same hyperparameter name for both momentum and RMSProp. Also, you want to make sure that your algorithm doesn't divide by zero. What if the square root of S_dW is very close to zero? Then this thing could blow up. To ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator. It doesn't really matter what epsilon is used; 10 to the negative 8 would be a reasonable default. This just ensures slightly greater numerical stability, so that, whether from numerical round-off or other reasons, you don't end up dividing by a very, very small number.

So that's RMSProp. Similar to momentum, it has the effect of damping out the oscillations in gradient descent and in mini-batch gradient descent, allowing you to maybe use a larger learning rate alpha, and certainly speeding up the learning speed of your algorithm. So now you know how to implement RMSProp, and this will be another way for you to speed up your learning algorithm.

One fun fact about RMSProp: it was actually first proposed not in an academic research paper but in a Coursera course that Geoff Hinton had taught many years ago. I guess Coursera wasn't intended to be a platform for dissemination of novel academic research, but it worked out pretty well in that case. It was really from the Coursera course that RMSProp started to become widely known, and it really took off. We've talked about momentum, and we've talked about RMSProp. It turns out that if you put them together, you can get an even better optimization algorithm. Let's talk about that in the next video.
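The update rule described in the transcript can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the function names, the toy cost J(w, b) = 0.5 * (w^2 + 25 * b^2), the starting point, and the default values alpha = 0.01 and beta2 = 0.9 are all assumptions chosen for the demo. Adding epsilon outside the square root is one common convention; the transcript only says epsilon is added to the denominator.

```python
import numpy as np


def rmsprop_step(param, grad, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp update for a parameter (scalar or array).

    s is the exponentially weighted average of squared gradients
    (S_dW or S_db in the transcript). alpha, beta2 and eps follow the
    transcript's names; the default alpha and beta2 are assumptions.
    """
    # Element-wise square of the gradient, averaged over iterations.
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    # Divide the gradient by sqrt(s); eps guards against division by ~0.
    # (Placing eps outside the square root is an assumed convention.)
    param = param - alpha * grad / (np.sqrt(s) + eps)
    return param, s


def toy_cost(w, b):
    # Hypothetical elongated bowl: much steeper in b than in w.
    return 0.5 * (w ** 2 + 25.0 * b ** 2)


def toy_gradients(w, b):
    # Analytic gradients of toy_cost: dJ/dw = w, dJ/db = 25 * b.
    return w, 25.0 * b


def run_rmsprop(steps=500, alpha=0.01):
    """Run RMSProp on the toy cost from the (assumed) start w = b = 1."""
    w, b = 1.0, 1.0
    s_dw, s_db = 0.0, 0.0
    for _ in range(steps):
        dw, db = toy_gradients(w, b)
        w, s_dw = rmsprop_step(w, dw, s_dw, alpha=alpha)
        b, s_db = rmsprop_step(b, db, s_db, alpha=alpha)
    return float(w), float(b)


if __name__ == "__main__":
    w, b = run_rmsprop()
    print("final w, b:", w, b, "cost:", toy_cost(w, b))
```

Because each gradient is divided by the root of its own running average, the steep b direction and the shallow w direction take steps of similar size, which matches the damping effect the lecture describes.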
Original Description
Take the Deep Learning Specialization: http://bit.ly/2PFq843
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai