RMSProp (C2W2L07)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

Explains the RMSProp optimization algorithm, which speeds up gradient descent by damping oscillations

Full Transcript

You've seen how using momentum can speed up gradient descent. There's another algorithm called RMSProp, which stands for root mean square prop, that can also speed up gradient descent. Let's see how it works.

Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction. In order to provide intuition for this example, let's say that the vertical axis is the parameter b and the horizontal axis is the parameter W. It really could be W1 and W2, or some other set of parameters, but we'll name them b and W for the sake of intuition. So you want to slow down the learning in the b direction, the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction. This is what the RMSProp algorithm does to accomplish this.

On iteration t, it will compute, as usual, the derivatives dW and db on the current mini-batch. I'm going to keep this exponentially weighted average, but instead of V_dW I'm going to use the new notation S_dW. So S_dW is equal to beta times its previous value plus one minus beta times dW squared:

    S_dW = β · S_dW + (1 − β) · dW²

Sometimes I write this as dW ** 2 to denote exponentiation, but I'm going to write it as dW squared. For clarity, this squaring operation is an element-wise squaring operation. So what this is doing is really keeping an exponentially weighted average of the squares of the derivatives. Similarly, we also have

    S_db = β · S_db + (1 − β) · db²

and again, the squaring is an element-wise operation.

RMSProp then updates the parameters as follows. W gets updated as W minus the learning rate times, whereas previously we had alpha times dW, now dW divided by the square root of S_dW:

    W := W − α · dW / sqrt(S_dW)

and b gets updated as b minus the learning rate times, instead of just the gradient, the gradient divided by the square root of S_db:

    b := b − α · db / sqrt(S_db)

So let's gain some intuition about how this works. Recall that in the horizontal direction, or in this example the W direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example the b direction, we want to slow down or damp out the oscillations. So with these terms S_dW and S_db, what we're hoping is that S_dW will be relatively small, so that here we're dividing by a relatively small number, whereas S_db will be relatively large, so that here we're dividing by a relatively large number in order to slow down the updates in the vertical direction.

And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction. The slope is very large in the b direction. So with derivatives like this, you have a very large db and a relatively small dW, because the function is sloped much more steeply in the vertical direction, that is the b direction, than in the W direction, the horizontal direction. And so db squared will be relatively large, so S_db will be relatively large, whereas compared to that dW will be smaller, dW squared will be smaller, and so S_dW will be smaller. The net effect of this is that your updates in the vertical direction are divided by a much larger number, and that helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.

So the net impact of using RMSProp is that your updates will end up looking more like this: your updates in the vertical direction get damped out, but in the horizontal direction you can keep going. One effect of this is also that you could therefore use a larger learning rate alpha and get faster learning without diverging in the vertical direction.

Now, just for the sake of clarity, I've been calling the vertical and horizontal directions b and W just to illustrate this. In practice you're in a very high-dimensional space of parameters, so maybe the vertical dimensions, where you're trying to damp out oscillations, are some set of parameters W1, W2, W17, and the horizontal dimensions might be W3, W4, and so on. So the separation into W and b is just an illustration. In practice dW is a very high-dimensional parameter vector, and db is also a very high-dimensional parameter vector, but the intuition is that in dimensions where you're getting these oscillations, you end up computing a larger sum, or weighted average, of the squares of the derivatives, and so you end up damping out the directions in which there are these oscillations. So that's RMSProp, and it stands for root mean square prop because here you're squaring the derivatives and then you take the square root here at the end.

Finally, just a couple of last details on this algorithm before we move on. In the next video we're actually going to combine RMSProp together with momentum. So rather than using the hyperparameter beta, which we had used for momentum, I'm going to call this hyperparameter beta 2, just to avoid using the same hyperparameter name for both momentum and RMSProp. Also, you want to make sure that your algorithm doesn't divide by zero: if the square root of S_dW is very close to zero, then this thing could blow up. Just to ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator. It doesn't really matter what epsilon is used; 10 to the negative 8 would be a reasonable default. This just ensures slightly greater numerical stability, so that, whether from numerical round-off or other reasons, you don't end up dividing by a very, very small number.

So that's RMSProp. Similar to momentum, it has the effect of damping out the oscillations in gradient descent and in mini-batch gradient descent, and allowing you to maybe use a larger learning rate alpha, and certainly speeding up the learning speed of your algorithm. So now you know how to implement RMSProp, and this will be another way for you to speed up your learning algorithm.

One fun fact about RMSProp: it was actually first proposed not in an academic research paper, but in a Coursera course that Geoff Hinton had taught many years ago. I guess Coursera wasn't intended to be a platform for dissemination of novel academic research, but it worked out pretty well in that case. It was really from the Coursera course that RMSProp started to become widely known, and it really took off.

We talked about momentum and we talked about RMSProp. It turns out that if you put them together, you can get an even better optimization algorithm. Let's talk about that in the next video.
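
The update rule described in the transcript can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not the course's reference implementation: the function name `rmsprop_update`, the toy quadratic loss, the starting point, and the defaults `alpha=0.01` and `beta2=0.9` are all chosen here for illustration. Only the epsilon default of 1e-8 comes from the lecture, and adding it outside the square root is one common placement rather than the only one.

```python
import numpy as np


def rmsprop_update(param, grad, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """Apply one RMSProp step and return (new_param, new_s).

    s is the exponentially weighted average of the element-wise
    squared gradients (S_dW or S_db in the lecture's notation).
    """
    s = beta2 * s + (1.0 - beta2) * grad ** 2          # element-wise square
    param = param - alpha * grad / (np.sqrt(s) + eps)  # eps guards against division by ~0
    return param, s


# Hypothetical toy loss: an elongated bowl that is shallow along W
# (index 0) and steep along b (index 1), mimicking the lecture's picture.
def loss(theta):
    w, b = theta
    return 0.5 * (0.1 * w ** 2 + 10.0 * b ** 2)


def grad_fn(theta):
    w, b = theta
    return np.array([0.1 * w, 10.0 * b])


theta = np.array([5.0, 1.0])   # assumed starting point (W, b)
s = np.zeros_like(theta)       # S starts at zero, like V in momentum
initial_loss = loss(theta)

for _ in range(200):
    theta, s = rmsprop_update(theta, grad_fn(theta), s)

final_loss = loss(theta)
print("initial loss:", initial_loss, "final loss:", final_loss)
```

Because each coordinate's gradient is divided by the root of its own running average, the very first step from a zero-initialized `s` has the same magnitude, about `alpha / sqrt(1 - beta2)`, in both directions, even though the gradient along b is far larger than the gradient along W. That per-coordinate rescaling is what the lecture relies on to damp the steep direction.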

Original Description

Take the Deep Learning Specialization: http://bit.ly/2PFq843
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai

