RMSProp (C2W2L07)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Skills: ML Maths Basics · 80%

Key Takeaways

Explains how the RMSProp optimization algorithm speeds up gradient descent by damping oscillations

Full Transcript

You've seen how using momentum can speed up gradient descent. There's another algorithm called RMSProp, which stands for root mean square prop, that can also speed up gradient descent. Let's see how it works.

Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction. To provide intuition for this example, let's say that the vertical axis is the parameter $b$ and the horizontal axis is the parameter $w$. It really could be $w_1$ and $w_2$ or some other set of parameters, but I'll name them $b$ and $w$ for the sake of intuition. So you want to slow down the learning in the $b$ direction, the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction. This is what the RMSProp algorithm does to accomplish this.

On iteration $t$, it will compute as usual the derivatives $dW$, $db$ on the current mini-batch. I'm going to keep an exponentially weighted average, and instead of $V_{dW}$ I'm going to use the new notation $S_{dW}$. So $S_{dW}$ is equal to $\beta$ times its previous value plus $1 - \beta$ times $dW$ squared:

$$S_{dW} = \beta \, S_{dW} + (1 - \beta) \, dW^2$$

Sometimes I write this as dW ** 2 to denote exponentiation, but here I'll just write $dW^2$. For clarity, this squaring operation is an element-wise squaring operation. So what this is doing is really keeping an exponentially weighted average of the squares of the derivatives. And similarly we also have

$$S_{db} = \beta \, S_{db} + (1 - \beta) \, db^2$$

and again the squaring is an element-wise operation. RMSProp then updates the parameters as follows. $w$ gets updated as $w$ minus the learning rate times, whereas previously we had $\alpha$ times $dW$, now $dW$ divided by the square root of $S_{dW}$. And $b$ gets updated as $b$ minus the learning rate times, instead of just the gradient, the gradient divided by the square root of $S_{db}$:

$$w := w - \alpha \, \frac{dW}{\sqrt{S_{dW}}}, \qquad b := b - \alpha \, \frac{db}{\sqrt{S_{db}}}$$

So let's gain some intuition about how this works. Recall that in the horizontal direction, in this example the $w$ direction, we want learning to go pretty fast, whereas in the vertical direction, in this example the $b$ direction, we want to slow down or damp out the oscillations. So with these terms $S_{dW}$ and $S_{db}$, what we're hoping is that $S_{dW}$ will be relatively small, so that here we're dividing by a relatively small number, whereas $S_{db}$ will be relatively large, so that here we're dividing by a relatively large number in order to slow down the updates in the vertical direction.

And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction. The slope is very large in the $b$ direction. So with derivatives like this, you have a very large $db$ and a relatively small $dW$, because the function is sloped much more steeply in the vertical direction, that is the $b$ direction, than in the $w$ direction, the horizontal direction. And so $db^2$ will be relatively large, so $S_{db}$ will be relatively large, whereas compared to that $dW$ will be smaller, $dW^2$ will be smaller, and so $S_{dW}$ will be smaller. The net effect is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.

So the net impact of using RMSProp is that your updates will end up looking more like this: the updates in the vertical direction get damped out, but in the horizontal direction you can keep going. One effect of this is also that you could therefore use a larger learning rate $\alpha$ and get faster learning without diverging in the vertical direction.

Now, just for the sake of clarity, I've been calling the vertical and horizontal directions $b$ and $w$ just to illustrate this. In practice you're in a very high-dimensional space of parameters, so maybe the vertical dimensions where you're trying to damp out oscillations are some set of parameters $w_1$, $w_2$, $w_{17}$, and the horizontal dimensions might be $w_3$, $w_4$, and so on. So the separation into $w$ and $b$ is just an illustration. In practice $dW$ is a very high-dimensional parameter vector, and $db$ is also a very high-dimensional parameter vector, but the intuition is that in dimensions where you're getting these oscillations, you end up computing a larger weighted average of the squares of the derivatives, and so you end up damping out the directions in which there are these oscillations. So that's RMSProp, and it stands for root mean square prop because here you're squaring the derivatives and then you take the square root at the end.

Finally, just a couple of last details on this algorithm before we move on. In the next video, we're actually going to combine RMSProp together with momentum. So rather than using the hyperparameter $\beta$, which we had used for momentum, I'm going to call this hyperparameter $\beta_2$, just so we don't use the same hyperparameter for both momentum and RMSProp. Also, you want to make sure that your algorithm doesn't divide by zero. What if the square root of $S_{dW}$ is very close to zero? Then this thing could blow up. To ensure numerical stability, when you implement this in practice you add a very, very small $\epsilon$ to the denominator. It doesn't really matter what $\epsilon$ is used; $10^{-8}$ would be a reasonable default. This just ensures slightly greater numerical stability, so that whether because of numerical round-off or other reasons, you don't end up dividing by a very, very small number.

So that's RMSProp. Similar to momentum, it has the effect of damping out the oscillations in gradient descent and in mini-batch gradient descent, allowing you to maybe use a larger learning rate $\alpha$, and certainly speeding up the learning speed of your algorithm. So now you know how to implement RMSProp, and this will be another way for you to speed up your learning algorithm.

One fun fact about RMSProp: it was actually first proposed not in an academic research paper, but in a Coursera course that Geoff Hinton taught many years ago. I guess Coursera wasn't intended to be a platform for dissemination of novel academic research, but it worked out pretty well in that case. It was really from the Coursera course that RMSProp started to become widely known, and it really took off.

We've talked about momentum and we've talked about RMSProp. It turns out that if you put them together, you can get an even better optimization algorithm. Let's talk about that in the next video.
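The update rule described in the transcript can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: the function names `rmsprop_step` and `grad`, the two-parameter quadratic loss, the curvature value 50, and the learning rate 0.05 are all assumptions chosen to mimic the "steep in $b$, shallow in $w$" picture. Adding $\epsilon$ outside the square root is also an assumption about placement; the transcript only says to add it to the denominator.

```python
import math


def rmsprop_step(params, grads, state, alpha=0.01, beta2=0.9, eps=1e-8):
    """Apply one RMSProp update in place to dicts of scalar parameters.

    params, grads, state are dicts keyed by parameter name ("w", "b").
    state holds S, the running average of squared derivatives.
    """
    for name, g in grads.items():
        # S = beta2 * S + (1 - beta2) * g^2 (element-wise square)
        state[name] = beta2 * state[name] + (1.0 - beta2) * g * g
        # Divide the step by sqrt(S); eps guards against division by ~0.
        params[name] -= alpha * g / (math.sqrt(state[name]) + eps)
    return params, state


def grad(params):
    # Hypothetical loss J(w, b) = 0.5 * (w^2 + 50 * b^2):
    # much steeper in b (vertical) than in w (horizontal).
    return {"w": params["w"], "b": 50.0 * params["b"]}


alpha = 0.05

# RMSProp run.
params = {"w": 5.0, "b": 1.0}
state = {"w": 0.0, "b": 0.0}
for _ in range(500):
    params, state = rmsprop_step(params, grad(params), state, alpha=alpha)

# Plain gradient descent with the same learning rate, for comparison.
gd_params = {"w": 5.0, "b": 1.0}
for _ in range(500):
    g = grad(gd_params)
    gd_params = {k: gd_params[k] - alpha * g[k] for k in gd_params}

print("RMSProp:", params)
print("Plain GD:", gd_params)
```

With these assumed values, plain gradient descent multiplies $b$ by $1 - 0.05 \cdot 50 = -1.5$ on every step, so it oscillates with growing amplitude, while RMSProp stays bounded in both directions. Because each step is divided by the root of the running average of squared derivatives, the step size no longer scales with the steepness of the direction, which is the damping effect the transcript describes.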

Original Description

Take the Deep Learning Specialization: http://bit.ly/2PFq843 Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai


