Learning Rate Decay (C2W2L09)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

The video discusses learning rate decay, a technique to speed up learning algorithms by slowly reducing the learning rate over time, and provides examples of how to implement it, including exponential decay and discrete staircase decay.

Full Transcript

One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this.

Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch, maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy. They will tend toward this minimum over here, but they won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate alpha, then during the initial phases, while your learning rate alpha is still large, you can still have relatively fast learning. Then as alpha gets smaller, the steps you take will be slower and smaller, and so you end up oscillating in a tighter region around this minimum rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that maybe during the initial steps of learning you can afford to take much bigger steps, but then as learning approaches convergence, having a slower learning rate allows you to take smaller steps.

So here's how you can implement learning rate decay. Recall that one epoch is one pass through the data. So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, the second pass is the second epoch, and so on. So one thing you could do is set your learning rate alpha to be equal to 1 over 1 plus a parameter, which I'm going to call the decay rate, times the epoch num, and this is going to be times some initial learning rate alpha zero. That is, alpha = 1 / (1 + decay_rate × epoch_num) × alpha_0. Note that the decay rate here becomes another hyperparameter, which you might need to tune.

So here's a concrete example. If you take several epochs, so several passes through your data, if alpha zero is equal to 0.2 and the decay rate is equal to 1, then during your first epoch alpha will be 1 over 1 plus 1, times alpha zero, so your learning rate will be 0.1. That's just evaluating this formula when the decay rate is equal to 1 and the epoch num is 1. On the second epoch your learning rate decays to 0.067, on the third 0.05, on the fourth 0.04, and so on. Feel free to evaluate more of these values yourself and get a sense that, as a function of your epoch number, your learning rate gradually decreases according to this formula up on top. So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyperparameter alpha zero and this decay rate hyperparameter, and then try to find a value that works well.

Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, to the power of epoch num, times alpha zero. That is, alpha = 0.95^epoch_num × alpha_0. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha equals some constant k over the square root of epoch num, times alpha zero, or some constant k, another hyperparameter, over the square root of the mini-batch number t, times alpha zero. And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half again, and after a while by one half again. And so this is a discrete staircase.

So far we've talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay. If you're training just one model at a time, and if your model takes many hours or even many days to train, what some people will do is just watch the model as it's training over a large number of days and then manually say, oh, it looks like the learning has slowed down, I'm going to decrease alpha a little bit. Of course, this manually controlling alpha, really tuning alpha by hand, hour by hour or day by day, works only if you're training only a small number of models, but sometimes people do that as well.

So now you have a few more options for how to control the learning rate alpha. Now, in case you're thinking, wow, this is a lot of hyperparameters, how do I select amongst all these different options, I would say don't worry about it for now. Next week we'll talk more about how to systematically choose hyperparameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting alpha, just a fixed value of alpha, and getting that to be well tuned has a huge impact. Learning rate decay does help. Sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyperparameter tuning, you'll see more systematic ways to organize all of these hyperparameters and how to efficiently search amongst them.

So that's it for learning rate decay. Finally, I also want to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're trying to train these neural networks. Let's go on to the next video to see that.
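The decay schedules described in the transcript can be sketched as small Python functions. This is an illustrative sketch, not code from the lecture: the function names, the `base` and `k` parameter names, and the staircase parameters `drop_every` and `factor` are assumptions introduced here (the lecture only says the rate is halved "after a while"). Epoch numbers are assumed to start at 1.

```python
import math


def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = 1 / (1 + decay_rate * epoch_num) * alpha0."""
    return alpha0 / (1.0 + decay_rate * epoch_num)


def exponential_decay(alpha0, base, epoch_num):
    """alpha = base ** epoch_num * alpha0, with base < 1 (e.g. 0.95)."""
    return (base ** epoch_num) * alpha0


def sqrt_decay(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0; epoch_num must be >= 1."""
    return (k / math.sqrt(epoch_num)) * alpha0


def staircase_decay(alpha0, epoch_num, drop_every=10, factor=0.5):
    """Multiply alpha by `factor` once every `drop_every` epochs.

    `drop_every` and `factor` are hypothetical knobs chosen for this sketch.
    """
    return alpha0 * factor ** (epoch_num // drop_every)


# Reproduce the worked example: alpha0 = 0.2, decay_rate = 1.
for epoch in range(1, 5):
    print(epoch, round(inverse_time_decay(0.2, 1.0, epoch), 3))
# prints 0.1, 0.067, 0.05, 0.04 for epochs 1 through 4
```

The same `sqrt_decay` function could be called with the mini-batch number t in place of `epoch_num` to get the per-mini-batch variant mentioned in the transcript.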

Original Description

Take the Deep Learning Specialization: http://bit.ly/2Tx69W7 Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

The video teaches learning rate decay, a technique that can speed up learning algorithms by gradually reducing the learning rate, and gives implementation examples including exponential decay and discrete staircase decay. It also notes that the decay rate is one more hyperparameter to tune and defers systematic hyperparameter tuning to the following week.

Key Takeaways

Implement mini-batch gradient descent
Set the initial learning rate alpha zero
Choose a decay rate
Apply a decay schedule, such as exponential decay or discrete staircase decay
Tune the hyperparameters
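The steps above can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions, not the lecture's code: the one-parameter linear model, the synthetic data (y = 3x plus noise), the function name `train`, and all default values are hypothetical choices made for illustration. It uses the 1 / (1 + decay_rate × epoch_num) schedule, recomputing alpha once per epoch.

```python
import random


def train(alpha0=0.2, decay_rate=1.0, num_epochs=20, batch_size=64, seed=0):
    """Fit y = w * x with mini-batch gradient descent and learning rate decay."""
    rng = random.Random(seed)

    # Hypothetical toy data: y = 3x plus a little Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
    data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

    w = 0.0
    history = []  # (epoch_num, alpha, w) after each epoch
    for epoch_num in range(1, num_epochs + 1):
        # Decay the learning rate as a function of the epoch number.
        alpha = alpha0 / (1.0 + decay_rate * epoch_num)

        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= alpha * grad

        history.append((epoch_num, alpha, w))
    return w, history


w, history = train()
print("learned w:", w)
```

With these settings the early epochs take larger steps and the later epochs take smaller ones, so the estimate of `w` should settle close to 3. Whether decay helps on a real model depends on how `alpha0` and `decay_rate` are tuned.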

💡 Learning rate decay can help speed up training: a larger learning rate early on allows big steps, and a smaller one later lets the algorithm settle into a tighter region around the minimum instead of wandering around it.
