Learning Rate Decay (C2W2L09)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Key Takeaways

The video discusses learning rate decay, a technique to speed up learning algorithms by slowly reducing the learning rate over time, and provides examples of how to implement it, including exponential decay and discrete staircase decay.

Full Transcript

One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this.

Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch, maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy, and it will tend toward this minimum over here, but it won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate alpha, then during the initial phases, while your learning rate alpha is still large, you can still have relatively fast learning. Then as alpha gets smaller, the steps you take will be slower and smaller, and so you end up oscillating in a tighter region around this minimum rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that maybe during the initial steps of learning you can afford to take much bigger steps, but then as learning approaches convergence, having a slower learning rate allows you to take smaller steps.

So here's how you can implement learning rate decay. Recall that one epoch is one pass through the data. So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, the second pass is the second epoch, and so on. So one thing you could do is set your learning rate alpha to be equal to 1 over 1 plus a parameter, which I'm going to call the decay rate, times the epoch num, and this is going to be times some initial learning rate alpha 0. That is, alpha = (1 / (1 + decay_rate * epoch_num)) * alpha_0. Note that the decay rate here becomes another hyperparameter, which you might need to tune.

So here's a concrete example. If you take several epochs, so several passes through your data, if alpha 0 is equal to 0.2 and the decay rate is equal to 1, then during your first epoch alpha will be 1 over 1 plus 1, times alpha 0, so your learning rate will be 0.1. That's just evaluating this formula when the decay rate is equal to 1 and the epoch num is 1. On the second epoch your learning rate decays to 0.067, on the third 0.05, on the fourth 0.04, and so on. Feel free to evaluate more of these values yourself and get a sense that, as a function of your epoch number, your learning rate gradually decreases according to this formula up on top. So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyperparameter alpha 0 as well as this decay rate hyperparameter, and then try to find a value that works well.

Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, to the power of epoch num, times alpha 0. That is, alpha = 0.95^epoch_num * alpha_0. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha equals some constant over the square root of epoch num, times alpha 0, or some constant k, another hyperparameter, over the square root of the mini-batch number t, times alpha 0. And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half, after a while by one half. And so this is a discrete staircase.

So far we've talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay. If you're training just one model at a time, and if your model takes many hours or even many days to train, what some people will do is just watch the model as it's training over a large number of days and then manually say, "Oh, it looks like the learning has slowed down, I'm going to decrease alpha a little bit." Of course, this manually controlling alpha, really tuning alpha by hand, hour by hour or day by day, works only if you're training only a small number of models, but sometimes people do that as well.

So now you have a few more options for how to control the learning rate alpha. Now, in case you're thinking, "Wow, this is a lot of hyperparameters. How do I select amongst all these different options?", I would say don't worry about it for now. Next week we'll talk more about how to systematically choose hyperparameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting alpha, just a fixed value of alpha, and getting that to be well tuned has a huge impact. Learning rate decay does help. Sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyperparameter tuning, you'll see more systematic ways to organize all of these hyperparameters and how to efficiently search amongst them.

So that's it for learning rate decay. Finally, I also want to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're trying to train these neural networks. Let's go on to the next video to see that.
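
The decay schedules described in the transcript can be written down as small functions. The following is a minimal sketch, not code from the video: the function names, the `base` and `k` arguments, and the `epochs_per_drop` parameter of the staircase schedule are assumptions introduced for illustration. The lecture only says that the staircase halves the rate "after a while", so the fixed interval here is one possible reading.

```python
import math


def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)


def exponential_decay(alpha0, base, epoch_num):
    """alpha = base ** epoch_num * alpha0, with 0 < base < 1 (e.g. 0.95)."""
    return (base ** epoch_num) * alpha0


def sqrt_decay(alpha0, k, step):
    """alpha = k / sqrt(step) * alpha0.

    `step` is an epoch number or a mini-batch number t, counted from 1.
    """
    return k / math.sqrt(step) * alpha0


def staircase_decay(alpha0, epoch_num, epochs_per_drop):
    """Halve alpha once every `epochs_per_drop` epochs (assumed fixed interval)."""
    return alpha0 * 0.5 ** (epoch_num // epochs_per_drop)


if __name__ == "__main__":
    # Worked example from the lecture: alpha0 = 0.2, decay_rate = 1.
    for epoch_num in range(1, 5):
        alpha = inverse_time_decay(0.2, 1.0, epoch_num)
        print(epoch_num, round(alpha, 3))
```

Running the loop at the bottom reproduces the lecture's worked example for epochs 1 through 4. In each schedule alpha 0 and the extra constant (decay rate, base, k, or drop interval) are hyperparameters that would still need tuning.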

Original Description

Take the Deep Learning Specialization: http://bit.ly/2Tx69W7
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai


The video teaches learning rate decay, a technique that can speed up training by gradually reducing the learning rate, and gives example schedules including exponential decay and discrete staircase decay. It also mentions manual decay and notes that systematic hyperparameter tuning is covered the following week.

Key Takeaways
  1. Implement mini-batch gradient descent
  2. Set an initial learning rate alpha 0
  3. Choose a decay rate
  4. Apply a decay schedule such as 1 / (1 + decay rate × epoch number), exponential decay, or discrete staircase decay
  5. Tune the hyperparameters (alpha 0 and the decay rate)
💡 Learning rate decay can help speed up training: large early steps make fast progress, and smaller later steps let the algorithm oscillate in a tighter region around the minimum.
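
The steps above can be sketched as a small training loop. This is an illustrative example under stated assumptions rather than the lecture's implementation: the function name `train_with_decay`, the one-parameter model y = w * x, the synthetic data, and the default values (`alpha0=1.0`, `decay_rate=1.0`, `batch_size=64`) are all made up for the demonstration. The only part taken from the lecture is the per-epoch schedule alpha = alpha0 / (1 + decay_rate * epoch_num).

```python
import random


def train_with_decay(xs, ys, alpha0=1.0, decay_rate=1.0, num_epochs=20,
                     batch_size=64, seed=0):
    """Fit y ~ w * x with mini-batch gradient descent and per-epoch decay."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    history = []
    for epoch_num in range(1, num_epochs + 1):
        # Learning rate for this epoch, following the lecture's formula.
        alpha = alpha0 / (1.0 + decay_rate * epoch_num)
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean of 0.5 * (w * x - y) ** 2 over the mini-batch.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
        history.append((epoch_num, alpha, w))
    return w, history


if __name__ == "__main__":
    # Hypothetical toy data: y = 3x plus a little Gaussian noise.
    data_rng = random.Random(1)
    xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1024)]
    ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]
    w, history = train_with_decay(xs, ys)
    print("learned w:", w)
    print("first and last alpha:", history[0][1], history[-1][1])
```

Early epochs use a large alpha and move w quickly toward the target, while later epochs use a smaller alpha, so the mini-batch noise perturbs w less. Tuning would mean repeating the run over several values of `alpha0` and `decay_rate`.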
