Learning Rate Decay (C2W2L09)
Key Takeaways
The video discusses learning rate decay, a technique to speed up learning algorithms by slowly reducing the learning rate over time, and provides examples of how to implement it, including exponential decay and discrete staircase decay.
Full Transcript
One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this.
Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch, maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy, and it will tend toward this minimum over here, but it won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate alpha, then during the initial phases, while your learning rate alpha is still large, you can still have relatively fast learning. But then as alpha gets smaller, the steps you take will be slower and smaller, and so you end up oscillating in a tighter region around this minimum, rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that maybe during the initial steps of learning you can afford to take much bigger steps, but then as learning approaches convergence, having a slower learning rate allows you to take smaller steps.
So here's how you can implement learning rate decay. Recall that one epoch is one pass through the data. So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, the second pass is the second epoch, and so on. So one thing you could do is set your learning rate alpha to be equal to one over one plus a parameter, which I'm going to call the decay rate, times the epoch num, and this is going to be times some initial learning rate alpha zero. Note that the decay rate here becomes another hyperparameter, which you might need to tune.
So here's a concrete example. If you take several epochs, so several passes through your data, if alpha zero is equal to 0.2 and the decay rate is equal to 1, then during your first epoch alpha will be 1 over 1 plus 1, times alpha zero, so your learning rate will be 0.1. That's just evaluating this formula when the decay rate is equal to 1 and the epoch num is 1. On the second epoch your learning rate decays to 0.067, on the third 0.05, on the fourth 0.04, and so on. Feel free to evaluate more of these values yourself and get a sense that, as a function of your epoch number, your learning rate gradually decreases according to this formula up on top. So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyperparameter alpha zero as well as this decay rate hyperparameter, and then try to find a value that works well.
Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, to the power of epoch num, times alpha zero. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha equals some constant over the square root of epoch num, times alpha zero, or some constant k, another hyperparameter, over the square root of the mini-batch number t, times alpha zero. And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half, after a while by one half, and so this is a discrete staircase.
So far we've talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay. If you're training just one model at a time, and if your model takes many hours or even many days to train, what some people will do is just watch the model as it's training over a large number of days, and then manually say, oh, it looks like the learning rate slowed down, I'm going to decrease alpha a little bit. Of course, manually controlling alpha like this, really tuning alpha by hand, hour by hour or day by day, works only if you're training only a small number of models, but sometimes people do that as well.
So now you have a few more options for how to control the learning rate alpha. Now, in case you're thinking, wow, this is a lot of hyperparameters, how do I select amongst all these different options, I would say don't worry about it for now. Next week we'll talk more about how to systematically choose hyperparameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting alpha, just a fixed value of alpha, and getting that to be well tuned has a huge impact. Learning rate decay does help, sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyperparameter tuning, you'll see more systematic ways to organize all of these hyperparameters and how to efficiently search amongst them.
So that's it for learning rate decay. Finally, I also want to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're trying to train these neural networks. Let's go on to the next video to see that.
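The decay schedules described in the transcript can be sketched in a few lines of Python. This is a minimal illustration, not code from the video: the function names, the `drop_every` parameter of the staircase schedule, and the choice of halving per epoch count are assumptions made for the example. The loop at the end reproduces the worked example with alpha zero equal to 0.2 and decay rate equal to 1.

```python
import math


def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)


def exponential_decay(alpha0, base, epoch_num):
    """alpha = base ** epoch_num * alpha0, with base < 1 (e.g. 0.95)."""
    return (base ** epoch_num) * alpha0


def sqrt_decay(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0; assumes epoch_num >= 1."""
    return k / math.sqrt(epoch_num) * alpha0


def staircase_decay(alpha0, epoch_num, drop_every, factor=0.5):
    """Halve alpha every `drop_every` epochs (hypothetical parameter name)."""
    return alpha0 * factor ** (epoch_num // drop_every)


# Worked example from the transcript: alpha0 = 0.2, decay_rate = 1.
for epoch_num in range(1, 5):
    alpha = inverse_time_decay(0.2, 1.0, epoch_num)
    print(epoch_num, round(alpha, 3))
# prints:
# 1 0.1
# 2 0.067
# 3 0.05
# 4 0.04
```

In a training loop, one would typically recompute alpha at the start of each epoch with one of these functions and pass it to the parameter update; which schedule works best, and with which hyperparameter values, is something to tune.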
Original Description
Take the Deep Learning Specialization: http://bit.ly/2Tx69W7
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch