Vanishing/Exploding Gradients (C2W1L10)
Key Takeaways
The video discusses the problem of vanishing and exploding gradients in deep neural networks, and how careful initialization of weights can help alleviate this issue. It covers the mathematical explanation of how gradients can explode or vanish exponentially with the number of layers, and the impact on training deep networks.
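The mathematical argument can be written compactly as follows. This is a sketch under the video's simplifying assumptions (linear activation, zero biases, every hidden weight matrix a scalar multiple of the identity). The scalar c is shorthand introduced here, not notation from the video; it stands for the 1.5 or 0.5 used in the lecture. The last line writes out the gradient version, which the lecture only states in words.

```latex
\begin{align*}
a^{[l]} &= g\big(z^{[l]}\big) = z^{[l]} = W^{[l]} a^{[l-1]},
  \qquad a^{[0]} = x, \quad b^{[l]} = 0 \\
\hat{y} &= W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x \\
W^{[l]} = c\,I \ \ (l < L) \quad &\Rightarrow \quad \hat{y} = c^{\,L-1}\, W^{[L]} x \\
\frac{\partial \hat{y}}{\partial a^{[l]}} &= W^{[L]} W^{[L-1]} \cdots W^{[l+1]} = c^{\,L-1-l}\, W^{[L]}
\end{align*}
```

With c > 1 both the output and the gradient grow exponentially in the number of layers; with c < 1 both shrink exponentially.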
Full Transcript
One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives, or your slopes, can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video you'll see what this problem of exploding or vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.
Let's say you're training a very deep neural network like this. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters W[1], W[2], W[3], and so on up to W[L]. For the sake of simplicity, let's say we're using an activation function g(z) = z, so a linear activation function, and let's ignore b; let's say b[l] = 0. In that case you can show that the output ŷ will be W[L] times W[L-1] times W[L-2], and so on, down to W[3] W[2] W[1] times x.
If you want to check my math: W[1] times x is going to be z[1], because b is equal to 0. So z[1] = W[1] x, plus b, which is 0. Then a[1] = g(z[1]), but because you're using a linear activation function, this is just equal to z[1]. So this first term, W[1] x, is equal to a[1]. By the same reasoning you can figure out that W[2] W[1] x is equal to a[2], because that's going to be g(z[2]), which is g(W[2] a[1]), and you can plug a[1] in here. So this thing is going to be equal to a[2], then this thing is going to be a[3], and so on, until the product of all these matrices gives you ŷ, not y.
Now let's say that each of your weight matrices W[l] is just a little bit larger than one times the identity, so it's [[1.5, 0], [0, 1.5]]. Technically the last one has different dimensions, so maybe this is just the rest of these weight matrices. Then ŷ, ignoring that last one with its different dimensions, will be this [[1.5, 0], [0, 1.5]] matrix to the power of L - 1, times x. If we assume that each one of these matrices is equal to this thing, which is really 1.5 times the identity matrix, then you end up with this calculation. So ŷ will be essentially 1.5 to the power of L - 1, times x, and if L is large, for a very deep neural network, ŷ will be very large. In fact, this grows exponentially; it grows like 1.5 to the number of layers. So if you have a very deep neural network, the value of ŷ will explode.
Conversely, if we replace this with 0.5, so something less than one, then this becomes 0.5 to the power of L. This matrix becomes 0.5 to the power of L - 1, times x, again ignoring W[L]. So if each of your matrices is less than one, then, let's say x1 and x2 were 1, 1, the activations would be 1/2, 1/2, then 1/4, 1/4, then 1/8, 1/8, and so on, until this becomes 1 over 2 to the L. So the activation values will decrease exponentially as a function of the depth, as a function of the number of layers L of the network. In a very deep network, these activations end up decreasing exponentially.
So the intuition I hope you can take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, so maybe this is [[0.9, 0], [0, 0.9]], then with a very deep network the activations will decrease exponentially. Even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives, or the gradients, that you compute will also increase exponentially or decrease exponentially as a function of the number of layers.
With some of the modern neural networks, you actually have L equals 150. Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values could get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small in L, gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything.
To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but that helps a lot, which is a careful choice of how you initialize the weights. To see that, let's go on to the next video.
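The transcript's argument can be checked numerically. The following is a minimal sketch, not code from the course: it assumes two units per layer, a linear activation, zero biases, and every hidden weight matrix equal to `scale` times the identity. The function names `forward_activations` and `backward_gradient` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def forward_activations(scale, num_layers, x):
    """Run x through num_layers - 1 linear layers with W = scale * I and b = 0.

    Returns the list of activations, one entry per hidden layer.
    """
    W = scale * np.eye(x.shape[0])  # assumed stand-in for every hidden W[l]
    a = x
    activations = []
    for _ in range(num_layers - 1):
        a = W @ a  # z = W a + 0, and g(z) = z
        activations.append(a)
    return activations


def backward_gradient(scale, num_layers, upstream):
    """Backpropagate `upstream` (the gradient with respect to the last hidden
    activation) to the input through the same num_layers - 1 linear layers."""
    W = scale * np.eye(upstream.shape[0])
    grad = upstream
    for _ in range(num_layers - 1):
        grad = W.T @ grad  # chain rule through one linear layer
    return grad


x = np.ones(2)  # the x1 = x2 = 1 example from the lecture
for scale in (1.5, 0.5):
    for num_layers in (5, 50, 150):
        a_last = forward_activations(scale, num_layers, x)[-1]
        g_input = backward_gradient(scale, num_layers, np.ones(2))
        print(f"scale={scale} L={num_layers:3d} "
              f"|a|={np.linalg.norm(a_last):.3e} "
              f"|grad|={np.linalg.norm(g_input):.3e}")
```

With `scale=0.5` the first three activations are 1/2, 1/4 and 1/8 in each component, matching the sequence in the lecture. The backward pass multiplies by the same matrices, which is why the gradient norm follows the same exponential trend as the activations.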
Original Description
Take the Deep Learning Specialization: http://bit.ly/2vzq1jp
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai