Vanishing/Exploding Gradients (C2W1L10)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

The video discusses the problem of vanishing and exploding gradients in deep neural networks, and how careful initialization of weights can help alleviate this issue. It covers the mathematical explanation of how gradients can explode or vanish exponentially with the number of layers, and the impact on training deep networks.

Full Transcript

One of the problems with training neural networks, especially very deep neural networks, is that of vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video you'll see what this problem of exploding or vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.

Let's say you're training a very deep neural network like this. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters W[1], W[2], W[3] and so on up to W[L]. For the sake of simplicity, let's say we're using an activation function g(z) = z, so a linear activation function, and let's ignore b, so let's say b[l] equals 0. In that case you can show that the output y will be W[L] times W[L-1] times W[L-2], dot dot dot, down to W[3] W[2] W[1] times x.

If you want to check my math: W[1] times x is going to be z[1], because b is equal to 0, so z[1] is equal to W[1] times x plus b, which is 0. Then a[1] is equal to g(z[1]), but because we use a linear activation function, this is just equal to z[1]. So this first term, W[1] x, is equal to a[1]. By similar reasoning you can figure out that W[2] times W[1] times x is equal to a[2], because that's going to be g(z[2]), which is g(W[2] times a[1]), and you can plug a[1] in here. So this thing is going to be equal to a[2], then this thing is going to be a[3], and so on, until the product of all these matrices gives you y hat, not y.

Now let's say that each of your weight matrices W[l] is just a little bit larger than one times the identity, so it's the 2 by 2 matrix with 1.5 on the diagonal and 0 elsewhere. Technically the last one has different dimensions, so maybe this is just the rest of these weight matrices. Then y hat will be, ignoring that last one with different dimensions, this 1.5, 0, 0, 1.5 matrix to the power of L minus 1, times x, because we assume that each one of these matrices is equal to this thing, which is really 1.5 times the identity matrix. So y hat will be essentially 1.5 to the power of L minus 1, times x, and if L is large, for a very deep neural network, y hat will be very large. In fact this grows exponentially; it grows like 1.5 to the number of layers. So if you have a very deep neural network, the value of y hat will explode.

Now conversely, if we replace this with 0.5, so something less than one, then this matrix becomes 0.5 to the power of L minus 1, times x, again ignoring W[L]. So if each of your matrices is less than one, then if, let's say, x1 and x2 were 1, 1, the activations would be 1/2, 1/2, then 1/4, 1/4, then 1/8, 1/8, and so on, until this becomes 1 over 2 to the L. So the activation values will decrease exponentially as a function of the depth, as a function of the number of layers L of the network. In a very deep network, these activations end up decreasing exponentially.

So the intuition I hope you can take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, so maybe this is 0.9, 0.9, then with a very deep network the activations will decrease exponentially. Even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives or the gradients you compute will also increase exponentially or decrease exponentially as a function of the number of layers.

With some of the modern neural networks, you actually have L equals 150. Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values could get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small in L, then gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything.

To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but that helps a lot, which is a careful choice of how you initialize the weights. To see that, let's go on to the next video.
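The argument in the transcript can be checked numerically. The following is a minimal sketch, assuming NumPy and the same simplifications the lecture makes: a linear activation g(z) = z, zero biases, two units per layer, and every hidden weight matrix equal to a scalar times the identity. The function names `forward_linear` and `backward_linear` are illustrative and do not come from the course; the last layer W[L] is ignored, as in the lecture.

```python
import numpy as np


def forward_linear(scale, num_layers, x):
    """Activations of a linear network whose first num_layers - 1 weight
    matrices all equal scale * I, with g(z) = z and b = 0.

    Returns the list [a0, a1, ..., a_{L-1}], where a0 is the input x.
    """
    W = scale * np.eye(x.shape[0])
    a = x
    activations = [a]
    for _ in range(num_layers - 1):
        a = W @ a  # z = W a + 0, and a = g(z) = z
        activations.append(a)
    return activations


def backward_linear(scale, num_layers, grad_out):
    """Push a gradient back through the same num_layers - 1 linear layers.

    For a linear layer a_next = W a, the gradient with respect to a is
    W^T times the gradient with respect to a_next.
    """
    W = scale * np.eye(grad_out.shape[0])
    g = grad_out
    for _ in range(num_layers - 1):
        g = W.T @ g
    return g


if __name__ == "__main__":
    x = np.array([1.0, 1.0])
    L = 150  # depth mentioned in the lecture

    for scale in (1.5, 0.9, 0.5):
        last_activation = forward_linear(scale, L, x)[-1]
        input_gradient = backward_linear(scale, L, np.ones(2))
        print(f"scale={scale}: activation={last_activation[0]:.3e}, "
              f"gradient={input_gradient[0]:.3e}")
```

With a scale above 1 the printed values grow like scale to the power L - 1, and with a scale below 1 they shrink at the same rate, both for the forward activations and for the back-propagated gradient. Real networks use nonlinear activations and non-diagonal weights, so this only illustrates the exponential trend, not the behaviour of any particular model.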

Original Description

Take the Deep Learning Specialization: http://bit.ly/2vzq1jp Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai



Key Takeaways
  1. Understand the concept of vanishing and exploding gradients
  2. Recognize the importance of weight initialization in deep learning
  3. Apply weight initialization techniques to mitigate vanishing and exploding gradients
  4. Implement deep neural networks using gradient descent
💡 Careful weight initialization can help alleviate the problem of vanishing and exploding gradients in deep neural networks.
