Understanding Dropout (C2W1L07)

DeepLearningAI · Beginner ·🧬 Deep Learning ·8y ago

Key Takeaways

The video discusses dropout in neural networks: why it works as a regularizer, how it behaves like an adaptive form of L2 regularization, how the keep probability can be varied by layer, and when it is worth using.

Full Transcript

Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition. In the previous video, I gave this intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

Here's the second intuition: let's look at it from the perspective of a single unit. Let's say this one. For this unit to do its job, it has four inputs and it needs to generate some meaningful output. Now with dropout, the inputs can get randomly eliminated. Sometimes those two units will get eliminated, sometimes a different unit will get eliminated. So what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random, or any one of its own inputs could go away at random. So in particular, it would be reluctant to put all of its bets on, say, just this input. The weights would be reluctant to put too much weight on any one input, because it could go away. So this unit would be more motivated to spread out the weights and give a little bit of weight to each of the four inputs to this unit. By spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights. And so, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and, similar to L2 regularization, it helps to prevent overfitting.

It turns out that dropout can formally be shown to be an adaptive form of L2 regularization, but the L2 penalty on different weights is different, depending on the size of the activations being multiplied into that weight. To summarize, it is possible to show that dropout has a similar effect to L2 regularization, only the L2 regularization applied to different weights can be a little bit different and even more adaptive to the scale of different inputs.

One more detail when you're implementing dropout. Here's a network where you have three input features, then seven hidden units, seven, three, two, one. One of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. So for the first layer, your matrix W1 will be 3 by 7, your second weight matrix will be 7 by 7, W3 will be 7 by 3, and so on. And so W2 is actually the biggest weight matrix; the largest set of parameters will be in W2, which is 7 by 7. So to reduce overfitting of that matrix, maybe for this layer, I guess this is layer 2, you might have a keep_prob that's relatively low, say 0.5, whereas for different layers where you might worry less about overfitting, you could have a higher keep_prob, maybe 0.7. And for the layers where we don't worry about overfitting at all, you can have a keep_prob of 1.0. For clarity, these are numbers I'm drawing in the purple boxes; these could be different keep_probs for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, and so you're really not using dropout for that layer. But for the layers where you're more worried about overfitting, really the layers with a lot of parameters, you could set keep_prob to be smaller to apply a more powerful form of dropout. It's kind of like cranking up the regularization parameter lambda of L2 regularization when you try to regularize some layers more than others.

Technically, you can also apply dropout to the input layer, where you can have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that that often. And so a keep_prob of 1.0 is quite common for the input layer. You might also use a very high value, maybe 0.9, but it's much less likely that you'd want to eliminate half of the input features. So usually keep_prob, if you apply dropout to the input layer at all, will be a number close to 1.

So just to summarize, if you are more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't apply dropout, and then just have one hyperparameter, which is the keep_prob for the layers for which you do apply dropout.

Before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, inputting all these pixels, that you almost never have enough data. And so dropout is very frequently used in computer vision, and there are some computer vision researchers that pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. And so unless my algorithm is overfitting, I wouldn't actually bother to use dropout. So it's used somewhat less often in other application areas. It's just that in computer vision you usually just don't have enough data, so you're almost always overfitting, which is why there tend to be some computer vision researchers who swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines.

One big downside of dropout is that the cost function J is no longer well defined. On every iteration, you are randomly killing off a bunch of nodes. And so if you are double-checking the performance of gradient descent, it is actually harder to double-check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well defined, or it's certainly hard to calculate. So you lose the debugging tool of plotting a graph like this. So what I usually do is turn off dropout, that is, set keep_prob equal to 1, run my code, and make sure that it is monotonically decreasing J, and then turn on dropout and hope that I didn't introduce bugs into my code during dropout. Because you need other ways, I guess, besides plotting these figures, to make sure that your code is working, that gradient descent is working, even with dropout.

So with that, there are still a few more regularization techniques that are worth your knowing. Let's talk about a few more such techniques in the next video.
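
The per-layer keep_prob idea from the lecture can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not code from the video: the function name `dropout_forward`, the ReLU activation, the small random weights, and the inverted-dropout scaling (dividing by keep_prob) are all assumed here. Only the first three weight matrices mentioned in the lecture (3 by 7, 7 by 7, 7 by 3) are modeled, and the assignment of 0.7, 0.5, and 1.0 to the layers is illustrative.

```python
import numpy as np

def dropout_forward(X, weights, biases, keep_probs, rng):
    """Forward pass with a separate keep probability for each layer.

    X has shape (m, n_features); weights[l] has shape (fan_in, fan_out).
    keep_probs[l] is applied to the activations produced by weights[l].
    Returns the final activations and the list of boolean keep masks.
    """
    A = X
    masks = []
    for W, b, keep_prob in zip(weights, biases, keep_probs):
        A = np.maximum(0.0, A @ W + b)           # ReLU activation (assumption)
        mask = rng.random(A.shape) < keep_prob   # True where the unit is kept
        A = A * mask / keep_prob                 # inverted-dropout scaling (assumption)
        masks.append(mask)
    return A, masks

rng = np.random.default_rng(0)

# Layer sizes 3 -> 7 -> 7 -> 3 give W1: 3x7, W2: 7x7, W3: 7x3.
layer_sizes = [3, 7, 7, 3]
weights = [0.1 * rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Lowest keep_prob on layer 2, which sits on the largest matrix (W2);
# 1.0 on layer 3 means no dropout there. Values are illustrative.
keep_probs = [0.7, 0.5, 1.0]

X = rng.standard_normal((5, 3))                  # 5 examples, 3 input features
A, masks = dropout_forward(X, weights, biases, keep_probs, rng)

print("output shape:", A.shape)
print("fraction of units kept per layer:", [float(m.mean()) for m in masks])
```

Setting every entry of `keep_probs` to 1.0 keeps every unit, so the forward pass becomes deterministic; that is the switch described above for debugging.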

Original Description

Take the Deep Learning Specialization: http://bit.ly/2PGxIeE
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai

The video explains dropout in neural networks: how it prevents overfitting, how it acts as an adaptive form of L2 regularization, and how it can be applied differently across layers and scenarios. Dropout is a regularization technique that helps prevent overfitting by randomly knocking out units in a neural network, and its keep probability can be adjusted for different layers. The video also discusses the downsides of dropout, including the loss of a well-d

Key Takeaways
  1. Understand the concept of dropout and its purpose
  2. Implement dropout in a neural network
  3. Adjust the keep probability for different layers
  4. Use cross-validation to find the optimal keep probability
  5. Turn off dropout (set the keep probability to 1) when debugging, so that the cost function J is well defined and can be checked to decrease on every iteration
💡 Dropout can be shown to act as an adaptive form of L2 regularization: it helps prevent overfitting by randomly knocking out units, and its keep probability can be set per layer, with lower values for the layers most prone to overfitting.
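
The debugging tip in takeaway 5 can be illustrated with a small experiment. This is a hedged sketch, not the course's code: it swaps in a linear model with dropout on the input features as a stand-in for a neural network, because for that quadratic cost a step size of 1/L guarantees J never increases when dropout is off. The names `train` and `is_non_increasing`, the synthetic data, and the inverted-dropout scaling are assumptions made for illustration.

```python
import numpy as np

def train(X, y, keep_prob, num_iters, rng):
    """Full-batch gradient descent on a linear model with dropout on the inputs.

    Returns the cost J recorded at every iteration.
    """
    m, n = X.shape
    w = np.zeros(n)
    # Step size 1/L, where L is the largest eigenvalue of X^T X / m. For this
    # quadratic cost, that guarantees J never increases when dropout is off.
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / m).max()
    costs = []
    for _ in range(num_iters):
        mask = rng.random(X.shape) < keep_prob   # fresh random mask each iteration
        X_drop = X * mask / keep_prob            # inverted-dropout scaling (assumption)
        residual = X_drop @ w - y
        costs.append(0.5 * np.mean(residual ** 2))
        w -= lr * (X_drop.T @ residual) / m
    return np.array(costs)

def is_non_increasing(costs, tol=1e-12):
    """True if J never goes up by more than a rounding-error tolerance."""
    return bool(np.all(np.diff(costs) <= tol))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                # synthetic data (assumption)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(200)

costs_off = train(X, y, keep_prob=1.0, num_iters=50, rng=rng)  # dropout off
costs_on = train(X, y, keep_prob=0.5, num_iters=50, rng=rng)   # dropout on

print("dropout off, J non-increasing:", is_non_increasing(costs_off))
print("dropout on,  J non-increasing:", is_non_increasing(costs_on))
```

With keep_prob set to 1 the recorded J is the true cost, so a curve that fails to go down points to a bug. With dropout on, each iteration evaluates J under a different random mask, so the curve is noisy and no longer a reliable check.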
