Understanding Dropout (C2W1L07)
Key Takeaways
The video builds intuition for why dropout works as a regularizer, explains that dropout behaves like an adaptive form of L2 regularization, and covers how to choose keep_prob for different layers, when dropout is worth using, and how it complicates debugging gradient descent.
Full Transcript
Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition. In the previous video, I gave this intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.
Here's the second intuition: let's look at it from the perspective of a single unit. Let's say this one. For this unit to do its job, it has four inputs and needs to generate some meaningful output. Now, with dropout, the inputs can get randomly eliminated. Sometimes those two units will get eliminated, sometimes a different unit will get eliminated. So what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random, or any one of its own inputs could go away at random. So in particular, it would be reluctant to put all of its bets on, say, just this input. The weights would be reluctant to put too much weight on any one input, because it could go away. So this unit would be more motivated to spread out the weights and give a little bit of weight to each of the four inputs to this unit. And by spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights.
So, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and, similar to L2 regularization, helps to prevent overfitting. But it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different depending on the size of the activations being multiplied into that weight. So to summarize, it is possible to show that dropout has a similar effect to L2 regularization, only the L2 regularization applied to different weights can be a little bit different and even more adaptive to the scale of different inputs.
One more detail when you're implementing dropout. Here's a network where you have three input features, then seven hidden units, then seven, three, two, one. One of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. So for the first layer, your matrix W1 will be 3 by 7, your second weight matrix will be 7 by 7, W3 will be 7 by 3, and so on. So W2 is actually the biggest weight matrix, right? The largest set of parameters will be in W2, which is 7 by 7. So to reduce overfitting of that matrix, maybe for this layer, I guess this is layer 2, you might have a keep_prob that's relatively low, say 0.5, whereas for different layers where you might worry less about overfitting, you could have a higher keep_prob, maybe just 0.7. And for the layers where we don't worry about overfitting at all, you can have a keep_prob of 1.0. For clarity, these numbers I'm drawing in the purple boxes could be different keep_prob values for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, so you're really not using dropout for that layer. But for the layers where you're more worried about overfitting, really the layers with a lot of parameters, you could set keep_prob to be smaller to apply a more powerful form of dropout. It's kind of like cranking up the regularization parameter lambda of L2 regularization when you try to regularize some layers more than others.
Technically, you can also apply dropout to the input layer, where you have some chance of dropping one or more of the input features, although in practice we usually don't do that very often. So a keep_prob of 1.0 is quite common for the input layer. You might also use a very high value, such as 0.9, but it's much less likely that you'd want to eliminate half of the input features. So usually keep_prob, if you apply dropout to the input layer at all, will be a number close to 1.
So just to summarize, if you are more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then have just one hyperparameter, which is the keep_prob for the layers where you do apply dropout.
Before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, you're inputting all these pixels, that you almost never have enough data. So dropout is very frequently used in computer vision, and there are some computer vision researchers who pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique: it helps prevent overfitting. So unless my algorithm is overfitting, I wouldn't actually bother to use dropout. It's used somewhat less often in other application areas. It's just that in computer vision you usually don't have enough data, so you're almost always overfitting, which is why there tend to be some computer vision researchers who swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines.
One big downside of dropout is that the cost function J is no longer well defined. On every iteration, you are randomly killing off a bunch of nodes. So if you are double-checking the performance of gradient descent, it is actually harder to check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is less well defined, or it's certainly hard to calculate. So you lose the debugging tool of plotting a graph like this. What I usually do is turn off dropout, or if you will, set keep_prob equal to 1, run my code, and make sure that J is monotonically decreasing. Then I turn on dropout and hope that I didn't introduce bugs into my code during dropout, because you need other ways, I guess, besides plotting these figures, to make sure that your code is working and that gradient descent is working even with dropout.
So with that, there are still a few more regularization techniques that are worth your knowing. Let's talk about a few more such techniques in the next video.
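To make the per-layer keep_prob idea concrete, here is a minimal sketch in plain Python. It is not the lecture's code. The names `LAYER_SIZES`, `KEEP_PROBS`, `dropout`, and `weight_counts` are hypothetical, and the specific keep_prob values are only illustrative of the "lower for the big layer, 1.0 where overfitting is not a concern" pattern described above. The sketch also assumes the common "inverted dropout" convention, in which kept activations are divided by keep_prob so that their expected value is unchanged.

```python
import random

# Hypothetical settings for the 3-7-7-3-2-1 network described in the transcript.
# Index 0 is the input layer. The keep_prob values are illustrative assumptions.
LAYER_SIZES = [3, 7, 7, 3, 2, 1]
KEEP_PROBS = [1.0, 0.7, 0.5, 0.7, 1.0, 1.0]


def dropout(activations, keep_prob, rng):
    """Apply inverted dropout to one layer's activations (a list of floats).

    Each unit is kept with probability keep_prob. Kept units are scaled by
    1 / keep_prob (an assumed convention) so the expected activation is unchanged.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if keep_prob == 1.0:
        # Keeping every unit means dropout is effectively off for this layer.
        return list(activations)
    out = []
    for a in activations:
        keep = rng.random() < keep_prob
        out.append(a / keep_prob if keep else 0.0)
    return out


def weight_counts(layer_sizes):
    """Number of weights between each pair of consecutive layers."""
    return [n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    print("weights per matrix:", weight_counts(LAYER_SIZES))
    for size, keep_prob in zip(LAYER_SIZES, KEEP_PROBS):
        activations = [1.0] * size
        print(size, keep_prob, dropout(activations, keep_prob, rng))
```

Setting every entry of `KEEP_PROBS` to 1.0 turns dropout off entirely, which corresponds to the state the transcript recommends for checking that J decreases on every iteration before switching dropout back on.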
Original Description
Take the Deep Learning Specialization: http://bit.ly/2PGxIeE
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai