Dropout Regularization (C2W1L06)
Key Takeaways
The video introduces dropout regularization, a technique for reducing overfitting in neural networks. It shows how to implement dropout with the inverted dropout technique (masking activations at random and dividing by keep_prob), and explains why no dropout and no extra scaling are applied at test time.
Full Transcript
In addition to L2 regularization, another very powerful regularization technique is called dropout. Let's see how that works. Let's say you train a neural network like the one on the left and it's overfitting. Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the network. So let's say that for each of these layers we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. So after the coin tosses, maybe you decide to eliminate those nodes. Then what you do is actually remove all the incoming and outgoing links from those nodes as well. So you end up with a much smaller, really much diminished network, and then you do backpropagation, training on this one example on this much diminished network. And then on a different example you would toss a set of coins again and keep a different set of nodes, and then drop out, or eliminate, a different set of nodes. And so for each training example you would train it using one of these reduced networks.

So maybe it seems like a slightly crazy technique to just go around knocking out nodes at random, but this actually works. And you can imagine that because you're training a much smaller network on each example, maybe this gives a sense of why you end up able to regularize the network, because these much smaller networks are being trained.

So let's look at how you can implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l = 3. So in the code I'm going to write there will be a bunch of 3s here; I'm just illustrating how to implement dropout in a single layer. So what we're going to do is set a vector d3; d3 is going to be the dropout vector for layer 3. d3 is going to be np.random.rand, with the same shape as a3, and I'm going to see if this is less than some number, which I'm going to call keep_prob. And so keep_prob is a number. It was 0.5 on the previous slide, and maybe now I'll use 0.8 in this example. It will be the probability that a given hidden unit will be kept. So if keep_prob is equal to 0.8, then this means that there's a 0.2 chance of eliminating any hidden unit. So what this does is generate a random matrix, and this works as well if you have vectorized. So d3 will be a matrix where, for each example and each hidden unit, there's a 0.8 chance that the corresponding entry of d3 will be 1 and a 20% chance that it will be 0. So with this random number being less than 0.8, there's a 0.8 chance of it being 1, or being true, and a 20%, or 0.2, chance of it being false, or being 0.

And then what you're going to do is take your activations from the third layer, which I'll just call a3 in this little example. So a3 holds the activations you computed, and I'm going to set a3 to be equal to the old a3 times d3. This is an element-wise multiplication, or you could also write this as a3 *= d3. What this does is, for every element of d3 that's equal to 0, and there's a 20% chance of each of the elements being 0, this multiply operation ends up zeroing out the corresponding element of a3. If you do this in Python, technically d3 will be a boolean array with values True and False rather than 1 and 0, but the multiply operation works and will interpret the True and False values as 1 and 0. If you try this yourself in Python, you'll see.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter. So let me explain what this final step is doing. Let's say, for the sake of argument, that you have 50 units, or 50 neurons, in the third hidden layer. So maybe a3 is 50 by 1 dimensional, or, if you're vectorizing, maybe it's 50 by m dimensional. So if you have an 80% chance of keeping them and a 20% chance of eliminating them, this means that on average you end up with 10 units shut off, or 10 units zeroed out. And so now if you look at the value of z4, z4 is going to be equal to W4 times a3 plus b4. And so, in expectation, a3 will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So in order to not reduce the expected value of z4, what you do is take a3 and divide it by 0.8, because this will correct, or just bump it back up by, the roughly 20% that you need, so that the expected value of a3 is not changed.

And so this line here is what's called the inverted dropout technique. Its effect is that no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because you're keeping everything) or 0.5 or whatever, this inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same. And it turns out that at test time, when you're trying to evaluate your network, which we'll talk about on the next slide, this inverted dropout technique, this line that I've drawn the green box around, makes test time easier because you have less of a scaling problem. By far the most common implementation of dropout today, as far as I know, is inverted dropout, and I recommend you just implement this. But there were some early iterations of dropout that missed this divide by keep_prob line, and so at test time the averaging became more and more complicated. But again, people tend not to use those other versions.

So what you do is you use the d vector, and you notice that for different training examples you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set you should randomly zero out different hidden units. So it's not that for one example you should keep zeroing out the same hidden units; it's that on iteration one of gradient descent you might zero out some hidden units, and on the second iteration of gradient descent, where you go through the training set a second time, maybe you zero out a different pattern of hidden units. And the vector d, or d3 for the third layer, is used to decide what to zero out both in forward prop as well as in backprop. I'm just showing forward prop here.

Now, having trained the algorithm, at test time here's what you would do. At test time you're given some x on which you want to make a prediction, and, using our standard notation, I'm going to use a0, the activations of the zeroth layer, to denote this test example x. So what we're going to do is not use dropout at test time. In particular, we're going to set z1 = W1 a0 + b1, a1 = g1(z1), z2 = W2 a1 + b2, a2 = g2(z2), and so on, until you get to the last layer and you make a prediction y-hat. But notice that at test time you're not using dropout explicitly, and you're not tossing coins at random; you're not flipping coins to decide which hidden units to eliminate. And that's because when you're making predictions at test time, you don't really want your output to be random. If you were implementing dropout at test time, that would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and then average across them, but that's computationally inefficient and gives you roughly the same result, a very similar result, to this procedure as well.

And just to mention the inverted dropout step on the previous slide, where we divided by keep_prob: the effect of that was to ensure that even when you don't implement dropout at test time, with no scaling, the expected value of these activations doesn't change. So you don't need to add in an extra funny scaling parameter at test time that's different from what you had at training time.

So that's dropout, and when you implement this in this week's programming exercise you'll gain more firsthand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout really is doing. Let's go on to the next video.
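
The inverted dropout steps described in the transcript (build a random mask for layer 3, multiply it into the activations, then divide by keep_prob) can be sketched in NumPy roughly as follows. This is a minimal illustration under stated assumptions, not the course's reference code: the function names dropout_forward and dropout_backward, the all-ones stand-in activations, and the seed are hypothetical, and it uses NumPy's Generator API (np.random.default_rng) in place of the np.random.rand call mentioned in the lecture. The backward function is an assumed counterpart, since the lecture only says that the same mask is reused in backprop.

```python
import numpy as np


def dropout_forward(a, keep_prob, rng):
    """Inverted dropout on activations `a` (shape: units x examples)."""
    # Boolean mask: each entry is True with probability keep_prob.
    d = rng.random(a.shape) < keep_prob
    # Zero out dropped units (True/False act as 1/0 in the product),
    # then divide by keep_prob so the expected value of `a` is unchanged.
    a_dropped = (a * d) / keep_prob
    return a_dropped, d


def dropout_backward(da, d, keep_prob):
    """Assumed backprop counterpart: reuse the same mask and scaling."""
    return (da * d) / keep_prob


rng = np.random.default_rng(0)      # arbitrary seed, for repeatability only
keep_prob = 0.8
a3 = np.ones((50, 10000))           # stand-in: 50 hidden units, 10000 examples

# Training time: a fresh mask is drawn on every forward pass.
a3_train, d3 = dropout_forward(a3, keep_prob, rng)
kept_fraction = d3.mean()           # should be close to keep_prob
mean_after = a3_train.mean()        # should be close to a3.mean()

# Gradients flow only through the units that were kept.
da3 = dropout_backward(np.ones_like(a3), d3, keep_prob)

# Test time: no mask and no extra scaling; activations are used as they are.
a3_test = a3

print(f"kept fraction: {kept_fraction:.3f}")
print(f"mean activation after inverted dropout: {mean_after:.3f}")
```

Because of the division by keep_prob, the average activation after dropout stays close to the original average, which is why the test-time forward pass needs no compensating scale factor.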