Dropout Regularization (C2W1L06)

DeepLearningAI · Beginner · 🧬 Deep Learning · 8y ago

Key Takeaways

The video discusses Dropout Regularization, a technique to prevent overfitting in neural networks, and its implementation using inverted dropout, with a focus on supervised learning and deep learning techniques.

Full Transcript

In addition to L2 regularization, another very powerful regularization technique is called dropout. Let's see how that works. Let's say you've trained a neural network like the one on the left, and it's overfitting. Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the network. So let's say that for each of these layers, we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. After the coin tosses, maybe you decide to eliminate those nodes. Then what you do is actually remove all the ingoing and outgoing links from those nodes as well, so you end up with a much smaller, really much diminished network. Then you do backpropagation, training on this one example on this much diminished network. On a different example, you would toss the set of coins again and keep a different set of nodes, and drop out, or eliminate, a different set of nodes. So for each training example, you would train it using one of these reduced networks.

Maybe it seems like a slightly crazy technique to just go around knocking out nodes at random, but this actually works. You can imagine that because you're training a much smaller network on each example, maybe this gives a sense of why you end up able to regularize the network: these much smaller networks are being trained.

Let's look at how you can implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l = 3, so in the code I'm going to write there will be a bunch of 3s. I'm just illustrating how to implement dropout in a single layer.

What we're going to do is set a vector d, and d3 is going to be the dropout vector for layer 3. d3 is going to be np.random.rand, with the same shape as a3, and I'm going to see if this is less than some number, which I'm going to call keep_prob. So keep_prob is a number. It was 0.5 on the previous slide, and maybe now I'll use 0.8 in this example. It is the probability that a given hidden unit will be kept. So if keep_prob is equal to 0.8, this means there's a 0.2 chance of eliminating any hidden unit. What this does is generate a random matrix, and this works as well if you have vectorized. So d3 will be a matrix where, for each example and each hidden unit, there's a 0.8 chance that the corresponding entry of d3 will be 1 and a 20% chance it will be 0. This random number being less than 0.8 has a 0.8 chance of being 1, or being true, and a 20%, or 0.2, chance of being false, or being 0.

Then what you're going to do is take your activations from the third layer. I'll just call it a3 in this little example, so a3 has the activations you computed. I'm going to set a3 to be equal to the old a3 times d3, where this is an element-wise multiplication. You could also write this as a3 *= d3. What this does is, for every element of d3 that's equal to 0, and there's a 20% chance of each of the elements being 0, this multiply operation ends up zeroing out the corresponding element of a3. If you do this in Python, technically d3 will be a boolean array with values true and false rather than 1 and 0, but the multiply operation works and will interpret the true and false values as 1 and 0. If you try this yourself in Python, you'll see.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter. Let me explain what this final step is doing. Let's say, for the sake of argument, that you have 50 units, or 50 neurons, in the third hidden layer. So maybe a3 is 50 by 1 dimensional, or, if you've vectorized, it will be 50 by m dimensional. If you have an 80% chance of keeping them and a 20% chance of eliminating them, this means that on average you end up with 10 units shut off, or 10 units zeroed out. Now if you look at the value of z4, z4 is going to be equal to W4 times a3 plus b4, and so in expectation this will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So in order to not reduce the expected value of z4, what you need to do is take this and divide it by 0.8, because this will correct, or just bump it back up by roughly the 20% that you need, so as to not change the expected value of a3.

This line here is what's called the inverted dropout technique. Its effect is that no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because you're keeping everything), or 0.5 or whatever, this inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same. And it turns out that at test time, when you're trying to evaluate a neural network, which we'll talk about on the next slide, this inverted dropout technique, this line that I've drawn the green box around, makes test time easier because you have less of a scaling problem. By far the most common implementation of dropout today, as far as I know, is inverted dropout. I recommend you just implement this. There were some early iterations of dropout that missed this divide-by-keep_prob line, and so at test time the averaging became more and more complicated. But again, people tend not to use those other versions.

So what you do is use the d vector, and you'll notice that for different training examples you zero out different hidden units. In fact, if you make multiple passes through the same training set, then on different passes through the training set you should randomly zero out different hidden units. So it's not that for one example you should keep zeroing out the same hidden units. It's that on iteration one of gradient descent you might zero out some hidden units, and on the second iteration of gradient descent, where you go through the training set a second time, maybe you zero out a different pattern of hidden units. The vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop and in back prop. We're just showing forward prop here.

Now, having trained the algorithm, here's what you would do at test time. At test time you're given some x on which you want to make a prediction, and using our standard notation, I'm going to use a0, the activations of the zeroth layer, to denote this test example x. What we're going to do is not use dropout at test time. In particular, z1 = W1 a0 + b1, a1 = g1(z1), z2 = W2 a1 + b2, a2 = g2(z2), and so on, until you get to the last layer and you make a prediction y hat. But notice that at test time you're not using dropout explicitly. You're not tossing coins at random, you're not flipping coins to decide which hidden units to eliminate. That's because when you're making predictions at test time, you don't really want your output to be random. If you were implementing dropout at test time, that would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out, and then average across them. But that's computationally inefficient, and it gives you roughly the same result, a very, very similar result, to this procedure.

And just to mention the inverted dropout thing: you remember the step on the previous slide where we divided by keep_prob. The effect of that was to ensure that even when you don't implement dropout and its scaling at test time, the expected value of these activations doesn't change. So you don't need to add in an extra funny scaling parameter at test time that's different from what you had at training time.

So that's dropout. When you implement this in this week's programming exercise, you'll gain more first-hand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout really is doing. Let's go on to the next video.
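
The single-layer procedure described in the transcript can be sketched in NumPy roughly as follows. This is a minimal illustration rather than the course's own code: the function name dropout_layer, the training flag, the seeded random generator, and the all-ones activations are assumptions made for the example, while a3, d3, and keep_prob follow the names used in the lecture.

```python
import numpy as np

def dropout_layer(a, keep_prob, rng, training=True):
    """Inverted dropout for one layer's activations (hypothetical helper).

    a has shape (units, examples). Returns the new activations and the mask.
    """
    if not training:
        # Test time: no coin tosses and no extra scaling.
        return a, None
    # Boolean mask with the same shape as a; each entry is True
    # (unit kept) with probability keep_prob.
    d = rng.random(a.shape) < keep_prob
    # Zero out the dropped units, then divide by keep_prob so the
    # expected value of the activations stays the same.
    return (a * d) / keep_prob, d

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
a3 = np.ones((50, 1000))         # assumed activations: 50 units, 1000 examples

a3_train, d3 = dropout_layer(a3, keep_prob=0.8, rng=rng)
kept_fraction = d3.mean()            # close to 0.8
mean_activation = a3_train.mean()    # close to 1.0, the mean of the original a3

a3_test, no_mask = dropout_layer(a3, keep_prob=0.8, rng=rng, training=False)
```

Returning the mask alongside the activations reflects the point made in the lecture that the same d3 is needed again in backpropagation; a fresh mask is drawn on every call, so each training iteration zeroes out a different pattern of units.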

Original Description

Take the Deep Learning Specialization: http://bit.ly/2x5Z9YT Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches Dropout Regularization, a powerful technique to prevent overfitting in neural networks, and its implementation using inverted dropout, with a focus on supervised learning and deep learning techniques. By watching this video, learners can understand how to implement dropout regularization and prevent overfitting in their own neural network models. This technique is crucial in deep learning as it helps to improve the generalization of models.

Key Takeaways
  1. Implement dropout by generating a random dropout mask for each layer, with the same shape as that layer's activations
  2. Set the activations of the dropped units to zero by multiplying the activations element-wise by the mask
  3. Use inverted dropout: scale up the remaining activations by dividing by the keep probability
  4. Draw a new mask on every training iteration, and use the same mask in forward and backward propagation
  5. Make predictions at test time without using dropout
💡 Dropout regularization randomly zeroes hidden units during training to prevent overfitting, and the inverted dropout scaling applied during training keeps the expected value of the activations unchanged, so no extra scaling is needed at test time
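
The claim that dividing by the keep probability preserves the expected value can be checked with a short derivation. The notation here is an assumption made for the sketch: p stands for keep_prob, d_i is the mask entry for unit i, a_i is the activation before dropout (treated as fixed), and the expectation is taken over the random mask only.

```latex
\begin{align*}
d_i &\sim \mathrm{Bernoulli}(p), \qquad \mathbb{E}[d_i] = p \\
\tilde{a}_i &= \frac{a_i \, d_i}{p} \\
\mathbb{E}[\tilde{a}_i] &= \frac{a_i \, \mathbb{E}[d_i]}{p} = \frac{a_i \, p}{p} = a_i \\
\mathbb{E}\big[ W \tilde{a} + b \big] &= W \, \mathbb{E}[\tilde{a}] + b = W a + b
\end{align*}
```

The last line is why the input to the next layer has the same expected value with or without the mask, which is what allows dropout to be switched off at test time without any compensating scale factor.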
