Understanding Dropout (C2W1L07)

DeepLearningAI · Beginner · 🧬 Deep Learning · 8y ago

Key Takeaways

The video discusses dropout in neural networks: why it helps prevent overfitting, how it can be interpreted as an adaptive form of L2 regularization, how the keep probability can be varied from layer to layer, and what to watch for when using it in practice.

Full Transcript

Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition.

In the previous video, I gave the intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

Here's the second intuition. Let's look at it from the perspective of a single unit, say this one. For this unit to do its job, it has four inputs and needs to generate some meaningful output. Now, with dropout, the inputs can get randomly eliminated: sometimes those two units will get eliminated, sometimes a different unit will get eliminated. What this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random; any one of its own inputs could go away at random. So in particular, it would be reluctant to put all of its bets on, say, just this input. The unit would be reluctant to put too much weight on any one input, because it could go away. So this unit would be more motivated to spread out the weights and give a little bit of weight to each of the four inputs to this unit. By spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights. So, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and, similar to L2 regularization, it helps to prevent overfitting.

It turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different depending on the size of the activations being multiplied into that weight. To summarize, it is possible to show that dropout has a similar effect to L2 regularization; only the L2 regularization applied to different weights can be a little bit different
and even more adaptive to the scale of different inputs.

One more detail for when you're implementing dropout. Here's a network where you have three input features. This is seven hidden units here, then seven, three, two, one. One of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. For the first layer, your matrix W1 will be 3 by 7. Your second weight matrix will be 7 by 7. W3 will be 7 by 3, and so on. So W2 is actually the biggest weight matrix; the largest set of parameters will be in W2, which is 7 by 7. So to reduce overfitting of that matrix, maybe for this layer, I guess this is layer 2, you might have a keep_prob that is relatively low, say 0.5, whereas for different layers where you might worry less about overfitting, you could have a higher keep_prob, maybe just 0.7. And for the layers where we don't worry about overfitting at all, you can have a keep_prob of 1.0. For clarity, these numbers I'm drawing in the purple boxes could be different keep_probs for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, so you're really not using dropout for that layer. But for the layers where you're more worried about overfitting, really the layers with a lot of parameters, you could set keep_prob to be smaller to apply a more powerful form of dropout. It's kind of like cranking up the regularization parameter lambda of L2 regularization when you try to regularize some layers more than others.

Technically, you can also apply dropout to the input layer, where you have some chance of dropping one or more of the input features, although in practice you usually don't do that often. A keep_prob of 1.0 is quite common for the input layer. You might also use a very high value, maybe 0.9, but it's much less likely that you'd want to eliminate half of the input features. So usually keep_prob, if
you apply it at all, will be a number close to 1, if you even apply dropout at all to the input layer.

So just to summarize: if you're more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then just have one hyperparameter, which is the keep_prob for the layers for which you do apply dropout.

Before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, inputting all these pixels, that you almost never have enough data, and so dropout is very frequently used in computer vision. There are some computer vision researchers who pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. So unless my algorithm is overfitting, I wouldn't actually bother to use dropout. So it's used somewhat less often in other application areas. It's just that in computer vision you usually don't have enough data, so you're almost always overfitting, which is why there tend to be some computer vision researchers who swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines.

One big downside of dropout is that the cost function J is no longer well defined. On every iteration, you're randomly killing off a bunch of nodes. So if you're double-checking the performance of gradient descent, it's actually harder to double-check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well defined, or is certainly hard to calculate. So you lose the
debugging tool of plotting a graph like this. So what I usually do is turn off dropout, that is, set keep_prob equal to 1, run my code and make sure that it is monotonically decreasing J, and then turn on dropout and hope that I didn't introduce bugs into my code during dropout. Because you need other ways, I guess, besides plotting these figures, to make sure that your code is working and that gradient descent is working even with dropout.

So with that, there are still a few more regularization techniques that are worth your knowing. Let's talk about a few more such techniques in the next video.
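
The idea of choosing a different keep_prob for each layer can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the video: the function name forward_with_dropout, the ReLU hidden activations, the sigmoid output, the row-per-example data layout, and the inverted-dropout rescaling (dividing by keep_prob) are all assumptions. The layer sizes 3, 7, 7, 3, 2, 1 and the values 0.5, 0.7 and 1.0 follow the example in the transcript, but which of the remaining layers gets 0.7 or 1.0 is chosen here only for illustration.

```python
import numpy as np

def forward_with_dropout(X, weights, biases, keep_probs, rng):
    """Training-mode forward pass with a separate keep probability per layer.

    X          : array of shape (m, n_features), one example per row
    weights    : list of arrays; weights[l] has shape (n_in, n_out)
    biases     : list of arrays; biases[l] has shape (n_out,)
    keep_probs : keep_probs[0] applies to the input, keep_probs[l] to the
                 output of hidden layer l; 1.0 means no dropout there
    """
    def drop(A, keep_prob):
        if keep_prob >= 1.0:
            return A  # keeping every unit: dropout is effectively off
        mask = rng.random(A.shape) < keep_prob  # True = keep this unit
        return A * mask / keep_prob             # inverted-dropout rescaling (assumed)

    A = drop(X, keep_probs[0])
    n_layers = len(weights)
    for l in range(n_layers):
        Z = A @ weights[l] + biases[l]
        if l < n_layers - 1:
            A = np.maximum(Z, 0.0)              # ReLU hidden activation (assumed)
            A = drop(A, keep_probs[l + 1])
        else:
            A = 1.0 / (1.0 + np.exp(-Z))        # sigmoid output, never dropped
    return A

rng = np.random.default_rng(0)
layer_sizes = [3, 7, 7, 3, 2, 1]                # network from the transcript
weights = [rng.standard_normal((n_in, n_out)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Input layer 1.0, hidden layer 1 at 0.7, hidden layer 2 (the 7 by 7 matrix)
# at 0.5, then 0.7 and 1.0 for the smaller layers (illustrative assignment).
keep_probs = [1.0, 0.7, 0.5, 0.7, 1.0]

X = rng.standard_normal((5, 3))
Y_hat = forward_with_dropout(X, weights, biases, keep_probs, rng)
print(Y_hat.shape)
```

With this layout the weight matrices come out as 3 by 7, 7 by 7 and 7 by 3, matching the shapes mentioned in the transcript. Setting every entry of keep_probs to 1.0 makes the pass deterministic, which is the "dropout off" configuration used for debugging.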

Key Takeaways

Understand the concept of dropout and its purpose
Implement dropout in a neural network
Adjust the keep probability for different layers
Use cross-validation to find the optimal keep probability
Turn off dropout to debug the code and ensure the cost function is well-defined
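
The last takeaway, checking the cost curve with dropout turned off before turning it on, can be sketched as below. This is a hedged toy example and not the video's code: a small logistic-regression model with dropout on its input features stands in for a full network, and the names train and is_monotonically_decreasing, the synthetic data, the learning rate and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def train(X, y, keep_prob, n_iters=100, lr=0.5, seed=0):
    """Batch gradient descent on logistic regression; returns cost J per iteration.

    When keep_prob < 1, inverted dropout is applied to the input features
    (a toy stand-in for dropout inside a deeper network).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    for _ in range(n_iters):
        if keep_prob < 1.0:
            mask = rng.random(X.shape) < keep_prob  # fresh mask every iteration
            A0 = X * mask / keep_prob
        else:
            A0 = X                                  # dropout off
        p = 1.0 / (1.0 + np.exp(-(A0 @ w + b)))
        p = np.clip(p, 1e-12, 1 - 1e-12)            # guard against log(0)
        costs.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))
        grad_z = (p - y) / m
        w -= lr * (A0.T @ grad_z)
        b -= lr * grad_z.sum()
    return np.array(costs)

def is_monotonically_decreasing(costs, tol=1e-12):
    """True if J never goes up from one iteration to the next."""
    return bool(np.all(np.diff(costs) <= tol))

# Synthetic data (hypothetical, only for the demonstration).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(200) > 0).astype(float)

costs_off = train(X, y, keep_prob=1.0)  # step 1: dropout off, J is well defined
costs_on = train(X, y, keep_prob=0.5)   # step 2: dropout on, J is measured on a random mask

print("dropout off, monotone:", is_monotonically_decreasing(costs_off))
print("dropout on,  monotone:", is_monotonically_decreasing(costs_on))
```

With keep_prob set to 1.0 the recorded cost is the true training cost, so a curve that goes up points to a bug or a learning rate that is too large. With dropout on, each recorded value is computed on a different random mask, so the curve is noisy and no longer a reliable check on its own.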

💡 Dropout is an adaptive form of L2 regularization that helps prevent overfitting in neural networks by randomly knocking out units, and its keep probability can be adjusted for different layers to achieve the best results.
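
One way to make the "adaptive L2" statement concrete is a special case that is simple enough to work out by hand: a single linear unit trained with squared error, with dropout applied to its inputs and no rescaling. This setting and the symbols below ($p$ for the keep probability, $r_{ij}$ for independent Bernoulli keep masks, $x_{ij}$ for input $j$ of example $i$, $w_j$ for the weight on input $j$) are assumptions chosen for illustration; the video states the result without deriving it. Averaging the cost over the random masks gives:

```latex
\begin{aligned}
\mathbb{E}_{r}\!\left[\sum_{i=1}^{m}\Big(y_i - \sum_{j} r_{ij}\, x_{ij}\, w_j\Big)^{2}\right]
&= \sum_{i=1}^{m}\Big(y_i - p\sum_{j} x_{ij}\, w_j\Big)^{2}
 + p\,(1-p)\sum_{j} w_j^{2} \sum_{i=1}^{m} x_{ij}^{2},
\qquad r_{ij} \sim \mathrm{Bernoulli}(p).
\end{aligned}
```

The first term is the ordinary squared error with the inputs scaled by $p$, and it comes from the mean of the masked prediction. The second term comes from its variance, $p(1-p)\,x_{ij}^{2}w_j^{2}$ per input, and is an L2 penalty on each weight $w_j$ scaled by $\sum_i x_{ij}^{2}$, the total squared size of the input multiplied into that weight. Weights fed by larger inputs are therefore penalized more, which is the sense in which the penalty adapts. The identity is exact only in this linear, squared-error setting and is offered as an illustration of the intuition, not as a statement about deep networks.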
