Dropout Regularization (C2W1L06)

DeepLearningAI · Beginner · 🧬 Deep Learning · 8y ago

Skills: ML Maths Basics 80% · Supervised Learning 70%

Key Takeaways

The video discusses Dropout Regularization, a technique to prevent overfitting in neural networks, and its implementation using inverted dropout, with a focus on supervised learning and deep learning techniques.

Full Transcript

In addition to L2 regularization, another very powerful regularization technique is called dropout. Let's see how that works. Let's say you train a neural network like the one on the left, and it's overfitting. Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the network. So let's say that for each of these layers, we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. So after the coin tosses, maybe you decide to eliminate those nodes. Then what you do is actually remove all the ingoing and outgoing links from that node as well. So you end up with a much smaller, really much diminished network, and then you do backpropagation, training this one example on this much diminished network. And then on a different example, you would toss the set of coins again and keep a different set of nodes, and then drop out, or eliminate, a different set of nodes. And so for each training example, you would train it using one of these reduced networks.

So maybe this seems like a slightly crazy technique, to just go around killing nodes at random, but this actually works. You can imagine that because you're training a much smaller network on each example, maybe this gives a sense of why you end up able to regularize the network, because these much smaller networks are being trained.

So let's look at how you can implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l = 3. So in the code I'm going to write, there'll be a bunch of 3s here. I'm just illustrating how to implement dropout in a single layer.

So what we're going to do is set a vector d; d3 is going to be the dropout vector for layer 3, that's what the 3 is. It's going to be np.random.rand(a3.shape[0], a3.shape[1]), so it's going to be the same shape as a3, and we're going to see if this is less than some number, which I'm going to call keep_prob. So keep_prob is a number. It was 0.5 on the previous slide, and maybe now I'll use 0.8 in this example, and it will be the probability that a given hidden unit will be kept. So if keep_prob is equal to 0.8, then this means that there's a 0.2 chance of eliminating any hidden unit. So what this does is it generates a random matrix, and this works as well if you have vectorized. So d3 will be a matrix where, for each example and each hidden unit, there's a 0.8 chance that the corresponding entry of d3 will be 1 and a 20% chance that it will be 0. This random number being less than 0.8 has a 0.8 chance of being 1, or being true, and a 20%, or 0.2, chance of being false, of being 0.

And then what you're going to do is take your activations from the third layer; I'm just going to call it a3 in this little example. So a3 has the activations you computed, and I'm going to set a3 to be equal to the old a3 times d3, and this is an element-wise multiplication. Or I guess you could also write this as a3 *= d3. What this does is, for every element of d3 that's equal to 0, and there's a 20% chance of each of the elements being 0, this multiply operation ends up zeroing out the corresponding element of a3. If you do this in Python, technically d3 will be a boolean array with values True and False rather than 1 and 0, but the multiply operation works and will interpret the True and False values as 1 and 0. If you try this yourself in Python, you'll see.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter. So let me explain what this final step is doing. Let's say, for the sake of argument, that you have 50 units, or 50 neurons, in the third hidden layer. So maybe a3 is 50 by 1 dimensional, or, if you're vectorizing, maybe it's 50 by m dimensional. So if you have an 80% chance of keeping them and a 20% chance of eliminating them, this means that on average you end up with 10 units shut off, or 10 units zeroed out. And so now, if you look at the value of z4, z4 is going to be equal to W4 times a3 plus b4. And so, in expectation, this will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So in order to not reduce the expected value of z4, what you do is take a3 and divide it by 0.8, because this will correct it, or just bump it back up by roughly the 20% that you need, so as to not change the expected value of a3.

And so this line here is what's called the inverted dropout technique. Its effect is that no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because you're keeping everything), or 0.5 or whatever, this inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same. And it turns out that at test time, when you're trying to evaluate a neural network, which we'll talk about on the next slide, this inverted dropout technique, this line that I've drawn the green box around, makes test time easier because you have less of a scaling problem. By far the most common implementation of dropout today, as far as I know, is inverted dropout. I recommend you just implement this. There were some early iterations of dropout that missed this divide-by-keep_prob line, and so at test time the averaging became more and more complicated. But again, people tend not to use those other versions.

So what you do is you use the d vector, and you'll notice that for different training examples you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set you should randomly zero out different hidden units. So it's not that for one example you should keep zeroing out the same hidden units. It's that on iteration one of gradient descent you might zero out some hidden units, and on the second iteration of gradient descent, where you go through the training set a second time, maybe you zero out a different pattern of hidden units. And the vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop as well as in backprop. We're just showing forward prop here.

Now, having trained the algorithm, here's what you would do at test time. At test time, you're given some x on which you want to make a prediction, and using our standard notation, I'm going to use a0, the activations of the zeroth layer, to denote this test example x. So what we're going to do is not use dropout at test time. In particular, we're going to set z1 = W1 a0 + b1, a1 = g1(z1), z2 = W2 a1 + b2, a2 = g2(z2), and so on, until you get to the last layer and you make a prediction y-hat. But notice that at test time you're not using dropout explicitly, and you're not tossing coins at random, you're not flipping coins to decide which hidden units to eliminate. And that's because when you're making predictions at test time, you don't really want your output to be random. If you were implementing dropout at test time, that would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and then average across them, but that's computationally inefficient and will give you roughly the same result, a very, very similar result to this procedure.

And just to mention the inverted dropout thing: you remember the step on the previous slide where we divided by keep_prob. The effect of that was to ensure that even when you don't implement dropout at test time, with its scaling, the expected values of these activations don't change. So you don't need to add in an extra funny scaling parameter at test time that's different from what you had at training time.

So that's dropout, and when you implement this in this week's programming exercise, you'll gain more first-hand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout really is doing. Let's go on to the next video.
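The layer-3 steps described in the transcript (build a mask, multiply element-wise, divide by keep_prob) can be sketched roughly as follows. This is a minimal illustration rather than the course's own code: the function name `inverted_dropout`, the all-ones activations, and the use of NumPy's newer `default_rng` generator (instead of the `np.random.rand` call mentioned in the video) are assumptions made for the example.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to an activation matrix `a` (units x examples).

    Returns the masked, rescaled activations and the mask itself,
    since the same mask is needed again in backprop.
    """
    # Boolean mask: each entry is True with probability keep_prob.
    d = rng.random(a.shape) < keep_prob
    # Zero out the dropped units (True/False behave as 1/0 in the product).
    a = a * d
    # Scale up so the expected value of the activations is unchanged.
    a = a / keep_prob
    return a, d

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
a3 = np.ones((50, 1000))         # hypothetical activations: 50 units, m = 1000
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)

print(d3.mean())          # fraction of units kept, close to 0.8
print(a3_dropped.mean())  # close to 1.0, the mean of the original a3
```

Returning the mask alongside the activations reflects the point in the transcript that d3 decides what to zero out in both forward and backward propagation.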

Original Description

Take the Deep Learning Specialization: http://bit.ly/2x5Z9YT Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches Dropout Regularization, a powerful technique to prevent overfitting in neural networks, and its implementation using inverted dropout, with a focus on supervised learning and deep learning techniques. By watching this video, learners can understand how to implement dropout regularization and prevent overfitting in their own neural network models. This technique is crucial in deep learning as it helps to improve the generalization of models.

Key Takeaways

Implement dropout in a neural network by generating a random dropout mask for each layer, with the same shape as that layer's activations
Set the activations of the dropped units to zero by multiplying the activations element-wise by the mask
Use inverted dropout: divide the remaining activations by the keep probability so their expected value is unchanged
Draw a fresh mask on every pass through the training data, and use the same mask in forward and backward propagation
Make predictions at test time without using dropout
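The difference between training-time and test-time forward passes can be sketched as below. The two-layer architecture, the ReLU and sigmoid activations, the random weights, and the convention of enabling dropout only when a random generator is passed in are all assumptions made for this illustration; none of them come from the video.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, params, keep_prob=1.0, rng=None):
    """Forward pass through a hypothetical 2-layer network.

    Inverted dropout is applied to the hidden layer only when `rng` is
    given (training). Without `rng` (test time) nothing is dropped and
    no extra scaling is applied.
    """
    W1, b1, W2, b2 = params
    a1 = relu(W1 @ x + b1)
    if rng is not None:
        d1 = rng.random(a1.shape) < keep_prob   # fresh mask on every call
        a1 = a1 * d1 / keep_prob                # mask, then rescale
    return sigmoid(W2 @ a1 + b2)

init = np.random.default_rng(1)                 # seeds are arbitrary
params = (init.standard_normal((20, 3)), np.zeros((20, 1)),
          init.standard_normal((1, 20)), np.zeros((1, 1)))
x = init.standard_normal((3, 5))                # 5 examples, 3 features

# Test time: deterministic, identical on every call.
y_test_a = forward(x, params)
y_test_b = forward(x, params)

# Training time: a different random mask, so outputs vary between calls.
drop_rng = np.random.default_rng(2)
y_train_a = forward(x, params, keep_prob=0.8, rng=drop_rng)
y_train_b = forward(x, params, keep_prob=0.8, rng=drop_rng)

print(np.array_equal(y_test_a, y_test_b))
print(np.array_equal(y_train_a, y_train_b))
```

Because the rescaling already happens during training, the test-time branch is just the plain forward pass.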

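💡 Dropout regularization randomly zeroes hidden units during training to prevent overfitting, and inverted dropout divides the surviving activations by the keep probability during training so that their expected value doesn't change and no extra scaling is needed at test time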
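The claim that dividing by the keep probability preserves the expected activation can be checked with a short derivation. It assumes each mask entry $d_i$ is an independent Bernoulli variable with $P(d_i = 1) = p$, where $p$ stands for keep_prob, and it treats the pre-dropout activations $a^{[3]}$ as fixed; the tilde notation for the post-dropout activations is introduced here only for illustration.

```latex
\begin{align*}
\tilde{a}^{[3]}_i &= \frac{a^{[3]}_i \, d_i}{p},
  \qquad d_i \sim \mathrm{Bernoulli}(p) \\
\mathbb{E}\big[\tilde{a}^{[3]}_i\big]
  &= \frac{a^{[3]}_i \, \mathbb{E}[d_i]}{p}
   = \frac{a^{[3]}_i \, p}{p}
   = a^{[3]}_i \\
\mathbb{E}\big[z^{[4]}\big]
  &= W^{[4]} \, \mathbb{E}\big[\tilde{a}^{[3]}\big] + b^{[4]}
   = W^{[4]} a^{[3]} + b^{[4]}
\end{align*}
```

Without the division by $p$, the same steps give $\mathbb{E}[a^{[3]}_i d_i] = p \, a^{[3]}_i$, which is the 20% reduction described in the transcript for $p = 0.8$.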
