Other Regularization Methods (C2W1L08)

DeepLearningAI · Beginner ·🧬 Deep Learning ·8y ago

Key Takeaways

The video covers two further regularization methods, data augmentation and early stopping, and compares early stopping with L2 regularization as ways to reduce overfitting in neural networks.

Full Transcript

In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look.

Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive, and sometimes you just can't get more data. What you can do is augment your training set by taking an image like this and, for example, flipping it horizontally and adding that to your training set as well. So now, instead of just this one example in your training set, you can add this as another training example. By flipping your images horizontally, you could double the size of your training set. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand new independent examples, but you can do this without needing to pay the expense of going out to take more pictures of cats.

Other than flipping horizontally, you can also take random crops of the image. So here we've rotated and sort of randomly zoomed into the image, and this still looks like a cat. By taking random distortions and transformations of the image, you can augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as if you were to go out and get a brand new independent example of a cat. But because you can do this almost for free, other than for some computational cost, this can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. By synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat. Notice I didn't flip it vertically, because maybe we don't want upside-down cats, right? And then also, randomly zooming in to part of the image is probably still a cat.

For optical character recognition, you can also augment your data set by taking digits and imposing random rotations and distortions on them. If you add these things to your training set, these are also still digit fours. For illustration I applied a very strong distortion, so this 4 looks very wavy. In practice you don't need to distort the 4 quite as aggressively; a more subtle distortion than what I'm showing here is usually used, and I only exaggerated it to make this example clearer for you, because these look like really warped fours. So data augmentation can be used as a regularization technique, in effect similar to regularization.

There's one other technique that is often used, called early stopping. As you run gradient descent, you plot either your training error, say the 0-1 classification error on the training set, or just the cost function J that you're optimizing, and that should decrease monotonically, like so, because as you train, hopefully your training error or your cost function J should decrease. With early stopping, you plot this and you also plot your dev set error. Again, this could be a classification error on the development set, or something like the cost function, like the logistic loss or the log loss, evaluated on the dev set. What you find is that your dev set error will usually go down for a while and then increase from there. So what early stopping does is say: well, it looks like your neural network was doing best around that iteration, so we're just going to stop training your neural network halfway and take whatever value achieved this dev set error.

So why does this work? When you haven't run many iterations for your neural network yet, your parameters W will be close to zero, because with random initialization you probably initialize W to small random values, so before you train for a long time W is still quite small. As you iterate, as you train, W gets bigger and bigger and bigger, so here maybe you have a much larger value of the parameters W for your neural network. What early stopping does is, by stopping halfway, you have only a mid-size W. So, similar to L2 regularization, by picking a neural network with a smaller norm for your parameters W, hopefully your neural network is overfitting less. The term early stopping refers to the fact that you're just stopping the training of your neural network early.

I sometimes use early stopping when training a neural network, but it does have one downside. Let me explain. I think of the machine learning process as comprising several different steps. One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent, and we'll talk later about other algorithms like momentum, RMSprop, Adam, and so on. But after optimizing the cost function J, you also want to not overfit, and we have some tools to do that, such as regularization, getting more data, and so on. Now, in machine learning we already have so many hyperparameters to search over that it is already very complicated to choose among the space of possible algorithms. So I find machine learning easier to think about when you have one set of tools for optimizing the cost function J, and when you're focusing on optimizing the cost function J, all you care about is finding W and b so that J(W, b) is as small as possible. You just don't think about anything else other than reducing this. Then it's a completely separate task to not overfit, in other words to reduce the variance, and when you're doing that you have a second set of tools for doing it. This principle is sometimes called orthogonalization, and the idea is that you want to think about one task at a time. I'll say more about orthogonalization in a later video, so if you don't fully get the concept yet, don't worry about it.

To me, the main downside of early stopping is that it couples these two tasks, so you can no longer work on these two problems independently. By stopping gradient descent early, you're sort of breaking whatever you're doing to optimize the cost function J, because now you're not doing a great job reducing the cost function J; you've sort of not done that that well. And you're also simultaneously trying to not overfit. So instead of using different tools to solve the two problems, you're using one tool that kind of mixes the two, and this just makes the set of things you could try more complicated to think about.

Rather than using early stopping, one alternative is to just use L2 regularization. Then you can just train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. The downside of this, though, is that you might have to try a lot of values of the regularization parameter lambda, and so searching over many values of lambda is more computationally expensive. The real advantage of early stopping is that, running the gradient descent process just once, you get to try out values of small W, mid-size W, and large W, without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn't completely make sense yet, don't worry about it. We'll talk about orthogonalization in greater detail in a later video, and I think this will make a bit more sense then.

Despite its disadvantages, many people do use early stopping. I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda.

So you've now seen how to use data augmentation, as well as, if you wish, early stopping, in order to reduce variance and prevent overfitting in your neural network. Next, let's talk about some techniques for setting up your optimization problem to make your training go quickly.
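
As a rough illustration of the augmentation steps described in the transcript (horizontal flip and random crop), here is a minimal sketch in plain Python. It assumes an image is a list of rows of pixel values; the helper names `flip_horizontal`, `random_crop`, and `augment` are hypothetical and do not come from the video. Real pipelines would normally use an image library instead.

```python
import random


def flip_horizontal(image):
    """Mirror an image left to right. `image` is a list of rows of pixel values."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a randomly placed crop_h x crop_w window out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(images, labels):
    """Return the original set plus one flipped copy per image.

    A flipped cat is still a cat, so each copy keeps its original label.
    """
    aug_images = images + [flip_horizontal(img) for img in images]
    aug_labels = labels + labels
    return aug_images, aug_labels


# Toy 3x3 "image" standing in for a real photo (assumption for illustration).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = flip_horizontal(image)
crop = random_crop(image, 2, 2, random.Random(0))
aug_images, aug_labels = augment([image], [1])

print(flipped)
print(len(aug_images), aug_labels)
```

The augmented set is twice as large, but the added examples are derived from the originals, which matches the point in the transcript that they carry less information than brand new independent examples.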

Original Description

Take the Deep Learning Specialization: http://bit.ly/3cAd49Y
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai


This video teaches how to use data augmentation, early stopping, and L2 regularization to prevent overfitting in neural networks, and how to apply these techniques to improve model performance. The video is part of the Deep Learning Specialization and covers key concepts in machine learning and deep learning.

Key Takeaways
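  1. Flip images horizontally to roughly double the size of the training set; the flipped copies are partly redundant, so they add less information than new independent examples
  2. Apply random crops, rotations, and mild distortions to images to create additional synthetic training examples
  3. Plot the training error or cost function J, together with the dev set error, against the number of iterations
  4. Stop training around the iteration where the dev set error is lowest, before it starts to rise
  5. Early stopping effectively tries small, mid-size, and large W in a single gradient descent run, without needing to try many values of the regularization parameter lambda
  6. Alternatively, use L2 regularization, train as long as possible, and try different values of lambda if you can afford the computation
  7. Keep in mind that early stopping couples optimizing J with reducing overfitting, which makes the two harder to tune independently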
💡 Regularization techniques such as data augmentation, early stopping, and L2 regularization can be used to prevent overfitting in neural networks and improve model performance.
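
The early stopping takeaways above (track the dev set error during training and keep the parameters from the iteration where it was lowest) can be sketched roughly as follows. This is a sketch under stated assumptions, not the method from the video: it uses a tiny logistic regression in plain Python instead of a neural network, synthetic data, zero initialization, and a `patience` parameter (how many iterations to wait without improvement before stopping), none of which appear in the lecture.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def log_loss(w, b, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(w, b, xi), 1e-12), 1 - 1e-12)  # avoid log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


def gradient_step(w, b, X, y, lr):
    m = len(X)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = predict(w, b, xi) - yi
        for j, xj in enumerate(xi):
            grad_w[j] += err * xj / m
        grad_b += err / m
    return [wj - lr * gj for wj, gj in zip(w, grad_w)], b - lr * grad_b


def train_with_early_stopping(X_train, y_train, X_dev, y_dev,
                              lr=0.5, max_iters=500, patience=20):
    w, b = [0.0] * len(X_train[0]), 0.0   # start with small (here zero) weights
    best = {"dev_loss": float("inf"), "iter": 0, "w": list(w), "b": b}
    history = []                          # (train_loss, dev_loss) per iteration
    for it in range(1, max_iters + 1):
        w, b = gradient_step(w, b, X_train, y_train, lr)
        dev_loss = log_loss(w, b, X_dev, y_dev)
        history.append((log_loss(w, b, X_train, y_train), dev_loss))
        if dev_loss < best["dev_loss"]:
            # Remember the parameters that did best on the dev set so far.
            best = {"dev_loss": dev_loss, "iter": it, "w": list(w), "b": b}
        elif it - best["iter"] >= patience:
            break                         # dev loss has stopped improving
    return best, history


def make_data(n, rng):
    # Synthetic data (assumption): only the first of 5 features carries signal.
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(2.0 * x[0]) else 0 for x in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(20, rng)
X_dev, y_dev = make_data(200, rng)
best, history = train_with_early_stopping(X_train, y_train, X_dev, y_dev)
print(f"ran {len(history)} iterations; best dev loss "
      f"{best['dev_loss']:.3f} at iteration {best['iter']}")
```

The returned `best["w"]` and `best["b"]` are the parameters from the best dev set iteration, not the final ones. The alternative from takeaway 6 would instead add an L2 penalty to the cost and rerun training once per candidate value of lambda.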
