Other Regularization Methods (C2W1L08)
Key Takeaways
The video covers two regularization techniques beyond L2 regularization and dropout, namely data augmentation and early stopping, and weighs early stopping against L2 regularization as ways to reduce overfitting in neural networks.
Full Transcript
In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look.
Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive, and sometimes you just can't get more data. But what you can do is augment your training set by taking an image like this and, for example, flipping it horizontally and adding that also to your training set. So now, instead of just this one example in your training set, you can add this to your training examples. So by flipping the images horizontally, you could double the size of your training set. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand new independent examples, but you could do this without needing to pay the expense of going out to take more pictures of cats.
And then, other than flipping horizontally, you can also take random crops of the image. So here we've rotated and sort of randomly zoomed into the image, and this still looks like a cat. So by taking random distortions and translations of the image, you can augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as if you were to go out and get a brand new independent example of a cat, but because you can do this almost for free, other than for some computational cost, this can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat. Notice I didn't flip it vertically, because maybe we don't want upside-down cats, right? And then also, maybe randomly zooming in to part of the image, it's probably still a cat.
For optical character recognition, you can also augment your data set by taking digits and imposing random rotations and distortions on them. So if you add these things to your training set, these are also still digit fours. For illustration, I applied a very strong distortion, so this looks very wavy for a four. In practice you don't need to distort the four quite as aggressively; I'm showing a strong distortion here just to make the example clearer for you. A more subtle distortion is usually used in practice, because these look like really warped fours. So data augmentation can be used as a regularization technique, with an effect similar to regularization.
There's one other technique that is often used, called early stopping. So what you're going to do is, as you run gradient descent, you're going to plot either your training error, that is, the 0-1 classification error on the training set, or just the cost function J that you're optimizing, and that should decrease monotonically, like so. Because as you train, hopefully your training error or your cost function J should decrease. So with early stopping, what you do is you plot this, and you also plot your dev set error. And again, this could be a classification error on the dev set, or something like the cost function, like the logistic loss or the log loss, evaluated on the dev set. Now what you find is that your dev set error will usually go down for a while and then it will increase from there. So what early stopping does is you say, well, it looks like your neural network was doing best around that iteration, so we're just going to stop training your neural network halfway and take whatever value achieved this dev set error.
So why does this work? Well, when you haven't run many iterations for your neural network yet, your parameters W will be close to zero, because with random initialization you probably initialized W to small random values. So before you train for a long time, W is still quite small, and as you iterate, as you train, W will get bigger and bigger and bigger, and so here maybe you have a much larger value of the parameters W for your neural network. So what early stopping does is, by stopping halfway, you have only a mid-size W. And so, similar to L2 regularization, by picking a neural network with a smaller norm for your parameters W, hopefully your neural network is overfitting less. And the term early stopping refers to the fact that you're just stopping the training of your neural network early.
I sometimes use early stopping when training a neural network, but it does have one downside. Let me explain. I think of the machine learning process as comprising several different steps. One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent, and then we'll talk later about other algorithms like momentum, RMSprop, Adam, and so on. But after optimizing the cost function J, you also want to not overfit, and we have some tools to do that, such as regularization, getting more data, and so on. Now, in machine learning we already have so many hyperparameters to search over, and it's already very complicated to choose among the space of possible algorithms. And so I find machine learning easier to think about when you have one set of tools for optimizing the cost function J, and when you're focusing on optimizing the cost function J, all you care about is finding W and b so that J(W, b) is as small as possible. You just don't think about anything else other than reducing this. And then it's a completely separate task to not overfit, in other words to reduce variance, and when you're doing that you have a separate set of tools for doing it. This principle is sometimes called orthogonalization, and this is the idea that you want to be able to think about one task at a time. I'll say more about orthogonalization in a later video, so if you don't fully get the concept yet, don't worry about it.
But to me the main downside of early stopping is that it couples these two tasks, so you can no longer work on these two problems independently. By stopping gradient descent early, you're sort of breaking whatever you're doing to optimize the cost function J, because now you're not doing a great job reducing the cost function J, and then you're also simultaneously trying to not overfit. So instead of using different tools to solve the two problems, you're using one tool that kind of mixes the two, and this just makes the set of things you could try more complicated to think about.
Rather than using early stopping, one alternative is to just use L2 regularization; then you can just train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. But the downside of this is that you might have to try a lot of values of the regularization parameter lambda, and so this makes searching over many values of lambda more computationally expensive. And the real advantage of early stopping is that, running the gradient descent process just once, you get to try out values of small W, mid-size W, and large W without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn't completely make sense yet, don't worry about it; we'll talk about orthogonalization in greater detail in a later video, and I think this will make a bit more sense then.
Despite its disadvantages, many people do use it. I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda.
So you've now seen how to use data augmentation as well as, if you wish, early stopping in order to reduce variance or prevent overfitting in your neural network. Next, let's talk about some techniques for setting up your optimization problem to make your training go quickly.
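As a rough illustration of the horizontal flip and random crop described in the transcript, here is a minimal Python sketch. It is not code from the lecture: the function names (`flip_horizontal`, `random_crop`, `augment`) and the toy 3x4 grid standing in for a cat photo are assumptions made for this example, and images are represented simply as lists of pixel rows.

```python
import random

def flip_horizontal(image):
    """Mirror an image left to right; `image` is a list of pixel rows."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(examples):
    """Return the original (image, label) pairs plus their mirrored copies."""
    augmented = list(examples)
    for image, label in examples:
        # A mirrored cat is still a cat, so the label is left unchanged.
        augmented.append((flip_horizontal(image), label))
    return augmented

# Hypothetical toy "image": a 3x4 grid of pixel values, labeled 1 for "cat".
cat = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]

train_set = augment([(cat, 1)])           # doubles the training set
crop = random_crop(cat, 2, 2, random.Random(0))
```

The flipped copies double the number of examples, but as the transcript notes they are partly redundant with the originals, so they add less information than the same number of freshly collected images.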
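The early stopping rule from the transcript (watch the dev set error and keep the iteration where it was lowest) can be sketched as below. This is only an illustration under stated assumptions: the `patience` parameter, which tolerates a few non-improving evaluations before stopping, is a common practical addition that the lecture does not mention, and the dev error values are made up for the example.

```python
def early_stopping_index(dev_errors, patience=3):
    """Return the index of the lowest dev error seen before stopping.

    Training is considered stopped once `patience` evaluations in a row
    fail to improve on the best dev error so far (an assumed rule, not
    one given in the lecture).
    """
    best_index = 0
    best_error = float("inf")
    waited = 0
    for i, error in enumerate(dev_errors):
        if error < best_error:
            # New best: remember it and reset the waiting counter.
            best_index, best_error, waited = i, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_index

# Made-up dev set errors: they fall for a while, then rise again.
dev_errors = [0.40, 0.31, 0.25, 0.22, 0.21, 0.23, 0.26, 0.30, 0.35]
best = early_stopping_index(dev_errors, patience=3)
```

In a real training loop you would also save a copy of the parameters W and b at each new best evaluation, so that the model from that iteration can be restored after stopping.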
Original Description
Take the Deep Learning Specialization: http://bit.ly/3cAd49Y
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai