Tuning Process (C2W3L01)
Key Takeaways
The video discusses the process of hyperparameter tuning in deep learning, including guidelines for systematically organizing the hyperparameter tuning process and tips for efficiently converging on a good setting of hyperparameters. The video covers topics such as the importance of different hyperparameters, random search vs grid search, and coarse-to-fine search.
Full Transcript
Hi, and welcome back. You've seen by now that training neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters.

One of the painful things about training deep nets is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha, to the momentum term beta if you're using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta 1, beta 2, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then, of course, you might need to choose the mini-batch size.

It turns out some of these hyperparameters are more important than others. For most learning applications, I would say alpha, the learning rate, is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I would maybe tune next would be the momentum term (I'd say 0.9 is a good default), and I'd also tune the mini-batch size to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the number of hidden units. Of the ones I've circled in orange, these are really the three that I would consider second in importance to the learning rate alpha. And then third in importance, after fiddling around with the others: the number of layers can sometimes make a huge difference, and so can learning rate decay. And then, when using the Adam algorithm, I actually pretty much never tune beta 1, beta 2, and epsilon. I pretty much always use 0.9, 0.999, and 10^-8, although you can try tuning those as well if you wish.

But hopefully this does give you some rough sense of which hyperparameters might be more important than others: alpha most important for sure, followed maybe by the ones I've circled in orange, followed maybe by the ones I circled in purple. But this isn't a hard and fast rule, and I think other deep learning practitioners may well disagree with me or have different intuitions on these.

Now, if you're trying to tune some set of hyperparameters, how do you select the set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'm calling hyperparameter 1 and hyperparameter 2 here, it was common practice to sample the points in a grid, like so, and systematically explore these values. Here I'm placing down a five-by-five grid. In practice it could be more or less than a five-by-five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. And this practice works okay when the number of hyperparameters is relatively small.

In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, 25 points, and then try out the hyperparameters on this randomly chosen set of points. The reason you do that is that it's difficult to know in advance which hyperparameters are going to be the most important for your problem. And as you saw in the previous slide, some hyperparameters are actually much more important than others.

To take an example, let's say hyperparameter 1 turns out to be alpha, the learning rate. And to take an extreme example, let's say that hyperparameter 2 is that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot, and your choice of epsilon hardly matters. If you sample in a grid, then you've really tried out only five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you've now trained 25 models and only got to try out five values for the learning rate alpha, which is the thing that's really important. Whereas, in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and therefore you'd be more likely to find a value that works really well.

I've explained this example using just two hyperparameters. In practice, you might be searching over many more hyperparameters than this. So if you have, say, three hyperparameters, I guess instead of searching over a square, you're searching over a cube, where this third dimension is hyperparameter 3. And then, by sampling within this three-dimensional cube, you get to try out a lot more values of each of your three hyperparameters. In practice you might be searching over even more hyperparameters than three, and sometimes it's just hard to know in advance which ones turn out to be the really important hyperparameters for your application. Sampling at random rather than in a grid ensures that you're more richly exploring the set of possible values for the most important hyperparameters, whatever they turn out to be.

When you sample hyperparameters, another common practice is to use a coarse-to-fine sampling scheme. Let's say, in this two-dimensional example, that you've sampled these points, and maybe you found that this point worked the best, and maybe a few other points around it tended to work really well. Then, in the coarse-to-fine scheme, what you might do is zoom in to a smaller region of the hyperparameters and then sample more densely within this space, maybe again at random, but to then focus more resources on searching within this blue square, if you're suspecting that the best setting of the hyperparameters may be in this region. So after doing a coarse sample of this entire square, that tells you to then focus on a smaller square. You can then sample more densely in this smaller square. This type of coarse-to-fine search is also frequently used.

And by trying out these different values of the hyperparameters, you can then pick whatever value allows you to do best on your training set objective, or does best on your development set, or whatever you're trying to optimize in your hyperparameter search process.

So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are: use random sampling, not a grid search, and optionally consider implementing a coarse-to-fine search process. But there's even more to hyperparameter search than this. Let's talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
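
The two takeaways above (random sampling instead of a grid, then an optional coarse-to-fine zoom) can be illustrated with a minimal Python sketch. It is not code from the lecture. The function names, the unit-square search box, the shrink factor of 0.5, and the toy_score function (a made-up stand-in for dev-set performance in which only hyperparameter 1 matters) are all assumptions chosen for illustration. Points are sampled uniformly here; the choice of sampling scale is left to the next video.

```python
import random


def grid_points(n_per_axis, low=0.0, high=1.0):
    """Evenly spaced n_per_axis x n_per_axis grid over [low, high]^2."""
    step = (high - low) / (n_per_axis - 1)
    axis = [low + i * step for i in range(n_per_axis)]
    return [(h1, h2) for h1 in axis for h2 in axis]


def random_points(n, bounds, rng):
    """n points drawn uniformly at random inside the box `bounds`,
    where bounds is a list of (low, high) pairs, one per hyperparameter."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]


def toy_score(point):
    """Hypothetical stand-in for dev-set performance (higher is better).
    Only hyperparameter 1 matters, mimicking the alpha-vs-epsilon example."""
    h1, _h2 = point
    return -(h1 - 0.37) ** 2


def coarse_to_fine(score, bounds, n_per_round, rounds, shrink, rng):
    """Sample at random, then repeatedly zoom in on a smaller box
    centred on the best point found so far."""
    best = None
    for _ in range(rounds):
        candidates = random_points(n_per_round, bounds, rng)
        if best is not None:
            candidates.append(best)  # never lose the best point seen so far
        best = max(candidates, key=score)
        # New box: centred on `best`, each side scaled by `shrink`,
        # clipped so it stays inside the previous box.
        zoomed = []
        for x, (lo, hi) in zip(best, bounds):
            half = (hi - lo) * shrink / 2
            zoomed.append((max(lo, x - half), min(hi, x + half)))
        bounds = zoomed
    return best, bounds


rng = random.Random(0)  # fixed seed so the run is repeatable
unit_square = [(0.0, 1.0), (0.0, 1.0)]

grid = grid_points(5)
rand = random_points(25, unit_square, rng)

# Both strategies train 25 models, but they differ in how many
# distinct values of hyperparameter 1 get tried.
grid_h1 = {p[0] for p in grid}
rand_h1 = {p[0] for p in rand}
print("grid:  ", len(grid), "points,", len(grid_h1), "distinct values of h1")
print("random:", len(rand), "points,", len(rand_h1), "distinct values of h1")

best, final_bounds = coarse_to_fine(
    toy_score, unit_square, n_per_round=25, rounds=3, shrink=0.5, rng=rng
)
print("best point after coarse-to-fine:", best)
print("final search box:", final_bounds)
```

In a real search, toy_score would be replaced by a function that trains a model with the given hyperparameters and returns whatever metric you are optimizing, for example dev-set accuracy. The grid spends its 25 runs on only five values of hyperparameter 1, while the random sample spreads them over 25.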
Original Description
Take the Deep Learning Specialization: http://bit.ly/2TvWKhI
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai