Tuning Process (C2W3L01)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Key Takeaways

The video discusses the process of hyperparameter tuning in deep learning, including guidelines for systematically organizing the hyperparameter tuning process and tips for efficiently converging on a good setting of hyperparameters. The video covers topics such as the importance of different hyperparameters, random search vs grid search, and coarse-to-fine search.

Full Transcript

Hi, and welcome back. You've seen by now that training neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters.

One of the painful things about training deep nets is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha, to the momentum term beta if you're using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta 1, beta 2, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then, of course, you might need to choose the mini-batch size.

It turns out some of these hyperparameters are more important than others. For most learning applications, I would say alpha, the learning rate, is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I tend to maybe tune next would be the momentum term (say, 0.9 is a good default). I'd also tune the mini-batch size to make sure that the optimization algorithm is running efficiently, and I often also fiddle around with the number of hidden units. Of the ones I've circled in orange, these are really the three that I would consider second in importance to the learning rate alpha. Then third in importance, after fiddling around with the others: the number of layers can sometimes make a huge difference, and so can learning rate decay. And when using the Adam algorithm, I actually pretty much never tune beta 1, beta 2, and epsilon. I pretty much always use 0.9, 0.999, and 10^-8, although you can try tuning those as well if you wish.

But hopefully this does give you some rough sense of which hyperparameters might be more important than others: alpha most important for sure, followed maybe by the ones I've circled in orange, followed maybe by the ones I circled in purple. But this isn't a hard and fast rule, and I think other deep learning practitioners may well disagree with me or have different intuitions on these.

Now, if you're trying to tune some set of hyperparameters, how do you select the set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'm calling hyperparameter 1 and hyperparameter 2 here, it was common practice to sample the points in a grid, like so, and systematically explore these values. Here I'm placing down a five by five grid. In practice it could be more or less than a five by five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. This practice works okay when the number of hyperparameters is relatively small.

In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, 25 points, and then try out the hyperparameters on this randomly chosen set of points. The reason you do that is that it's difficult to know in advance which hyperparameters are going to be the most important for your problem, and as you saw in the previous slide, some hyperparameters are actually much more important than others.

To take an example, let's say hyperparameter 1 turns out to be alpha, the learning rate. And to take an extreme example, let's say that hyperparameter 2 was that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot and your choice of epsilon hardly matters. If you sample in a grid, then you've really tried out only five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you've now trained 25 models and only got to try out five values for the learning rate alpha, which is the thing that's really important. In contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and therefore you'd be more likely to find a value that works really well.

I've explained this example using just two hyperparameters. In practice, you might be searching over many more hyperparameters than this. If you have, say, three hyperparameters, then instead of searching over a square you're searching over a cube, where this third dimension is hyperparameter 3, and by sampling within this three-dimensional cube you get to try out a lot more values of each of your three hyperparameters. In practice you might be searching over even more hyperparameters than three, and sometimes it's just hard to know in advance which ones turn out to be the really important hyperparameters for your application. Sampling at random rather than in a grid ensures that you're more richly exploring the set of possible values for the most important hyperparameters, whatever they turn out to be.

When you sample hyperparameters, another common practice is to use a coarse-to-fine sampling scheme. Let's say in this two-dimensional example that you've sampled these points, and maybe you found that this point worked the best, and maybe a few other points around it tended to work really well. Then in the coarse-to-fine scheme, what you might do is zoom in to a smaller region of the hyperparameters and then sample more densely within this space, or maybe again at random, but then focus more resources on searching within this blue square if you suspect that the best setting of the hyperparameters may be in this region. So after doing a coarse sample of this entire square, which tells you to then focus on a smaller square, you can then sample more densely in this smaller square. This type of coarse-to-fine search is also frequently used.

By trying out these different values of the hyperparameters, you can then pick whatever value allows you to do best on your training set objective, or does best on your development set, or whatever you're trying to optimize in your hyperparameter search process.

So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are: use random sampling, not a grid search, and consider, optionally, implementing a coarse-to-fine search process. But there's even more to hyperparameter search than this. Let's talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
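
The grid-versus-random argument in the transcript can be checked with a small sketch. The code below is illustrative only and is not taken from the video: the ranges for alpha and epsilon and the helper names `grid_points` and `random_points` are assumptions, and values are drawn uniformly for simplicity (choosing a suitable sampling scale is the subject of the next video). It counts how many distinct values of alpha each strategy tries within the same budget of 25 models.

```python
import itertools
import random

# Hypothetical search ranges, chosen only for illustration.
ALPHA_RANGE = (0.0001, 1.0)
EPSILON_RANGE = (1e-9, 1e-7)


def grid_points(n_per_axis=5):
    """Return n_per_axis**2 (alpha, epsilon) pairs laid out on a regular grid."""
    def axis(lo, hi):
        step = (hi - lo) / (n_per_axis - 1)
        return [lo + i * step for i in range(n_per_axis)]
    return list(itertools.product(axis(*ALPHA_RANGE), axis(*EPSILON_RANGE)))


def random_points(n_points=25, seed=0):
    """Return n_points (alpha, epsilon) pairs drawn uniformly at random."""
    rng = random.Random(seed)
    return [(rng.uniform(*ALPHA_RANGE), rng.uniform(*EPSILON_RANGE))
            for _ in range(n_points)]


grid = grid_points()
rand = random_points()

# Count how many different learning rates each strategy actually explores.
distinct_alpha_grid = len({alpha for alpha, _ in grid})
distinct_alpha_random = len({alpha for alpha, _ in rand})

print("grid:  ", len(grid), "models,", distinct_alpha_grid, "distinct alphas")
print("random:", len(rand), "models,", distinct_alpha_random, "distinct alphas")
```

With the same budget of 25 models, the grid covers only 5 distinct learning rates while the random sample covers 25, which is the point made in the transcript for the case where alpha matters a lot and epsilon hardly matters.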

Original Description

Take the Deep Learning Specialization: http://bit.ly/2TvWKhI
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai


This video teaches the importance of hyperparameter tuning in deep learning and provides guidelines for systematically organizing the hyperparameter tuning process. It covers topics such as random search vs grid search and coarse-to-fine search, and provides tips for efficiently converging on a good setting of hyperparameters.

Key Takeaways
  1. Identify the most important hyperparameters for your deep learning model
  2. Prefer random sampling over grid search when choosing which hyperparameter values to try
  3. Optionally, implement a coarse-to-fine search process
  4. Evaluate the performance of your model with different hyperparameter settings
  5. Select the best hyperparameter setting based on your evaluation metric
💡 Random search is often more efficient than grid search for hyperparameter tuning, especially when there are many hyperparameters to tune.
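
A coarse-to-fine search can be sketched as repeated random sampling inside a box that shrinks around the best point found so far. This is a minimal sketch under stated assumptions, not the method shown in the video: `toy_dev_score` is a hypothetical stand-in for training a model and scoring it on a development set, and the shrink factor, number of rounds, and ranges are arbitrary illustrative choices.

```python
import random


def toy_dev_score(alpha, beta):
    """Hypothetical stand-in for training a model and scoring it on a dev set.

    Higher is better; the peak is at alpha=0.3, beta=0.9 by construction.
    """
    return -((alpha - 0.3) ** 2) - ((beta - 0.9) ** 2)


def sample_box(box, n_points, rng):
    """Draw n_points uniformly at random from a box {name: (low, high)}."""
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in box.items()}
            for _ in range(n_points)]


def shrink_box(box, center, factor=0.25):
    """Return a smaller box around center, clipped to the current bounds."""
    new_box = {}
    for name, (lo, hi) in box.items():
        half_width = (hi - lo) * factor / 2
        new_box[name] = (max(lo, center[name] - half_width),
                         min(hi, center[name] + half_width))
    return new_box


def coarse_to_fine(box, score_fn, n_points=25, rounds=3, seed=0):
    """Random search that zooms in on the best point after each round."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for point in sample_box(box, n_points, rng):
            score = score_fn(**point)
            if score > best_score:
                best, best_score = point, score
        # Focus the next round on a smaller region around the current best.
        box = shrink_box(box, best)
    return best, best_score


coarse_box = {"alpha": (0.0, 1.0), "beta": (0.5, 1.0)}
best, best_score = coarse_to_fine(coarse_box, toy_dev_score)
print(best, best_score)
```

In a real search, `toy_dev_score` would be replaced by whatever metric you are optimizing, and each evaluation would cost a full training run, so the number of points per round is usually set by your compute budget.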
