Tuning Process (C2W3L01)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago


Key Takeaways

The video discusses the process of hyperparameter tuning in deep learning, including guidelines for systematically organizing the hyperparameter tuning process and tips for efficiently converging on a good setting of hyperparameters. The video covers topics such as the importance of different hyperparameters, random search vs grid search, and coarse-to-fine search.
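As a rough aid, the importance ranking given in the transcript below can be written down as a small lookup table. This is only a sketch: the dictionary names, the key names, and the numeric tier labels are illustrative assumptions, while the ordering and the default values (0.9 for momentum; 0.9, 0.999, and 10^-8 for Adam) come from the lecture. The speaker stresses that this ordering is not a hard and fast rule.

```python
# Tiers follow the order described in the lecture; the tier numbers and
# key names are hypothetical labels chosen for this sketch.
TUNING_PRIORITY = {
    1: ["learning_rate"],                                    # alpha: most important
    2: ["momentum_beta", "mini_batch_size", "hidden_units"],
    3: ["num_layers", "learning_rate_decay"],
    4: ["adam_beta1", "adam_beta2", "adam_epsilon"],         # "pretty much never" tuned
}

# Defaults quoted in the lecture.
DEFAULTS = {
    "momentum_beta": 0.9,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}


def tuning_order(priority=TUNING_PRIORITY):
    """Flatten the tiers into a single list, most important first."""
    return [name for tier in sorted(priority) for name in priority[tier]]


print(tuning_order())
```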

Full Transcript

Hi, and welcome back. You've seen by now that training neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters.

One of the painful things about training deep nets is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha, to the momentum term beta if using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta 1, beta 2, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then of course you might need to choose the mini-batch size.

It turns out some of these hyperparameters are more important than others. For most learning applications, I would say alpha, the learning rate, is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I would maybe tune next would be the momentum term, where say 0.9 is a good default. I'd also tune the mini-batch size to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the hidden units. Of the ones I've circled in orange, these are really the three that I would consider second in importance to the learning rate alpha. Then third in importance, after fiddling around with the others: the number of layers can sometimes make a huge difference, and so can learning rate decay. And then when using the Adam algorithm, I actually pretty much never tune beta 1, beta 2, and epsilon. I pretty much always use 0.9, 0.999, and 10 to the minus 8, although you can try tuning those as well if you wish.

But hopefully this does give you some rough sense of which hyperparameters might be more important than others: alpha most important for sure, followed maybe by the ones I've circled in orange, followed maybe by the ones I circled in purple. But this isn't a hard and fast rule, and I think other deep learning practitioners may well disagree with me or have different intuitions on these.

Now, if you're trying to tune some set of hyperparameters, how do you select the set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'm calling hyperparameter 1 and hyperparameter 2 here, it was common practice to sample the points in a grid, like so, and systematically explore these values. Here I'm placing down a five by five grid. In practice it could be more or less than a five by five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. This practice works okay when the number of hyperparameters is relatively small.

In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, 25 points, and then try out the hyperparameters on this randomly chosen set of points. The reason you do that is that it's difficult to know in advance which hyperparameters are going to be the most important for your problem. And as you saw in the previous slide, some hyperparameters are actually much more important than others.

To take an example, let's say hyperparameter 1 turns out to be alpha, the learning rate. And to take an extreme example, let's say that hyperparameter 2 is that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot and your choice of epsilon hardly matters. If you sample in a grid, then you've really tried out only five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you've now trained 25 models and only got to try out five values for the learning rate alpha, which is the thing that's really important. Whereas in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and therefore you'd be more likely to find a value that works really well.

I've explained this example using just two hyperparameters. In practice you might be searching over many more hyperparameters than this. So if you have, say, three hyperparameters, I guess instead of searching over a square, you're searching over a cube, where this third dimension is hyperparameter 3. Then by sampling within this three-dimensional cube, you get to try out a lot more values of each of your three hyperparameters. And in practice you might be searching over even more hyperparameters than three, and sometimes it's just hard to know in advance which ones turn out to be the really important hyperparameters for your application. Sampling at random rather than in a grid ensures that you're more richly exploring the set of possible values for the most important hyperparameters, whatever they turn out to be.

When you sample hyperparameters, another common practice is to use a coarse-to-fine sampling scheme. Let's say in this two-dimensional example that you've sampled these points, and maybe you found that this point worked the best, and maybe a few other points around it tended to work really well. Then in the coarse-to-fine scheme, what you might do is zoom in to a smaller region of the hyperparameters and then sample more densely within this space, maybe again at random, but to then focus more resources on searching within this blue square if you suspect that the best setting of the hyperparameters may be in this region. So after doing a coarse sample of this entire square, which tells you to then focus on a smaller square, you can then sample more densely in this smaller square. This type of coarse-to-fine search is also frequently used.

And by trying out these different values of the hyperparameters, you can then pick whatever value allows you to do best on your training set objective, or does best on your development set, or whatever you're trying to optimize in your hyperparameter search process.

So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are: use random sampling, not a grid search, and optionally consider implementing a coarse-to-fine search process. But there's even more to hyperparameter search than this. Let's talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
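The grid-versus-random argument in the transcript can be checked with a few lines of NumPy. The sketch below builds a 5 by 5 grid and a set of 25 random points over two hyperparameters, alpha and epsilon, and counts how many distinct values of alpha each approach tries. The function names and the numeric ranges are assumptions made up for illustration, and uniform sampling is used only for simplicity; the lecture defers the question of the right sampling scale to the next video.

```python
import numpy as np


def grid_points(alpha_range, eps_range, per_axis=5):
    """Return per_axis**2 (alpha, epsilon) pairs laid out on a regular grid."""
    alphas = np.linspace(*alpha_range, per_axis)
    epsilons = np.linspace(*eps_range, per_axis)
    return [(a, e) for a in alphas for e in epsilons]


def random_points(alpha_range, eps_range, n_points=25, seed=0):
    """Return n_points (alpha, epsilon) pairs drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    alphas = rng.uniform(*alpha_range, size=n_points)
    epsilons = rng.uniform(*eps_range, size=n_points)
    return list(zip(alphas, epsilons))


# Hypothetical search ranges, chosen only for this illustration.
ALPHA_RANGE = (0.0001, 0.1)
EPS_RANGE = (1e-9, 1e-7)

grid = grid_points(ALPHA_RANGE, EPS_RANGE)
rand = random_points(ALPHA_RANGE, EPS_RANGE)

# Both approaches train the same number of models...
print("models trained:", len(grid), len(rand))

# ...but they differ in how many distinct learning rates get tried.
distinct_alphas_grid = len({a for a, _ in grid})
distinct_alphas_random = len({a for a, _ in rand})
print("distinct alpha values (grid):  ", distinct_alphas_grid)
print("distinct alpha values (random):", distinct_alphas_random)
```

With the same budget of 25 trained models, the grid only ever tries five learning rates, while random sampling tries a different one in every run, which is the point the lecture makes about alpha mattering far more than epsilon.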

Original Description

Take the Deep Learning Specialization: http://bit.ly/2TvWKhI Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches the importance of hyperparameter tuning in deep learning and provides guidelines for systematically organizing the hyperparameter tuning process. It covers topics such as random search vs grid search and coarse-to-fine search, and provides tips for efficiently converging on a good setting of hyperparameters.

Key Takeaways

Identify the most important hyperparameters for your deep learning model
Choose a search method (the lecture recommends random sampling over grid search)
Optionally, implement a coarse-to-fine search process
Evaluate the performance of your model with different hyperparameter settings
Select the best hyperparameter setting based on your evaluation metric
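The steps above can be sketched as a two-stage random search: sample coarsely over the whole region, zoom in around the best point, and sample again inside the smaller region. This is a minimal illustration rather than the lecturer's implementation. All function names, the shrink factor of 0.25, and the toy scoring function (a stand-in for "train a model and measure dev-set performance") are assumptions introduced here.

```python
import numpy as np


def sample_box(bounds, n_points, rng):
    """Draw n_points uniformly at random inside an axis-aligned box.

    bounds is a list of (low, high) pairs, one per hyperparameter.
    """
    lows = np.array([lo for lo, _ in bounds])
    highs = np.array([hi for _, hi in bounds])
    return rng.uniform(lows, highs, size=(n_points, len(bounds)))


def shrink_box(bounds, center, factor=0.25):
    """Return a smaller box around center, clipped to the original bounds."""
    new_bounds = []
    for (lo, hi), c in zip(bounds, center):
        half_width = 0.5 * factor * (hi - lo)
        new_bounds.append((max(lo, c - half_width), min(hi, c + half_width)))
    return new_bounds


def coarse_to_fine(score_fn, bounds, n_coarse=25, n_fine=25, seed=0):
    """Two-stage random search; higher scores are treated as better."""
    rng = np.random.default_rng(seed)

    # Stage 1: coarse random sample over the full region.
    coarse = sample_box(bounds, n_coarse, rng)
    coarse_scores = np.array([score_fn(p) for p in coarse])

    # Stage 2: zoom in around the best coarse point and sample more densely.
    fine_bounds = shrink_box(bounds, coarse[np.argmax(coarse_scores)])
    fine = sample_box(fine_bounds, n_fine, rng)
    fine_scores = np.array([score_fn(p) for p in fine])

    # Pick the best setting seen in either stage.
    points = np.vstack([coarse, fine])
    scores = np.concatenate([coarse_scores, fine_scores])
    best = int(np.argmax(scores))
    return points[best], float(scores[best]), fine_bounds


def toy_dev_score(point):
    """Hypothetical stand-in for dev-set performance; peaks at (0.3, 0.7)."""
    target = np.array([0.3, 0.7])
    return -float(np.sum((point - target) ** 2))


BOUNDS = [(0.0, 1.0), (0.0, 1.0)]
best_point, best_score, fine_bounds = coarse_to_fine(toy_dev_score, BOUNDS)
print("best setting:", best_point)
print("best score:  ", best_score)
print("zoomed region:", fine_bounds)
```

In a real search, `toy_dev_score` would be replaced by whatever quantity is being optimized, such as development-set accuracy, and the budgets for each stage would depend on how expensive a single training run is.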

💡 Random search is often more efficient than grid search for hyperparameter tuning, especially when there are many hyperparameters to tune.
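A small piece of arithmetic shows why the gap widens as hyperparameters are added. A grid with k points per axis over d hyperparameters costs k^d training runs but only ever tries k distinct values of any single hyperparameter, whereas the same number of random draws from a continuous range tries (almost surely) k^d distinct values of each one. The helper below is a hypothetical illustration of that count, not something taken from the lecture.

```python
def distinct_values_tried(points_per_axis, n_hyperparameters):
    """Compare grid and random search at the same training budget.

    Returns the number of models trained and how many distinct values of
    any one hyperparameter each strategy tries. The random-search count
    assumes continuous sampling, so repeated values have probability zero.
    """
    n_trials = points_per_axis ** n_hyperparameters
    return {
        "trials": n_trials,
        "grid_distinct": points_per_axis,
        "random_distinct": n_trials,
    }


for d in (2, 3, 4):
    print(d, "hyperparameters:", distinct_values_tried(5, d))
```

For the two-hyperparameter example in the lecture this gives 25 trials with 5 distinct grid values versus 25 random ones; with three hyperparameters the budget grows to 125 trials while the grid still tries only 5 values per hyperparameter.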
