Hyperparameter Importance | PyTorch Developer Day 2020
Key Takeaways
Examines hyperparameter importance and how to tune the most important hyperparameters of PyTorch code using Optuna
Full Transcript
[Music] Hello, my name is Crissman Loomis, and I'm an engineer at Preferred Networks. Today I'll be presenting on hyperparameter importance.

So, a little bit more about myself. I live in Japan and, as mentioned, have been working for a company called Preferred Networks, which is one of the premier industrial deep learning companies in Japan. Previously I worked on the Chainer team within Preferred Networks. Chainer is one of the precursors of PyTorch and helped define the define-by-run approach. Now I'm on the auto machine learning team, specifically working on the Optuna team.

To give you an overview of what we'll be talking about tonight: the first thing we're going to talk about is what hyperparameters are. From there we'll talk about their impact on the performance of your algorithms, and then about current usage, how people usually tune hyperparameters today. After that we'll go into some of the criteria for choosing which hyperparameters you should tune, and then talk more about Optuna, which is a particular example of a hyperparameter optimization framework that goes very well with Python programs, and specifically with PyTorch, and we'll see how it could look inside of PyTorch itself. Then we'll talk about how many hyperparameters you should tune at a time and how you can go about choosing them.

So, starting with what hyperparameters are. Typically in the deep learning world you might think of hyperparameters as the number of layers, or the number of units within each layer, or perhaps the learning rate of your optimizer. Generally they control the behavior of the algorithms, and they determine the performance of those algorithms, how well they actually do. They're typically set manually, and we'll say more about that in a minute, but they also determine the success or failure of your overall algorithms. So object detection with a bad threshold hyperparameter, as you can see on the left, can produce too many bounding boxes, while a good threshold hyperparameter can provide a much cleaner result, where you have basically one bounding box per object.

But these are not the only hyperparameters. There might be more than you thought of before. If you just take a look at the image itself, there's the encoding that's used for the image, also the order, the image sizes that are used, or the JPEG decoder. Then within the neural network trainer there's the batch size; which optimizer is chosen, stochastic gradient descent, Adam, momentum, or others; and then the learning rate that's used by that optimizer itself. In this particular case, since we're looking at a detector model, there's all of the visual information for the CNN: the backbone architecture, whether VGG or ResNet, the kernel size that's used to go over the image, batch normalization order, and other things. And then down at the hardware layer you might be looking at whether you want to use FP16 or FP32 or mixed precision, or on an NVIDIA GPU you might be looking at what CUDA kernel parameters you want to use. But generally, swimming in too many of these can then cause a different issue.

But let's take a look at the impact. We looked at a hyperparameter optimization paper, a review of algorithms and applications, and found that if you compare random search with Bayesian optimization, with Hyperband, and with Bayesian optimization combined with Hyperband, it could provide almost a 20 times speedup, and this advantage persisted at the long time frames as well and could actually increase to up to a 50-plus times speedup. So the hyperparameters make a great difference to the overall performance.

But if you look at current practice, there was a survey on machine learning experimental methods at NeurIPS 2019 and ICLR 2020, and it found that of all the people who were working with programs that have hyperparameters, the majority of them were either not tuning them, tuning them manually, or using a random search, and that only about six percent of the people were using a hyperparameter optimization framework. So given the huge benefits that can be available by tuning the hyperparameters, we think this is a real opportunity for improving the general performance of PyTorch and deep learning.

But the problem is that if you try to optimize all of those hyperparameters simultaneously, you're going to run into the curse of dimensionality. With all those hyperparameters being tuned at the same time, the search space becomes too highly dimensional, and it will take a long time for any hyperparameter optimization framework to find the best hyperparameters. So in order to combat this, we took a look at a paper, "An Efficient Approach for Assessing Hyperparameter Importance". Hyperparameter importance is basically a way that you can find out which of the hyperparameters make the most difference to the overall performance of your algorithms. We then implemented this hyperparameter importance in the Optuna framework, which we believe is the next-generation hyperparameter optimization framework, because it allows you not only to optimize your hyperparameters but, using hyperparameter importance, to select the most important ones to work with.

So let's take a look at how this could look in PyTorch. In the model definition, basically, we have to define a trial. This is a simple MNIST example, and around the fifth line you see that the out features uses the input trial object that was passed into the function and asks the trial to suggest an integer for the number of units in the layer, between 4 and 128. It also uses a categorical list of either ReLU or tanh for the activation of each layer, and then the next line provides a float for the dropout value, which ranges from 0.2 to 0.5. Notice that all of these hyperparameters are defined within the actual code itself, in a define-by-run kind of way, where it's very intuitive to see what the range is, and it's defined using Pythonic syntax for easy troubleshooting and definition.

Then as we go on to the objective function, you can see that within the objective function as well we need to have a trial object which is passed in, and this objective function is then used by Optuna to evaluate how well the trial performed. In this one we have an optimizer name, which is also given by the trial object with suggest categorical, which picks the optimizer from a list of Adam, RMSprop, and stochastic gradient descent. The next line gives us a log uniform, which is a float that varies logarithmically, to give us the full range of possible learning rates between 10 to the negative fifth and 10 to the negative one.

So then, looking at the results of this simple MNIST example, we find some things that maybe confirm what you might have guessed, which is that learning rate is the most important hyperparameter. But then the next most important hyperparameter, at half the importance of the learning rate, is actually the number of units in the very first layer. The number of units in the second layer is much less important, less than a fifth as important. The third most important hyperparameter is which optimizer was used, which you might have guessed was an important factor, but again this is not as important as the number of units in the first layer. Beyond that, we see that the dropout in the first layer is also one of the important hyperparameters, and other hyperparameters are less so.

So how many hyperparameters do we recommend that you pick? From our experience, for reasonable time versus performance, about the top three to five hyperparameters are the best ones to focus on. The steps for tuning with hyperparameter importance are then: start by tuning all the hyperparameters you think might matter for the first 100 or so trials, to give yourself a solid baseline and to give Optuna some time to search the hyperparameter space. Then pick about three to five hyperparameters, using hyperparameter importance to see which ones have the most impact on the performance of your algorithm. Then run the rest of the trials with the time or compute that you have available to you, and hopefully win.

So as you can see, there's kind of a hyperparameter evolution. I think the first step is just using the defaults, or not tuning the hyperparameters. The next step is manually fidgeting with those hyperparameters to see which is the most important. After that, maybe using a grid search, and then hopefully using a hyperparameter optimizer like Optuna to look systematically, using Bayesian optimization, for the best hyperparameters. And then finally, hopefully, using Optuna together with hyperparameter importance, so that you narrow it down to the hyperparameters which have the most impact on your overall performance.

These are some of the resources that you can look at for more information. There's optuna.org, which is the home page for Optuna, and our GitHub at optuna/optuna. There is also an ecosystem presentation on using Optuna with PyTorch, which you can find by googling, and then we have the papers which I've referred to in this discussion. So thank you for your attention. I hope that you found Optuna interesting, and have a good rest of your PyTorch Dev Day. Thank you.
Original Description
Hyperparameters are manual, often hard-coded, settings in programming, but many programmers don't use a hyperparameter optimizer. In this talk, engineer and business developer Crissman Loomis examines what hyperparameters are, how to find out what the most important hyperparameters for your PyTorch code are, and how to tune them using Optuna.