Directions in ML: "Neural architecture search: Coming of age"

Microsoft Research · Advanced · 📄 Research Papers Explained · 5y ago

Key Takeaways

The video discusses Neural Architecture Search (NAS) and its recent advancements, including benchmarks, best practices, and open-source frameworks, as well as various techniques and tools used in NAS, such as DARTS, NAS-Bench-101, and surrogate models.

Full Transcript

Welcome. I'm Nicolo Fusi, a researcher at Microsoft Research New England in Cambridge, Massachusetts, and I want to welcome you to this talk, which is presented by the AutoML research community at MSR. Frank's talk is the second in our series this year on Directions in Machine Learning: AutoML and Automating Algorithms. For more information, including our speaker lineup and video archive, you can go to aka.ms/diml.

Frank is a full professor of machine learning at the University of Freiburg in Germany, as well as Chief Expert AutoML at the Bosch Center for Artificial Intelligence. He holds a PhD from the University of British Columbia and a Master of Science from TU Darmstadt. In the field of AutoML, Frank co-founded the ICML workshop series on AutoML in 2014 and has co-organized it every year since. He also co-authored the prominent AutoML tools Auto-WEKA and Auto-sklearn, won the first two AutoML challenges with his team, co-authored the first book on AutoML, worked extensively on efficient hyperparameter optimization and neural architecture search, and gave a NeurIPS 2018 tutorial with over 3,000 attendees. Today he's going to talk about "Neural architecture search: Coming of age".

This talk is going to be about 45 minutes long, and we will have a 15-minute live Q&A after. The recording of this talk will be publicly available, but the Q&A will not be recorded, as it's for those attending the live event only. So feel free to write questions in the chat as you think of them, and Frank will either answer them live in the chat or we'll save them for the moderated Q&A. Thank you, and enjoy.

Thanks, Nicolo, for the nice introduction and for inviting me to give a talk here. My talk today is titled "Neural architecture search: Coming of age", and I'm basically going to give a description of several different works in neural architecture search that I think are really cool, and of how the field as such is, well, coming of age. One quick note: all these slides are
available at www.automl.org/talks, and all the references I have in the talk are hyperlinks, so if you get the PDF you can get directly to all the different papers by clicking on them.

All right, to start off, let me mention that neural architecture search really is exploding. We wrote the survey at the end of 2018, published in 2019, when it was still a manageable field, and the number of publications that came out in 2019 was already four times more than in 2018, and that factor was already there before. 2016 is when the paper "Neural architecture search with reinforcement learning" was published at ICLR by Zoph and Le; that's really when neural architecture search hit the mainstream, and since then it has been exploding.

However, neural architecture search is still in the process of overcoming some childhood problems; it hasn't fully come of age yet. There was this ICLR 2020 paper, "NAS evaluation is frustratingly hard", which showed that just by playing with the settings of your training pipeline you can get better and better performance: a total difference of something like three percent, just from settings of the training pipeline such as adding DropPath, Cutout, and auxiliary towers, using more channels, using AutoAugment, and training for more epochs. So that's a three percent difference going from here to there, and the architecture space, well, that accounts for only something like one percent, or half a percent. That doesn't shed a great light on neural architecture search, but nevertheless neural architecture search has led to much better architectures being available now than previously.

Another issue in neural architecture search is that the community is quite focused on tables like this, with performance numbers like this, and I would argue that all these numbers are really incomparable, due to this issue of the settings being really important, because all of these
different papers, which is where these reported test errors come from, use different training code, different hyperparameters, different search spaces, and different evaluation schemes. So this is really very incomparable, and just how good a performance you get on CIFAR-10 doesn't really say much about how good your meta-algorithm, your neural architecture search algorithm, is. I would very much advocate for disentangling these two.

That brings me to the overview. The first part that I'm going to talk about is best practices and benchmarks for neural architecture search, in order to really help the field fully come of age and be a scientific field with good principles. The second part is about speed-up techniques for neural architecture search, because the original work by Zoph and Le with reinforcement learning used 800 GPUs for several weeks, and of course that is not something we can do on an everyday basis for all kinds of different problems in academia, and not even in industry. So speed-up techniques are very important, and this will actually be the biggest part of the talk; there will be four different speed-up techniques I'll talk about. Then I'll briefly talk about Auto-PyTorch, which helps us to jointly search in the space of neural architectures and hyperparameters in order to really give us off-the-shelf AutoML, so something that can be fairly hands-free. And finally, I'll briefly talk about an extended problem formulation where we don't search for just a single architecture but for a whole ensemble of neural architectures.

All right, let's jump right in and talk about benchmarks and best practices. The first benchmark, which we introduced one year ago at ICML, was NAS-Bench-101. This is a relatively small search space that we exhaustively evaluated in order to enable researchers to afterwards do work on neural architecture search where, if you would
normally query the performance of a neural architecture, you now go and look it up in a table. The table lookup takes less than a second, so you can run architecture search experiments on a laptop in minutes rather than taking 800 GPUs for weeks. This clearly enables proper scientific research: now we can do multiple runs, we can do robustness studies and get statistical significance, and we can also do fair head-to-head evaluations by design, because you don't have any possibility to tweak the training pipeline; the values are just saved in the table. And of course the source code and the scripts are available as well, for anyone to start from and continue.

A few more details. We looked at 423,000 architectures. This is a very small cell search space, but it's still quite large in absolute terms: some 400,000 networks being evaluated. The important thing is that this had to be done only once, and nobody has to run these 400,000 architectures again. This was really only possible with large-scale industrial resources: we did this project together with Google and used 4,000 TPUs for several months. So thanks for all that compute time; that was very helpful.

One thing we can do with such a benchmark is to properly compare different optimizers, and we did that in the NAS-Bench-101 paper. We compared Hyperband, random search, SMAC, TPE, regularized evolution, BOHB, and reinforcement learning, and noticed that regularized evolution is better than reinforcement learning, which is better than random search, but Bayesian optimization is here just as good as regularized evolution. Notice that the SMAC algorithm here was published in 2011, and it outperforms RL, which was published in 2016. But of course nobody would have told Zoph and Le in 2016 to please compare this against SMAC, or to compare it against these other
10 baselines, because with 800 GPUs for multiple weeks you can't run this many times; even their own algorithm they ran only a single time, with 800 GPUs for several weeks. Now that we have these tabular benchmarks, we can trivially run all of these algorithms many times; each of the lines here is based on 500 runs, taking the mean of those. So we can do these scientific studies using these tabular benchmarks, and they're really extremely useful.

There have been follow-up tabular benchmarks. One is from my own group, where we took subspaces of NAS-Bench-101 that have been fully evaluated. NAS-Bench-101 has the problem that, even though we had 423,000 architectures, we couldn't cover an entire search space completely: we had the constraint that we would only allow at most nine edges in the directed graph that describes the cell architecture. If you try to use 10 edges or more, there's just no architecture evaluated for that, so you can't evaluate one-shot models, which would require evaluations of all the possible instantiations of the search space. What we did in NAS-Bench-1Shot1 is to look at subspaces of NAS-Bench-101 that are fully evaluated and work for one-shot methods. There we have three different subspaces that have, respectively, 6,000, 29,000, and 363,000 architectures, so still quite large, and those work for one-shot methods; that's why it's called NAS-Bench-1Shot1.

At the same time, NAS-Bench-201 came out, by Xuanyi Dong and Yi Yang, and their approach was a bit different. They came up with a new neural architecture benchmark that has very few unique architectures, only 6,000 compared to 420,000, but the search space is very nice and very clearly aligned with the typical search spaces that you would use for one-shot methods, and therefore it's very convenient for benchmarking your one-shot methods. And what's also
very nice is that they have three datasets, so you can do meta-learning experiments on this, and they have the entire learning curve logged, so you can do learning curve extrapolation and things like that on this dataset.

Now, one problem of all of these NAS benchmarks is still that they're too small to be realistic. Even this space of 423,000 architectures is very, very small compared to the larger search spaces that are realistic. Indeed, it has been shown that local search is state of the art for these tabular benchmarks but performs very poorly on large spaces. So we're not quite at the point where we can design our methods to work well on these tabular spaces and then be quite sure that they're going to work well on the real benchmarks.

Here comes the first contribution that I want to talk about a bit more. This is an arXiv paper that actually came out today: NAS-Bench-301. The new thing here is that this is really the first benchmark for a realistic NAS space, namely exactly the one of the DARTS paper. This has 10^18 architectures, compared to less than 10^6 here, so more than a factor of 10^12 more architectures. Of course the space is ridiculously large, and we can't exhaustively evaluate it. What we do instead is fit a surrogate model rather than build a tabular benchmark. This works as follows: we just observe partial results, that is, we observe a lot of different architectures in the search space, and then we use a regression model instead of a table. This regression model can of course extrapolate to unseen architectures.

Surprisingly, maybe, at first glance, this can yield better estimates of the true performance than a tabular benchmark, and the reason for that is as follows. Tabular benchmarks still have noise: they're typically based on only one or very few runs per architecture, and these runs are running SGD. SGD is a
stochastic algorithm, and it yields quite different performance in different runs. So by just trusting the result of a single run, you're going to have an approximation error with respect to the true underlying performance of this architecture. If you view tabular benchmarks from a machine learning perspective, from a statistical estimation perspective, then basically they are models that assume complete independence of the performance of all the different architectures and put all their eggs in the one basket of the one run, or the very few runs, for exactly the architecture A in question. So when you want to look at the performance of architecture A, you look at exactly the run for architecture A and no runs of similar architectures. However, we know that there is large noise in SGD, and we know that the performance of similar architectures is similar. So from a machine learning perspective, it makes a lot of sense that a model that doesn't make this independence assumption could do better and smooth out this noise.

That's precisely what we see here on the right. This is on NAS-Bench-101, where we have three evaluations per architecture, not just one. What we look at here is the error that you would make if you just trusted one run, measured by how different the other two runs are from this one run: the two runs are the ground truth, and the one run is the estimate. We compare this ground truth either to this estimate or to the estimate of a surrogate model fit on a varying number of architectures, also using just that one seed. If you use that one seed for all the architectures and fit a model on it, then your error is actually lower than when just trusting the single run of the architecture. The break-even point is somewhere at about 15,000 architectures that you need to evaluate out of these 423,000, and then your
surrogate model can actually generalize to architectures that are not in this training set.

Right, so what did we do to build NAS-Bench-301? We evaluated 50,000 architectures in the DARTS search space, and we collected them using different optimizers. We want to cover the whole search space roughly, and we did that using random search, with 24,000 evaluations. We also wanted to cover the good parts of the space, and we didn't want to put all our eggs in one basket, so we ran different methods, such as evolution, differential evolution, regularized evolution, TPE, BANANAS, COMBO, and a variety of different one-shot methods, to collect the architectures that we would then evaluate, in order to really find the good architectures in this space, because those are the ones that we will want to look at a lot when we do optimization in the space.

We then evaluated a broad range of different regression models to fit this data, and what we found was that the best regression models were gradient boosting and graph convolutional neural networks. The graph convolutional networks were better on some tasks and gradient boosting was better on others; the details are in the paper. But the important thing to notice is that we can actually fit this data, and, just like for NAS-Bench-101, our estimation errors are lower than the error that's due to the noise in a single run of SGD. So even if we could fill a tabular benchmark for these 10^18 different configurations, the estimation error of that would be higher than the estimation error of our model. Of course, we can only show that for the architectures that are in our test set, but those are tens of thousands of architectures, so we trust it quite a bit.

All right, what can we do with NAS-Bench-301? Well, very similar things as with NAS-Bench-101. We can, for example, benchmark different NAS methods, and so we can look at the performance of these
methods on the true benchmark, and here we have one run, or we can look at the performance on the surrogate benchmark. Here, evaluating an architecture takes several hours; here, evaluating an architecture takes less than a second; and nevertheless the qualitative results are very similar. You see that random search, for example, is up here, not doing very well, and it's also not doing very well here. For the surrogate we can run much longer than for the true benchmark. The true benchmark we couldn't run any longer because the x scale here goes to 3 × 10^7 seconds; that's 30 million seconds, or 32 million seconds actually, which is more than a GPU year, and this is a single run of these neural architecture search methods if you ran them sequentially. We ran them in a parallelized fashion; otherwise we couldn't have done that at all.

Here, on the surrogate, we can nevertheless ask: what if we ran this 10 times longer? We would get the following performance. Random search still wouldn't go anywhere, and while RE just got better here, it would then probably stay roughly there. We see that BANANAS, Bayesian optimization with neural networks, does really well here, and the same is reflected in the surrogate benchmark. TPE is not doing much better than random search; same in the surrogate benchmark. DE is doing quite well in the end; this is the same as here, and this is just the continuation. So, really, very similar qualitative results. The one difference between the surrogate benchmark and the true benchmark is that the performance is smoother on the surrogate; that is the case because we could only afford one run here, whereas we have many runs here and report averages. So overall, benchmarking on the surrogate looks very similar to the performance on the original DARTS benchmark, and we can do it quite cheaply.

Now, the availability of these different NAS benchmarks, combined
with the paper "NAS evaluation is frustratingly hard" that I mentioned at the beginning, led us to propose these best practices for neural architecture search. We very much encourage people to release their code: not just the trained architectures, but also the code to actually do the training, to see what tricks are being used in the different papers. As we saw in "NAS evaluation is frustratingly hard", this can make up to three percent difference. That also means releasing hyperparameters, releasing seeds, and so on.

Then there is properly comparing methods, which includes using the same search spaces and the same NAS benchmarks. Using the same NAS benchmark doesn't mean using CIFAR-10; it means using the same search space, the same optimization pipeline, and ideally the same hyperparameters, and then of course also the same dataset, the same training, validation, and test splits, and everything that belongs to a NAS benchmark. Then you can do fair apples-to-apples comparisons. If you use tabular or surrogate benchmarks, you get that part for free, you can do these experiments really quickly, and you can get statistical significance on these benchmarks. I'm not advocating using only NAS benchmarks for your evaluations; it sometimes helps to make sure that everything does transfer to the true benchmarks. But for all kinds of unit tests, for intermediate evaluations, and for the evaluations in the paper where we want to get statistical significance, I'm definitely advocating using tabular and surrogate benchmarks a lot.

The third type of best practice concerns reporting important details, such as how much hyperparameter tuning went into a method, how robust it would be when you run it on new datasets, and so on. This checklist is an actual checklist that's available online, and the one suggestion that I would make to reviewers based on this is to actually
de-emphasize the final results table on CIFAR-10 and other datasets, look a little bit less at that, and be aware of the many confounding factors that go into those final performance numbers.

All right, to summarize this first part: I think we've come a long way towards building a scientific community around neural architecture search. We now have benchmarks that can be used really cheaply, also by students who don't have access to any big compute, so this is really something where academia can do a lot now. We have this best-practice checklist. We also organized the first workshop at ICLR on neural architecture search. And one thing that we're working on, and where we have a first version, is a library for neural architecture search that's modular and extensible, implements all kinds of different methods, and really enables clean empirical comparisons without confounding factors.

Great, so much for benchmarks and best practices. I'll now transition to the second part of the talk, on speed-up techniques for neural architecture search. As I mentioned, there will be four different types of techniques here.

The first one is weight inheritance and network morphisms. Network morphisms are operators that change the network structure but not the modeled function. That means that for every input the network yields the same output as before applying the network morphism. So you have a network, and then you put in an additional layer, and say this layer here is just an identity mapping; then the function that's computed by this modified architecture is still the same as before. You can use these network morphisms in your architecture search algorithms as operations to generate new networks that you then don't need to train from scratch, but for which you have pre-trained networks available that you can just fine-tune. This allows doing fast architecture search as follows. You start from some model that has a certain performance, and I'm just
showing here two large layers and two small layers. You apply different network morphisms, for example making one layer wider, adding a layer, and maybe adding a skip connection. Because these are network morphisms, the performance is going to stay exactly the same: 82. Then you do some fine-tuning and the performance changes, and then you simply pick the model that has the best performance and iterate this process. That allows you to do surprisingly fast architecture search through these local operations: something like architecture search in 12 hours on a single GPU.

One issue with this, however, is that network morphisms only ever make your architecture larger, and you're never going to get small architectures with good performance. In order to get those, we had this follow-up work at last year's ICLR on efficient multi-objective architecture search through network morphisms. Here we want to trade off, for example, the network size versus its error. We also had different measures in the paper, such as the number of FLOPs and multiply-adds, the size I already mentioned, and also performance on different benchmarks such as CIFAR-10 and CIFAR-100; the largest experiment we had actually had five objectives. We're interested in a Pareto front, in this case here of two objectives, the number of parameters and the validation error. We initialize this with some default networks, one architecture here, one there, and so on, and get this Pareto front, where all the points above are dominated by a member of the Pareto front. For example, a point here would have a validation error that's worse than this point's and a number of parameters that's worse than this point's, so we don't need to care about it. But the members of the Pareto front are the models we would give to the decision maker and say:
well, these are your options; these architectures are not dominated. What we do here is evolve a population of Pareto-optimal architectures over time. We start with this generation, with this population, and over time we get better performance, that is, lower validation error, and we also get a lower number of parameters. So we push this Pareto front from the first generation over time to the lower left and do proper multi-objective optimization. The resulting algorithm is called LEMONADE, for Lamarckian Evolution for Multi-Objective Neural Architecture Design; the Lamarckian part is the inheriting of the weights from your parents. This algorithm is still cheap: it takes something like a week on eight GPUs, so relatively cheap for doing multi-objective optimization.

All right, so much for the network morphisms. Just to mention that, of course, you can also compare the resulting networks to other mobile-sized networks. Here we have the Pareto front of LEMONADE compared to NASNet, the work on reinforcement learning for neural architecture search by Zoph and Le, and also to different versions of MobileNet. You get better performance, and in particular, compared to NASNet, the search is 35 times faster by using these network morphisms. You also get transfer to ImageNet, so no big surprises there.

All right, so this was weight inheritance and network morphisms. The next speed-up technique I want to talk about is weight sharing and one-shot models, and the most popular approach in that domain is the DARTS algorithm, which stands for Differentiable Architecture Search, by Liu et al., also at ICLR 2019. This works as follows. You have this discrete architecture search problem where you have operations on the edges, and you want to choose between different operations for each of these edges. In particular, you have the option of the red operation, the green, or the blue: for example, a 3×3 conv, a 5×5 conv, max pooling, etc.
And what DARTS does is relax this discrete NAS problem, where you can only choose one thing, by saying: well, I'm just going to use all the different operations on these edges, and I'm going to put weights on them. That yields a mixed operator between nodes i and j that is just a weighted sum of the individual operations, with the weights given by these alphas. The alphas are the architectural parameters, and you can then use gradient-based optimization to optimize them, because the alphas are continuous values. So you solve a bi-level optimization problem: you look for the alpha that gives you the minimal validation loss when you combine it with weights that are optimized for this particular alpha by minimizing the training loss. That is a typical pipeline for hyperparameter optimization and neural architecture search: we look at the validation error of architectures whose weights are trained on the training loss. So that's a natural formulation, a bi-level optimization, to find the architectures that give you the best performance with the weights optimized for them.

That goes from here to there: some of these edges become a lot stronger and some become much weaker. The last step in DARTS is then to discretize each of these edges again and drop the least important parts, so the operations with the lowest alphas just go away and only the strongest alphas remain. Then you have just a single architecture, not this one-shot model anymore.

All right, so that's the DARTS approach, and it is very popular because you can solve this bi-level optimization problem by just doing a gradient step on the alphas, followed by a gradient step on the weights, followed by a gradient step on the alphas and a gradient step on the weights. So it's a very simple algorithm. There are no proofs of convergence or anything.
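The relaxation, the alternating gradient steps, and the final discretization described above can be illustrated with a toy sketch. This is not the DARTS code: the three candidate operations, the scalar weight `w`, the target `y = 2x`, the learning rate, and the finite-difference gradients are all assumptions chosen to keep the example small and dependency-free. The softmax over the alphas follows the DARTS paper rather than anything stated in the talk.

```python
import math

# Toy candidate operations on one edge; stand-ins for conv / pooling / skip.
CANDIDATE_OPS = {
    "skip": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}
OP_NAMES = list(CANDIDATE_OPS)


def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(p * CANDIDATE_OPS[name](x) for p, name in zip(weights, OP_NAMES))


def loss(w, alphas, xs):
    """Mean squared error of the toy model w * mixed_op(x) against y = 2x."""
    return sum((w * mixed_op(x, alphas) - 2.0 * x) ** 2 for x in xs) / len(xs)


def num_grad(f, values, eps=1e-5):
    """Central finite-difference gradient of f at the list `values`."""
    grad = []
    for i in range(len(values)):
        up = values[:i] + [values[i] + eps] + values[i + 1:]
        down = values[:i] + [values[i] - eps] + values[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def alternating_search(train_xs, val_xs, steps=200, lr=0.05):
    """Alternate a step on alphas (validation loss) and a step on w (training loss)."""
    w, alphas = 1.0, [0.0] * len(OP_NAMES)
    for _ in range(steps):
        g_alpha = num_grad(lambda a: loss(w, a, val_xs), alphas)
        alphas = [a - lr * g for a, g in zip(alphas, g_alpha)]
        g_w = num_grad(lambda v: loss(v[0], alphas, train_xs), [w])[0]
        w -= lr * g_w
    return w, alphas


def discretize(alphas):
    """Final step: keep only the operation with the largest alpha."""
    return OP_NAMES[max(range(len(alphas)), key=lambda i: alphas[i])]


w, alphas = alternating_search(train_xs=[1.0, 2.0], val_xs=[1.5])
chosen = discretize(alphas)
```

In a real one-shot model the weights are full network parameters and the gradients come from backpropagation; the sketch only mirrors the control flow of "step on the alphas, step on the weights, then discretize".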
But that actually works in practice to find good architectures and weights, and because of its simplicity, and because the code is available, this has really had a big impact in neural architecture search.

A few things we did with DARTS. We first had an application tuning a vision pipeline for depth from stereo, where we have a left image and a right image and we just want to get this ground-truth image; well, actually, this is not ground truth, this is our result. So we can get very nice results with DARTS, and in particular we reduced the endpoint error on Sintel by 10% relative to the previous state of the art, so quite a nice improvement. This was work together with Thomas Brox's group here in Freiburg, who have been working on depth from stereo for several years and whose students really are experts in this, and we could nevertheless improve the error by a relative 10 percent. So that goes to show that DARTS sometimes really works, and this is very nice.

But DARTS also has really bad failure modes. It often fails horribly, returning only degenerate architectures: only skip connections, only parameter-less architectures. In a paper at this year's ICLR, we showed this behavior consistently for 12 different neural architecture search benchmarks: every single time we got only skip connections or only parameter-less connections. That, of course, is very bad: if you want to think about using this in the inner loop of an AutoML system, you can't have this sort of behavior. So we next looked at why this happens. It turns out that the validation error, here on the left, shrinks over time, so that's not the problem. Even though there's no guarantee that this alternating SGD works for the bi-level optimization, it actually does work. The issue is that the curvature in the space increases, and you're
running into sharp local minima in the architecture space. Here on the right we're plotting the dominant eigenvalue over time, and we see that it goes up as the test loss goes up over time. First the test loss is going down nicely, but then over time it goes up, and this is because we're going into these parts of the space with high curvature. If you recall what DARTS is doing, it does this discretization step, which is a sizeable step in the architecture space, and if you have high curvature, then making a large step in this architecture space of course gives you really bad performance. So that's where the performance loss happens: in this discretization step, because you have high curvature in the local area that you end up in during the optimization.

The solution is to avoid this high curvature, either by early stopping or by adding some amount of regularization, using pretty much any of several different methods. We tried L2 regularization of the inner objective and we tried ScheduledDropPath; both of these work, both avoid the high curvature and lead to much better performance. So here is DARTS on CIFAR-10.
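As an aside, the curvature signal and the early-stopping idea described above can be sketched as follows. The sketch assumes that the curvature is measured as the dominant eigenvalue of a symmetric Hessian (presumably of the validation loss with respect to the architecture parameters), supplied here as a plain matrix. The function names, the averaging window, and the growth threshold are hypothetical and not taken from the talk.

```python
import math


def dominant_eigenvalue(matrix, iterations=200):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Fixed start vector; it must not be orthogonal to the dominant eigenvector.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in mv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in mv]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient


def should_stop_early(eigenvalues, window=3, max_growth=1.5):
    """Hypothetical rule: stop once the recent average eigenvalue has grown too much."""
    if len(eigenvalues) < 2 * window:
        return False
    earlier = sum(eigenvalues[-2 * window:-window]) / window
    recent = sum(eigenvalues[-window:]) / window
    return recent > max_growth * earlier


# Illustrative values only: a small symmetric matrix and a made-up eigenvalue history.
hessian = [[4.0, 1.0], [1.0, 3.0]]
top = dominant_eigenvalue(hessian)
history = [0.2, 0.2, 0.25, 0.4, 0.5, 0.6]
stop = should_stop_early(history)
```

In practice the Hessian would not be formed explicitly; the same power iteration can be run with Hessian-vector products from an automatic differentiation framework.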
On CIFAR-10 it is actually tuned nicely and works, but as soon as you go to CIFAR-100 with the same code, you get very bad performance, and this RobustDARTS gets much better performance. On SVHN it's a similar story, and on PTB also a similar story. Other teams have also run this on ImageNet afterwards; we didn't have the resources for that, but they got very substantial gains in the ImageNet architecture by using our RobustDARTS. There was also a nice follow-up paper that I wanted to mention, because I quite liked it, which proposes another type of regularization that specifically looks at perturbations. It's a perturbation-based regularization that either just adds random noise in the architecture space or takes adversarial steps, and that yielded even nicer performance. So here's a plug for this paper; very nice.

All right, so this was weight sharing and the one-shot model. Next I want to talk about meta-learning, so learning across datasets. In this space Chelsea Finn has done a lot of very nice work and, for example, proposed model-agnostic meta-learning. This has an inner task, where you compute a loss and you update your task weights with SGD, but then there's also an outer loss, where you update your meta-weights: you try to find an initialization of networks such that, if you do a few inner task optimization steps, you get good performance on different tasks. How can you combine that with neural architecture search? One related work, by Kim et al., would be to simply put neural architecture search around this, and that works because there is an architecture in here and you can just do a search over that neural architecture. However, that gets quite expensive. What we did instead is merge the architecture optimization in here and in here: in the outer loss we update the meta-weights and the meta-architecture, and then for the task at hand we actually do fine-tuning
of the architecture together with fine-tuning of the weights. That approach, called MetaNAS, is much cheaper than this prior work and yields clear new state-of-the-art results. So here's a table from the paper. That's the sort of table reviewers want to see, with completely incomparable methods; well, not methods, but completely incomparable pipelines, so different training pipelines and different hyperparameters, where we still do best because we did a lot of engineering for it. Here is the more scientifically honest comparison, with the same training pipeline and the same hyperparameters, for the original Reptile algorithm, then Reptile with neural architecture search wrapped around it, and Reptile with the neural architecture search done together with the individual weights. This approach is also 10 times faster than this approach and yields the best performance in most cases. Especially if you run it for longer, it actually gets the best performance in all cases.

All right, so that's what I wanted to say about meta-learning. The last speed-up technique I want to mention is multi-fidelity optimization. The key idea here is to use cheap approximations of the black box, such that performance on these cheap approximations correlates with performance on the expensive black box. You could have all kinds of different cheap approximations. For example, you could use subsets of the data, or fewer epochs of iterative training algorithms, or you could downsample your images; you could use shorter MCMC chains in Bayesian deep learning, or in deep reinforcement learning you could use fewer trials.

The simplest method you could use here is successive halving. This is shown here on the right, with wall-clock time as the budget. It's a very simple algorithm that basically says: take a bunch of different configurations (architectures or hyperparameter settings, it doesn't really matter), train them for a while, and look at the worst half and the best
half. The worst half you just cut off; the best half you continue running. Then you double the budget and look again: cut off the worst half, let the rest run; double the budget, cut off the worst half, let the rest run. You end up with very good architectures that you have put quite a bit of time into, but you haven't put all this time into the weaker architectures, so you've saved a lot of time. That's a very nice, very simple approach.

There are many more complex methods, including some of our own, combining this with Bayesian optimization and so on. But in the end, what I would advocate is to combine successive halving, or its extension Hyperband, with Bayesian optimization in a very simple manner: you basically use Hyperband or successive halving, but you change one thing, namely which configurations you put in. Where successive halving in the original version looks at random configurations, we sample these with Bayesian optimization, and that gives much better performance for longer runs.

Just one quick showcase of multi-fidelity in action. I showed this result earlier, where we used NAS with DARTS to improve state-of-the-art performance on Sintel by 10 [percent]. Afterwards we applied BOHB to this and got another 10 [percent] reduction on top, and this is at a runtime that's definitely less, by an order of magnitude, than standard black-box methods.

All right, going on to the last two points, which are relatively short. Auto-PyTorch is a method for joint neural architecture search and hyperparameter optimization. It's kind of a follow-up to our first work on Auto-Net, which did joint neural architecture search and hyperparameter optimization back in 2015 and, to the best of my knowledge, was the first automatic deep learning system that won a machine learning competition dataset against human experts. So there it got an area
under the curve of 0.9, whereas the best team of human experts (and there were more than these 10) only got 0.8, and they had something like two months in a Kaggle-style challenge to tackle this dataset. So that was already quite a nice showcase back then. Back then it was relatively slow, with black-box optimization; in the meantime we're using multi-fidelity optimization, and we're using meta-learning to warm-start this process across datasets. This is work in an arXiv paper from earlier this year. This is also open source, and we haven't been pushing it for a long time, but given that we've only been pushing it for about half a year, I think it's finding a big following already.

In brief, this is the design space for Auto-PyTorch Tabular. For tabular data we use a so-called shaped ResNet (and we also look at MLPs), where we basically configure the number of blocks and the number of groups, as you would for residual networks. Residual networks are typically used for convolutional neural networks, but there's nothing holding us back from also using residual connections for tabular data. We do, and it actually helps: it's quite a bit better than the MLP nets. The design space is relatively small: we only have something like 12 architectural choices and 12 hyperparameters, so about a dozen each.

Here's how we search this: we do multi-fidelity meta-learning. So, multi-fidelity optimization with BOHB: as budgets we have the number of SGD epochs, which is 12, 25 or 50, so actually relatively small. We can therefore only get a speed-up of something like four, because the cheapest budget is only four times smaller than the most expensive budget. Nevertheless, we often do get this factor of four. Sometimes only in the beginning; sometimes we would get overtaken in the end by optimizing only on the full data set; but sometimes, even after a long time, BOHB is still quite a
bit faster than black-box BO. The more important contribution here is to also use meta-learning, so to initialize the search with complementary configurations that cover a lot of previous datasets. What we did here is look at 100 different datasets and at which instantiations of Auto-PyTorch would do well on these previous datasets, and make sure that we get a set of complementary configurations using our greedy submodular optimization. We also compared this against a simple portfolio, where we would just put in the 100 configurations that did best on each of the previous datasets. Sometimes the simple portfolio is just as good as our submodularly built portfolio, and both of them are a lot better than BOHB; and sometimes this smarter way of building the portfolio is actually very important compared to the simpler way. In all cases the warm-starting definitely helps a lot: it's orders of magnitude faster than just Bayesian optimization with multi-fidelity.

Here's an evaluation against other AutoML frameworks, and here we do really well compared to Auto-Keras, Auto-sklearn and Hyperopt-sklearn. We are doing further evaluations right now on the OpenML AutoML Benchmark, which is 39 datasets that were collected by different people, and we're also doing an evaluation against AutoGluon, together with the AutoGluon team, in order to make sure that everything is fair in that comparison.

One might wonder whether the same approaches also work for image data, and actually, yes, they do. We evaluated this on NAS-Bench-201, where we have three datasets: CIFAR-10, CIFAR-100 and downsampled ImageNet. So we really only have two datasets that we can meta-train on and one to meta-test on. Even though we only have data from these two previous datasets, the warm-starting really helps a lot. The multi-fidelity optimization doesn't help here, because there are actually quite weak correlations across budgets in NAS-Bench-201, but the warm
starting here gives you a one to two orders of magnitude speed-up over just doing Bayesian optimization or BOHB by themselves. And if you compare this to gradient-based NAS, it is actually something like four times faster to reach 90 percent on CIFAR. So doing this multi-fidelity meta-learning gets you into the realm of being extremely competitive with gradient-based neural architecture search, and here you have the advantage that you can trivially put in a few hyperparameters and optimize them in exactly the same way that we optimize the neural architecture.

All right, that brings me almost to the end. I just have, I think, three more slides about this extended problem formulation of neural ensemble search. The motivation is as follows. Ensembling often improves the performance of neural networks; we all know that. Recent results also show that if you use an ensemble of a neural network, just 10 copies of the neural network, that is, just 10 instantiations of SGD and the different minima they find, that gives you very nice predictive uncertainty and also robustness to distributional shift, and in particular typically better results than all the Bayesian deep learning methods out there. But that seems suboptimal, because diversity among the base learners' predictions is really key for strong ensembles, and achieving diversity through different architectures hasn't been studied so far. All that they did is take a single good architecture and take multiple copies of it. What we propose is to use neural architecture search to search for strong ensembles of diverse architectures. It is not directly encoded that we want them to be diverse, but in order to be complementary and give good ensemble performance they automatically become diverse. Here I visualize that different architectures actually give you diverse predictions: here are five different architectures and their predictions in a
t-SNE embedding, and here are several different architectures found by neural ensemble search, and here, on the same problem, is the deep ensemble, so a single good architecture. We see that the diversity in the predictions here is much smaller than the diversity in the architectures coming from NES. This also results in better performance, because greater diversity helps to reduce errors. Here we look at neural ensemble search as the number of networks evaluated becomes larger and larger, so as the search progresses, with three architectures in the ensemble, five architectures, 10 architectures and 30 architectures, and likewise for deep ensembles based on the architecture found by DARTS and the architecture found by AmoebaNet, which is actually quite a bit larger than the architectures we look at. Nevertheless, we can do quite a bit better than AmoebaNet, especially when we allow many members in our population. Because of the diversity in our ensemble members, we can also directly optimize for robust predictions under data shift. If we do that, then we become very robust to these data shifts, and we get very nice uncertainty estimates, much better than deep ensembles, and also achieve state-of-the-art performance for robust predictions under data shift.

All right, that brings me to the end. The takeaway I would like you to take with you is that neural architecture search is really coming of age. There's a lot of cool stuff happening. There's a scientific community forming around neural architecture search that everybody can become involved in now: there is open source, there are benchmarks and best practices. There's a ton of different ways to speed up neural architecture search, so we don't need huge compute clusters anymore in order to do interesting research. And this also becomes practical now, so Auto-PyTorch
really tackles joint neural architecture search and hyperparameter optimization, and you can also apply this in the extended formulation to find great ensembles. There are links to code, and there's a book on neural architecture search. With that I would like to thank you very much for your attention, and also many thanks to our funding sources and especially to my fantastic team. So thanks a lot.

Great, thank you for the incredible talk. It's a head-spinning amount of work that you put in there, and it was really interesting to see the different aspects you touched on, from NAS work to hyperparameter optimization, to joint NAS and hyperparameter optimization, to meta-learning, to ensembling with NAS. It's super interesting. We had quite a large number of questions, so I guess you didn't get bored during this talk replying to all of those. It was actually quite frenetic, I think, to stay up to speed with what was coming in. People can keep throwing questions into the Q&A; I'm keeping an eye on it, and if they have them I'm going to ask them live. But in the meanwhile we have a block of questions on NAS-Bench-301 that came in from different people, and I have my own, and I want to bring them all into one single question. So let me start with a string of questions from Debadeepta Dey, who's asking: in 301, 50,000 architectures are used for training the predictor, and beyond the 25,000 randomly sampled architectures you also add the pathways of popular NAS algorithms; doesn't this make the predictor biased? I think you answered in the chat, but because it's an interesting question it would be useful to answer it here as well. Then I'm going to ask the follow-up, and then another follow-up.

Yeah, okay, great. Those are
very good questions about NAS-Bench-301. Of course I only talked about it for a few minutes, and we have a lot of experiments in the paper that answer a lot of these questions. It is a good point: we can't cover this space exhaustively, because it's 10 to the, I forgot the exact number, 10 to the 18 in this case only, and another version has 10 to the 23. So it's huge; we can't possibly do this. We're going to miss some, and we're really trying to think about what the best architectures to evaluate are, in order to cover our bases, to cover all the different parts that we need in order to make a good NAS benchmark. A good NAS benchmark, I think, needs to be good in the interesting areas. NAS optimizers try to find good architectures, and I think that's why we should try to make particularly good predictions for the good architectures. For the bad architectures, one could argue, for example for these robustness issues of DARTS, that it's really important to also cover that, and to not just say that's a relatively bad one but that's a really, really bad one, and that is a failure mode that we know. Because of that, and I didn't talk about this, we did put additional training data in there from all kinds of architectures that are derived from other architectures by adding more and more parameterless operations, so more skip connections, more max pooling and so on, in order to make these architectures worse and get better training data from that. We still need to do an ablation study to see how much that helped, but I'm sure it helped for these very bad architectures, to show that they're very bad. I don't think that without these really bad architectures we would see any differences in the rankings of the different NAS optimizers, and I think that's one of the things that people tend to do a lot with
NAS benchmarks is to rank different methods. Also, in terms of the bias: it's biased in the sense that you make good predictions for the architectures that you trained on. It's not biased in the sense that you're predicting the architectures that you trained on to be good. So you're getting better for those, and for others that you haven't trained on, your uncertainty is larger and you might potentially make larger errors. If there are some parts of the space that nobody has ever considered and that we don't have training data on, there the extrapolation error would be larger and the uncertainty is larger. We also have uncertainty estimates.

I see, yeah, so that's interesting. You touched on this in your answer, but there is a follow-up question which was asking
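
As a rough illustration of the successive halving procedure described in the talk (train a set of configurations for a while, cut off the worst half, double the budget, repeat), here is a minimal Python sketch. The names `successive_halving` and `toy_loss`, the `eta` parameter and the toy objective are assumptions made for illustration; they are not code from the talk or from any library.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round, then multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget (lower is better).
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Cut off the worst part; with eta=2 this drops the worst half.
        survivors = ranked[:max(1, len(ranked) // eta)]
        # The remaining configurations get a larger budget in the next round.
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    # Hypothetical stand-in for "train for `budget` epochs and return the validation loss".
    return (config - 0.3) ** 2 + 1.0 / budget


candidates = [i / 10 for i in range(8)]  # eight toy "configurations": 0.0 to 0.7
best = successive_halving(candidates, toy_loss)
```

With eight candidates and `eta=2`, the sketch runs three rounds at budgets 1, 2 and 4, so only the final two survivors are ever evaluated at the largest budget. Replacing the fixed `candidates` list with configurations proposed by a model is the one change the talk attributes to the combination with Bayesian optimization.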

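The talk mentions building a warm-start portfolio of complementary configurations with greedy submodular optimization. A minimal Python sketch of one common greedy scheme follows. It assumes a table `losses[c][d]` holding the loss of configuration `c` on previous dataset `d`, and an objective equal to the sum over datasets of the best loss reached by any portfolio member. Both the objective and all names here are assumptions for illustration and may differ from what Auto-PyTorch actually does.

```python
def greedy_portfolio(losses, size):
    """Greedily pick configurations that most reduce the summed per-dataset best loss."""
    n_configs = len(losses)
    n_datasets = len(losses[0])
    portfolio = []
    # Best loss reached so far on each dataset by any portfolio member.
    best_so_far = [float("inf")] * n_datasets
    for _ in range(min(size, n_configs)):
        def total_if_added(c):
            # Objective value if configuration c joined the current portfolio.
            return sum(min(best_so_far[d], losses[c][d]) for d in range(n_datasets))

        remaining = [c for c in range(n_configs) if c not in portfolio]
        choice = min(remaining, key=total_if_added)
        portfolio.append(choice)
        best_so_far = [min(best_so_far[d], losses[choice][d]) for d in range(n_datasets)]
    return portfolio


# Hypothetical losses of four configurations on three previous datasets.
toy_losses = [
    [0.10, 0.90, 0.90],  # config 0: strong on dataset 0 only
    [0.20, 0.20, 0.80],  # config 1: decent on datasets 0 and 1
    [0.90, 0.85, 0.10],  # config 2: strong on dataset 2 only
    [0.15, 0.25, 0.85],  # config 3: similar to config 1
]
portfolio = greedy_portfolio(toy_losses, size=2)
```

In this toy table the second pick is the configuration that covers the dataset the first pick handles poorly, rather than the runner-up that behaves like the first pick. That is the sense in which a greedily built portfolio ends up complementary.
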
Original Description

Neural Architecture Search (NAS) is a very promising but still young field. I will start this talk by discussing various works aiming to build a scientific community around NAS, including benchmarks, best practices, and open source frameworks. Then, I will discuss several exciting directions for the field: (1) a broad range of possible speedup techniques for NAS; (2) joint NAS + hyperparameter optimization in Auto-PyTorch to allow off-the-shelf AutoML; and (3) the extended problem definition of neural ensemble search (NES) that searches for a set of complementary architectures rather than a single one as in NAS. Slides for this talk are available: https://www.automl.org/talks Frank Hutter is a Full Professor for Machine Learning at the Computer Science Department of the University of Freiburg (Germany), as well as Chief Expert AutoML at the Bosch Center for Artificial Intelligence. Frank holds a PhD from the University of British Columbia (2009) and a MSc from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning. He is the recipient of a 2013 Emmy Noether Fellowship, a 2016 ERC Starting Grant, a 2018 Google Faculty Research Award, and a 2020 ERC PoC Award. He is also a Fellow of ELLIS and Program Chair at ECML 2020. In the field of AutoML, Frank co-founded the ICML workshop series on AutoML in 2014 and has co-organized it every year since, co-authored the prominent AutoML tools Auto-WEKA and Auto-sklearn, won the first two AutoML challenges with his team, co-authored the first book on AutoML, worked extensively on efficient hyperparameter optimization and neural architecture search, and gave a NeurIPS 2018 tutorial with over 3000 attendees. Learn more about the 2020-2021 Directions in ML: AutoML and Automating Algorithms virtual speaker series: https://aka.ms/d

The video discusses Neural Architecture Search and its recent advancements, including benchmarks, best practices, and open-source frameworks, as well as various techniques and tools used in NAS. The speaker highlights the importance of evaluating and comparing different NAS methods and applying meta learning and multifidelity optimization to NAS.

Key Takeaways
  1. Search for ensembles of complementary architectures with neural ensemble search (NES)
  2. Compare NES ensembles of up to 30 architectures against deep ensembles built from copies of a single good architecture
  3. Optimize directly for robust predictions under data shift with neural ensemble search
  4. Apply joint neural architecture search and hyperparameter optimization with Auto-PyTorch
  5. Use random sampling to cover a large space of architectures
  6. Add the pathways of popular NAS algorithms to the predictor's training data so it predicts well where optimizers actually search
  7. Add training data from deliberately bad architectures so the predictor recognizes them as bad
  8. Use uncertainty estimates to flag regions where extrapolation error is likely to be larger
💡 Neural Architecture Search is coming of age with open source, benchmarks, and best practices, and can be applied to various machine learning tasks to improve performance and efficiency.
