Advancing Autonomous Vehicle Development Using Distributed Deep Learning with Adrien Gaidon -...
Skills:
ML Maths Basics 90%, Distributed Systems 80%, CV Basics 70%, AI Systems Design 60%, Supervised Learning 50%
Key Takeaways
The episode covers how Toyota Research Institute (TRI) built distributed deep learning infrastructure in the cloud for autonomous driving research. Adrien Gaidon describes the move from on-premises GPUs to AWS, the use of a BeeGFS-based distributed file system to avoid GPU starvation, the switch to PyTorch, and a data loader bottleneck the team had to work around before multi-node training of small, high-resolution networks scaled well.
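One technique from the conversation lends itself to a short illustration: the data loader bottleneck, where starting fresh worker processes for every (very short) epoch became the slow part, and the fix of keeping "always-warm" queues that produce indefinitely. The sketch below is a minimal, hypothetical Python version of that idea, not TRI's implementation. It assumes a Unix-like system where the `fork` start method is available; the names `WarmQueueLoader` and `_produce_forever`, the sharding scheme, and the timeout value are all invented for illustration.

```python
import itertools
import multiprocessing as mp


def _produce_forever(dataset, batch_size, worker_id, num_workers, out_queue):
    """Runs inside a worker process: cycle over this worker's shard forever."""
    shard = dataset[worker_id::num_workers]  # simple strided sharding (assumption)
    batch = []
    for sample in itertools.cycle(shard):
        batch.append(sample)
        if len(batch) == batch_size:
            # put() blocks while the bounded queue is full, so the producer
            # stays "warm" without running arbitrarily far ahead.
            out_queue.put((worker_id, batch))
            batch = []


class WarmQueueLoader:
    """Hypothetical loader whose worker processes are started once and reused."""

    def __init__(self, dataset, batch_size, num_workers=2, prefetch=8):
        ctx = mp.get_context("fork")  # assumption: Unix-like OS
        self._queue = ctx.Queue(maxsize=prefetch)
        self.workers = [
            ctx.Process(
                target=_produce_forever,
                args=(dataset, batch_size, i, num_workers, self._queue),
                daemon=True,
            )
            for i in range(num_workers)
        ]
        for worker in self.workers:
            worker.start()  # process creation cost is paid once, not per epoch

    def epoch(self, num_batches, timeout=30.0):
        """Consume a fixed number of batches; the producers never stop."""
        for _ in range(num_batches):
            # The timeout is a guard: if the workers have died or deadlocked,
            # get() raises queue.Empty instead of hanging forever.
            yield self._queue.get(timeout=timeout)

    def close(self):
        """Stop the infinite producers explicitly."""
        for worker in self.workers:
            worker.terminate()
        for worker in self.workers:
            worker.join()
        self._queue.close()
```

A training loop would call `loader.epoch(n)` once per epoch and `loader.close()` at the end. Because the producers never terminate on their own, batch order across workers is nondeterministic and shutdown has to be handled explicitly, which is in the spirit of the "playing with fire" caveat about race conditions raised in the transcript.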
Full Transcript
[Music] Hello and welcome to another episode of TWIML Talk, the podcast where I interview interesting people doing interesting things in machine learning and artificial intelligence. I'm your host, Sam Charrington. [Music]

If you missed our last show, and if you did you definitely want to go check it out because it was a great conversation, you missed the first of the many exciting updates we have for you this summer. Last time we announced TWIML's third birthday and our five millionth download, which happened right around the same time. To help us celebrate this occasion and to request your commemorative TWIML birthday sticker, visit twimlai.com/birthday3.

This week we're continuing the action by kicking off volume 2 of our AI Platforms series. You'll recall that last fall we brought you AI Platforms volume 1, featuring conversations with platform builders from Facebook, Airbnb, LinkedIn, OpenAI, Shell, and Comcast. This series turned out to be one of our most popular series of shows ever, and over 1,000 of you downloaded our first ebook on machine learning platforms, Kubernetes for Machine Learning, Deep Learning, and AI. Well, we'll be back at it over the next few weeks, sharing more experiences from teams working to scale and industrialize data science and machine learning at their companies, and we've got even more in store on this topic, so if it's an area you're interested in, be sure to stay tuned. You can follow along with the series at twimlai.com/aiplatforms2 and by following us on Twitter at @samcharrington and @twimlai.

Before we dive in, I'd like to send a giant thanks to our friends over at SigOpt. They've been huge supporters of my work in this area, and I'm really excited to have them as a sponsor of this series of shows on machine learning and AI platforms. If you don't know SigOpt, I spoke with their CEO Scott Clark back on show number 50. Their software is used by enterprise teams to standardize and scale machine learning experimentation and
optimization across any combination of modeling frameworks, libraries, computing infrastructure, and environment. Teams like Two Sigma, who we'll hear from later in this series, rely on SigOpt software to realize better modeling results much faster than previously possible. Of course, to fully grasp the potential of a tool like SigOpt, it's best to try it yourself. That's why SigOpt is offering you, the TWIML community, an exclusive opportunity to try their product on some of your toughest modeling problems for free. To take advantage of this offer, visit twimlai.com/sigopt.

All right everyone, I am here with Adrien Gaidon. Adrien is a machine learning lead at Toyota Research Institute. Adrien, welcome to This Week in Machine Learning and AI.

I'm super happy to be here. Thank you for inviting me.

So we are here in Las Vegas at the AWS re:Invent conference, where you gave a talk, and we will dig into the topic of your talk, which was about advancing autonomous vehicle development using distributed deep learning. But before we do that, I'd like to hear a little bit about your background. How did you get into machine learning?

Yeah, absolutely. So I've been doing deep learning and machine learning for more than 10 years now. I was really interested initially in human learning, human psychology, but I also really liked computers and building stuff, so machine learning and AI was kind of a natural match made in heaven. So I started doing a double major in computer science and math, and at the same time looking into AI more. Then I did an internship at INRIA, in the very well-known computer vision group of Cordelia Schmid, where I participated in some competitions like the ancestors of ImageNet, so the Pascal VOC visual object challenges, which I won in 2008. Then I continued with a PhD that was with the Microsoft Research-INRIA joint center in Paris, where I was working on video understanding, more specifically human action recognition.

Okay.

And after that I joined XRCE, Xerox Research Centre
Europe. I know you actually interviewed a friend of mine and my former boss, Naila Murray, of the computer vision team there.

Okay.

Great episode, by the way. So I joined them as a research scientist and worked on video analysis in general, and also started a research effort there at the same time deep learning emerged. So that's when I really transitioned from principled convex optimization and kernel methods to the alchemy of deep learning, and never looked back since.

And are we saying alchemy just in celebration of the fact that NeurIPS is next week?

Yeah. And from then on I did a lot of work on tracking and especially domain adaptation, and because we didn't have a lot of data, I had to make my own. So I started looking into simulation a lot, game engines to generate data, and I did a couple of CVPR papers on the topic that were noticed by the industry at large, notably autonomous driving, which at the time was really getting into simulation. That's how I joined TRI, because they're really dedicated to simulation at very, very large scale. These problems only happen at a large scale. If you have just small needs, like robotaxis, et cetera, you can just label data. But at the very, very large scale, and, you know, Toyota is the number one carmaker in the world, a hundred million cars on the road today, you need to think about these problems. That's what gets me excited as a machine learning person, because it's all about generalization. When you think about worldwide, like Japan, Australia, the US, everywhere, it has to work. That's what's really cool, because you both have to invent new things in the research but you also have to make it work, and you get to touch on all these things. So that's how I got into it, and I got really hooked on the robotics space in general and autonomous driving in particular, because it's such a great application for machine learning.

Okay. Before we get too deep into what you spoke about here, what's the focus at TRI in general, and then your focus there?

Yeah, absolutely.
So TRI was created almost three years ago now. It's basically a separate company that was created by Toyota with 1 billion dollars of funding initially, and we got 2.8 billion dollars more and spun off a new company called TRI-AD, Advanced Development, recently. We're a robotics company, and our focus is really about autonomous driving and home robots, and we also do some materials science research for designing better batteries and things like this. But most of our effort is really in driving, and the team I lead in machine learning is really about research for autonomous driving. We also do things for robotics a little bit, because from our perspective a car is a robot. It's a sensorimotor loop: essentially you have perception, prediction, planning, decision, action, and these feedback loops from the real world, which is what's exciting in a physical system. TRI really has a mission to improve quality of life in general. I know it sounds very Silicon Valley, but it's actually true, because we have already hundreds of millions of users. One project is called Guardian, which is to make a car that can't crash, so it's the ultimate driver assistance system. Another one is Chauffeur, which is the real autonomous car, not the ones we're talking about today but the long term, the long game, which is real autonomy: these cars can drive themselves completely autonomously, everywhere, all the time, which obviously is not going to happen tomorrow.

We talk a lot about that one today.

Yeah, so it depends on the product, right, what people think, and here Toyota is thinking really about the long-term thing. The cool thing is that these two, Guardian and Chauffeur, in terms of the machine learning side of things, have a huge intersection. You still need semantic segmentation, object detection, tracking. A lot of the algorithms that we're talking about in computer vision are actually completely in common, almost completely in
common. So from the perspective of my research, I don't make a difference between these products, because most of the research I do is very well aligned with those purposes. And then we also do home robotics, so we have really, really good teams there, ex-NASA, JPL, et cetera, where they work on mobile manipulation platforms to assist the elderly, for home care and these kinds of things.

Oh, and does Toyota have products in market in the home robotics space?

So actually Toyota manufactures a robot that's called the HSR, the Human Support Robot.

Okay.

That was, I think, the official platform for the RoboCup recently, so Toyota is really big in RoboCup.

I was thinking RoboCop.

Well yeah, RoboCup with a "u", pardon my French. So the goal is basically how do we transform Toyota into a robotics company. They have this amazing industrial robotics side, of course, but really, what is the future of cars? It's going to be robo-cars, but it's also going to be robots beyond cars. And also how they become a software company, and actually a machine learning company. That's really what's exciting, because it's at this scale of a company that they want to change. And CEO Akio Toyoda was really talking about, you know, the song that Andy Jassy was talking about in his keynote, that Clash song. If we don't do it, we're not good; if we do it, we have to do it right. That's really exciting.

Tell me a little bit about the key message of your talk here at re:Invent.

Yeah, so here what we wanted to talk about was how you can do a distributed deep learning infrastructure in the cloud that actually scales really well and is highly performant. When I took over the team a bit more than a year and a half ago, at TRI we're really well funded, as I mentioned, so I had dollar signs in my eyes and I was like, all right, I'm going to buy so many GPUs, I'm going to splurge. We had a server room, we had everything there, and
then actually, even if you have the money, even if you have the means at your disposal, it's still fairly slow to ramp up. And I had Mike Garrison, who was doing the talk with me, who is our lead of infrastructure engineering, telling me, hey, what about Amazon? I was like, they have K80s, you know, they have old GPUs, it's slow, et cetera. But keeping an open mind, we tried a couple of things, and we got in touch with the AWS folks, and we did a lot of infrastructure work to really make it work, first single node, then multi-node, and using PyTorch too. We're a PyTorch shop. We used to be an anything shop and then a TensorFlow shop, and we really switched to PyTorch full-time a year ago or something. The talk was really about this kind of journey through which we went from, yeah, you have on-prem compute and you can do stuff, to really, really large-scale distributed deep learning in the cloud that's efficient. Efficiency is really the key here. In driving in particular, there's one thing that is very different from, let's say, the normal machine learning that you would see at NIPS or CVPR, which is that we care about small networks that operate at a high resolution. There are two reasons for that. One is that they need to be small, because even if you could compress them, quantize them, and all these kinds of things that we know can make them more efficient, you still need a smaller model initially to fit in the computational budget that we have in the car. It's safety-critical, so you have to have really efficient models. The second thing is you need very high resolution because [unclear]. I was talking in my talk about these weird equations, talking about lean deep learning, so you want fast turnaround time and this kind of stuff. We want to create some kind of Toyota Production System of deep learning, so that we can iterate really quickly from idea to model to validation and go back to the
drawing board, because it's research. This idea of very high resolution is one of the constraints that we have to deal with, because we want to predict things from far. When you read the California driver handbook, it tells you that you have to look far in the distance to look far into the future, and so resolution is kind of a key thing.

You're talking literally camera resolution?

Camera resolution, yep, and specifically for the computer vision models that we're using. That means that the compute workload is kind of different, because you have small models and very high resolution. So in terms of data flow, operations, the time you spend in these matrix multiplies and all these kinds of things, it's very different.

So you can't downsample or crop everything to 224?

No, sadly no. And so if you use the standard tools like DataParallel or DistributedDataParallel from PyTorch, which are amazing at ImageNet and this kind of stuff, they didn't scale for us. So I had to rewrite a couple of things, and that's what we talked about.

Okay, so let's walk through that journey. You mentioned that one of the first steps was that you kind of had to build up the infrastructure at a node level from scratch. Was that where it started, or were there steps before?

Yeah, no, so that's the cool thing about TRI: we're fairly young and we're small, and so there's no technical debt, right, because there was nothing when I started. That was super cool, because as a research scientist I was mostly, you know, use this, use that, all right, it's there, it's this way, use that, that's the file system, it's there, okay. And here it was really just, the sky's the limit, what should you do? So we really got the opportunity to use the best and partner with the best. We work directly with a lot of different partners, and
then we really created everything from scratch. First single node, because it was really easy, and we did all kinds of tricks. Now you have some machines that are monster machines, you know, 700 gigs of RAM, and so you can scale quite well, but up to a point. That's when we started to switch to using a distributed file system, so we did the BeeGFS-based distributed file system.

Before we leave that initial node, I thought I heard you say earlier that it was difficult and you had to go through a lot of steps to get that first node up and running, but you just said it was really easy. I'm assuming that means relative to fully distributed. I'm kind of curious about the pain points that you had to go through just to get this up and running, and also the extent to which they're still pain points, or are there other things that have kind of wiped that all away?

So yeah, okay, the first one is disk space, because of the scale of the data we have. For a lot of, let's say, debug experiments or research experiments on small data sets, you can fit them in RAM, and you should do that, because that's just the best bang for your buck. But when you have large data sets, that becomes much more complicated. So we first switched from RAM disks to EBS volumes, or EFS, we tried everything. But for this kind of high-resolution small network, to not be network bound, to not have this GPU starvation problem where your average utilization of the GPU is like 15% or something ridiculous, and these machines are expensive, so you want to bump that to 90% or above, that's what we had to do. Actually, even before we started really doing distributed computations, using a distributed file system enabled us to download the data once and not every time you set up a machine. Because if you auto-provision machines and you
have to download data from S3 every time you start a machine, then you're saying, oh, I have this idea, let's wait two hours before I can just press play.

Right, right.

So that was a big pain point for research, to have this fast turnaround time. The distributed file system was something that was very useful at a single-node level, and of course it scaled to multi-node, so we hit two birds with one stone.

And what did you end up with for that?

We used BeeGFS as the file system, and we're going to look at Lustre, these announcements that were made recently, that's very interesting. Another pain point that we had was the...

Sorry, the BeeGFS, you're managing it yourself? You're just deploying it on the node, in your AMI or whatever?

Yeah, so we have a set of instances that serve that file system, which is then mounted on these instances, and we have some infrastructure as code to just spin this up, all configured and ready. There's something around containers too. We were baking a lot of stuff into the AMI, into the machines themselves, so that when they're started you're just there directly, because not everybody was familiar with Docker. But we picked up Docker too, because there are obvious reproducibility benefits. When you hack a lot of things quickly at the beginning of a research project, having this kind of Dockerfile where people can reproduce your environments, and not just your experiments, that's actually extremely helpful for collaboration in the team.

So that also ties back to that agility and being able to move quickly.

Exactly.

So, booting up a whole machine...

Yes, yes. And our IT folks were so happy, because it's not like, "this doesn't work", yeah, because you apt-get installed something that wrecked the system. So DevOps, really embracing DevOps, even for researchers, actually was quite powerful, because, you know, the mastery of the tools is really important to empower you to do
research beyond, you know, just a Jupyter notebook, let's say. It's an awesome tool, but if you want to go beyond, you need to master other tools, and that's what we've been doing. It's a journey through engineering craftsmanship as much as deep learning research.

Mm-hmm. When you talk about applying DevOps in this world, to what degree in your experience does it apply directly, or are there gaps, or it only takes you so far and you have to modify the way you think about it? And I realize that I'm saying that as if DevOps is this well-defined thing.

Yeah, I think it's a good question. Let's say there are two extremes, right? There's the extreme where you do everything yourself, and there's the extreme where you just blindly use something that someone does for you. In that space, you know, all the grad students in the world in machine learning spend a considerable amount of time configuring their environments. That's a skill we develop during our PhDs. And then with Docker and these kinds of things, you don't become an IT guy or a DevOps guy, but you just learn from the best there. They do some of the things that go around security, and that's really important for the data that we have, which I don't have an inkling about. But they expose us to AWS services, they expose us to some Docker stuff. So I'm not an AWS expert, I'm not a Docker expert, I'm not a Kubernetes expert, but knowing a little bit of that empowers you to try bolder research ideas and actually debug. When you care about the performance of your model not just in terms of its accuracy but its speed, having this knowledge enables you to do research much faster, actually, which is a little bit counterintuitive. But again, when you're beyond MNIST, that's what it takes.

Right, right. You started out doing a lot of this yourself, yourself meaning within your team of research scientists. It sounds like you're
presenting with an infrastructure person, so now you've got kind of, you know, professional support.

Yeah, we do. We work really tightly with them. Also, my team is probably like 30% engineers.

Okay.

I think it's really good for research teams to have this mix of scientists and engineers, because again, as I said, the lines are blurred at large-scale research and you need these two skills. And obviously also all the DevOps and infrastructure engineering teams. The collaborative spirit at TRI is really, really good. Because we're small, we're very tightly knit, and because there was no technical debt, we're building everything together. Really nothing that infrastructure engineering built was done in isolation without consulting us. That's why we have a system that works really smoothly, because all the concerns were shared and addressed at the same time, from all the pieces of the puzzle. So it's really nice to have that kick-ass modern infrastructure built around you, somehow, and with you.

Yeah. And that infrastructure engineering team and support, was that always there, or did that come at a certain point after you'd built some things?

Yeah, it's a fairly recent addition.

Okay.

We started kind of organically, and then you had some people that were there, and it started to be formalized only recently, as we scaled up and that need became much more obvious.

And is that infrastructure team primarily responsible for, like, what's the line? How far up the stack do they go? Are they worrying about tools and frameworks and software platforms, or is it primarily infrastructure, you know, network and disk and file systems and connections to the cloud and all of that stuff?

So I would say the latter. I think, you know, the lines are blurry, but you need this single responsibility principle. You know, that applies well for software,
it also applies for organizations. You know, there's Conway's law, which says that a software organization writes software that is architected in a way that reflects the organization, right? So I think it's really good if you have clear responsibilities, but also the lines are a bit blurred, because that means that you get a system that is flexible. But you need these kinds of responsibilities too, so there's some separation. My team is in machine learning research, and we are the ones that made the decision to switch to PyTorch, for instance. The way we did that is that I reimplemented YOLO myself a year and a half ago in all the different deep learning frameworks. Object detection is really nice because it's a structured prediction problem that's shoehorned into a classification one, and so it breaks the APIs that most frameworks support from the get-go. So if you use that, you're stretching a little bit the capabilities of the framework in terms of the APIs.

Uh-huh.

And so reimplementing YOLO in all these different frameworks made it clear that as a research scientist I value flexibility, and PyTorch had the flexibility. Chainer is also very good, there are other alternatives, but debugging, et cetera. So at certain levels, that's why I said research scientists were making engineering decisions, because choosing PyTorch is something that we wanted to decide as a research scientist group, and also for the reason of the particular research we're doing. For instance, we have a recent paper called SuperDepth, which is a paper about predicting the depth of a scene from a single image. It's a self-supervised method that uses geometry as supervision instead of using labels, because for that you can't label. And this is again another example where we use super-resolution, so this idea of high resolution is
actually important also for accuracy. If you super-resolve images, this helps you predict better depth maps. That was one of the key findings that we made in the paper. All of that is also enabled by the choices we made on the software side, PyTorch and all these kinds of things, and also the community that there is around it. That enables us to really move fast and stand on the shoulders of giants.

So I talk to different organizations that have differing opinions on how opinionated to be for their organizations. It sounds like you're of the mind to kind of standardize, in this case on PyTorch, at TRI, as opposed to other places where, you know, we're going to build a kind of framework, a platform, and it's going to be able to support whatever the research scientist or engineer wants to use. Talk me through a little bit of the way you think about that.

Oh yeah, I think about it in almost mathematical terms. It's the bias-variance trade-off.

Mm.

If you have a small bias and a high variance, and you're really favoring exploration, for this kind of stuff you need a lot of people that are willing to support you, right? So if you say, oh yeah, Slurm and Kubernetes and PyTorch and TensorFlow and everything, and a little framework that some random guy made in his own free time, then first of all, what is actually your business? Is it making that infrastructure? For us it's not. For us it's making awesome robots, awesome machine learning. So I clearly err more on the bias side. But, you know, it's a little bit of an exploration-exploitation trade-off, where at first you have high variance and for a little while you go wild, you explore.

You implement YOLO in every framework.

Exactly, something like this, right? But then at some point you make your decision, right, because that's not sustainable. And you want to
move fast and clearly identify a direction. Once you have identified that direction, and you never have enough data to prove that you're right, at some point you have to express leadership and just go with it. And then you go for it, and of course you keep an open mind, because then there's the next phase of exploration, because you're right for only a short amount of time in this field of deep learning.

Did we take a diversion on the path that you laid out in the presentation?

Oh yeah, kind of.

We took a turn at step one.

We got beautifully sidetracked, a wonderful direction. So yeah, we were single node, everything in RAM, then tried existing storage solutions, then moved to more of a distributed file system. Once we had this, because it's an in-memory distributed file system, we didn't have GPU starvation anymore, but then our training was slow because we were limited to a single machine. Then the P3 instances happened and we started to use V100 GPUs, much better. That required also tuning the storage again to avoid GPU starvation. Then we again wanted to go multi-node, and with the distributed file system at least the data was easily accessible from all the different nodes. That's when we started to hit the limitations of distributed PyTorch, which was very recent at the time.

Before we jump to distributed, I'm curious about, you know, you've got some, I guess, quote-unquote hyperparameters, like virtual CPUs or the machine configuration parameters. Are there kind of universal rules of thumb for that kind of thing that you figured out, or do you experiment with it a lot? Is it job dependent? Do you overly focus on economic optimization? How do I work through all this?

So we optimize for time. We don't have to...

Okay, that one was easy.

Yeah, that was easy. So that's more, again, the job of the infrastructure engineering people.

So does that mean you
just get the biggest one with the best GPU and you've got it?

Exactly, yes, that's exactly it. And also because of our workloads, it was obvious that that was the only thing to do.

Okay, so go big or go home.

That's basically what we did, yeah. So for a single machine we just tried to scale as much as possible on that single machine, and that meant these big, big instances, [unclear], and soon it will be the new ones that were announced, which are even bigger. That's actually feedback that we directly gave AWS. It's quite cool to see that we gave them feedback a year ago and then the keynote was, oh, we heard you, we did this. So the biggest instances that they made, that's something that we had asked for, and a couple of other cool things. But you're still limited on a single machine.

And so when you were kind of topping out at a single machine, how long were your jobs running for?

At this stage it was more on the order of weeks.

And what kind of job is this?

So the main one, the most computationally expensive one, is semantic segmentation.

Okay.

Because again, it's high resolution, it's dense prediction, and so that was the most computationally expensive job. Another type of job that we do that is also very expensive is imitation learning. We do a lot of research on end-to-end driving. The main reason is not so much that we believe that it's all you need for driving, obviously not, but we get a lot of data from actual cars, and so we get a lot of demonstrations. So there's this really interesting research question that we're working on, which is how much value can you derive from these demonstrations? This is a form of supervision on driving that you want to distill down into your models. So we do a lot of research there. "Use all the data" is really the question that animates us: how can we use all the data? And because we can't label everything, we're not
just going the active learning route and doing the same thing that everybody else is doing. Obviously we're doing that, but that's not the open research challenge. Everybody knows active learning is a good thing to do when you label things. We're really interested in self-supervised learning: how can we really use all the data by leveraging geometry, for instance? How do we use demonstrations at scale? Those are the workloads, motivated by the research direction we're going in, that were the most intensive ones, and on a single machine these are things that easily take weeks.

Okay, so then that necessitated jumping over to distributed training?

Yes, absolutely.

Did you do that after the decision to go with PyTorch, or did you have to figure that out twice?

No, we had made... so, we're in Silicon Valley, and it's really nice that there's a lot of dense communication, and people are not afraid to share their plans and where they are going. So we knew to some extent where things were going, and we knew where we wanted to go. We also were open about this with different partners, and so we knew that when we were going to hit the distributed wall, we would be ready for it. All those factors were factored in at decision time, the first time, so we didn't have to revisit it later, thankfully.

But it sounds like you did have to wait on some PyTorch feature support to do distributed the way you wanted.

Yes, absolutely. Initially we were starting to be a little bit afraid that we would have to either fork or do some really big upstream contribution to PyTorch. I mean, as I was mentioning, it's kind of a niche application in the deep learning area. High-resolution semantic segmentation, for instance, is not something that a lot of people are pursuing. So we were starting to wonder if there was another way than to hit that low in the
stack, right? And we did fairly intense debugging and performance profiling, which is not easy in the cloud because everything is in the ether. What we found, and it was an interesting end to the debugging journey for performance optimization, was that in the distributed setting, when you had many machines and a very efficient distributed file system, our epochs, our passes over the entire training data, became really fast, because we had these huge batch sizes and everything was flowing really well to the GPUs. The GPUs were crunching really quickly, and what happened is that there were huge downtimes, waits; there was a bottleneck somewhere. It turns out that the bottleneck, which was hard to find, was in the data loaders. You know how you have multiple workers that prefetch the data for you in parallel to feed the super hungry GPUs really quickly? In Python, because you have this global interpreter lock, you have to use processes, not threads, to do that. So the stock PyTorch data loader starts workers, which means it starts multiple worker processes, and forking, creating a process, is much heavier than creating a thread. When you do this very quickly in a distributed setting, that actually became the bottleneck. So we had to change this data flow, the way we were doing the prefetching and those queues, by having always-warm queues that were infinitely producing on one hand and infinitely consuming on the other. We're playing with fire a little bit there, because we're creating race conditions, and so deadlocks can happen. But this doesn't sound like a plug-in or something. No, this was totally not a plug-in; this was on top. We were using stock PyTorch except for the data loaders, where we
changed the data loaders to something else, and that's what I meant by these warm, infinite producers and these race conditions. Recently we've been playing more and more with Horovod, that awesome open-source library made by Uber. It works with TensorFlow and PyTorch; it started with TensorFlow and now it supports PyTorch too. It provides a great MPI-like interface. It's a little bit less efficient for our niche application, but we have other applications, so the flexibility that you get from Horovod might be worth the price in performance. So we're considering moving more and more stuff to Horovod. It sounds like you invested a little bit in tweaking PyTorch to make it work, but it kind of caught up, and now you've got some solutions that work for you and you are able to do distributed training. Were you done? Did you pop the champagne? It's interesting. In one way, yes, because there were a lot of internal questions. Like I said, TRI is a robotics company, and one thing you have to understand is that in autonomous driving, roboticists do things very differently from the really hardcore deep learning crowd. They're used to lidar sensors and clustering methods, DARPA Challenge stuff that worked awesomely well and has much stronger safety guarantees than what we do in deep learning. So they're not necessarily very experienced in the deep learning way, and training for weeks to develop an algorithm sounds insane to them. So doing this distributed training and showing them internally that, hey, you can do things really quickly in the cloud at scale, and you can tweak your models and develop your algorithms almost as quickly as if you were not doing deep learning, that was kind of a champagne-popping moment, where it was: oh, that's super cool, now we're actually going to
run with it. Of course we're not done on the research side. Now we can basically study what happens when you do self-supervised learning on a lot of videos, and what happens if you do imitation learning on really a lot of demonstrations. Actually, we have a paper that we're going to push to arXiv soon where we really push the boundaries of imitation learning and show that you can go quite far with deeper models and more data. It's kind of a prototypical deep learning story: more data, deeper models, and that works really well. It's only thanks to the infrastructure we had that an awesome intern, Felipe, could do these experiments. So we're not there yet, but we're definitely enjoying the fruits of our labor. Nice, nice. So the semantic segmentation that was taking weeks before you made it over to distributed, what does it take now? Typically we can do things in under two hours now. Oh, wow. Yeah, it's really fast. What does that require in terms of cluster size? We typically run jobs on, I think right now, about eight machines, so about 64 GPUs, for single networks. We find that we don't need to go beyond that at this stage, so we don't train a single network on 256 GPUs or something. Most people that do that, at least publicly, do it just to beat speed records on the internet, which is nice, but it's not really what we were going for. So for the jobs we do, let's say between four and eight machines, so 32 to 64 GPUs, provides us with a small turnaround time and good iteration speed for our research. Is it that the complexity involved in going beyond eight, to some multiple of eight, is overburdensome? Is it that the value of going from two hours to, you know, 30 minutes isn't there? There are some infrastructure problems around limitation of supply. We often joke at TRI that we have infinite GPUs because they're in the cloud, but
in reality they're not necessarily there, because of availability zones, et cetera, so some things that I don't fully understand. The other thing is that at some point you start to hit algorithmic difficulties. For instance, a year ago people were convinced you couldn't do large-batch SGD because you would have generalization performance issues, and that's when Facebook published their work saying: actually, no, it's just a numerical optimization problem; you just have to use the linear scaling rule and this warmup, you have to twiddle a little bit, and then yes, it generalizes the same way. That's when you had this explosion of large-batch training methods. But still, there's a limit to that, depending on your data sets, on your learning algorithm, and on the data at hand, so on the particular generalization gap that you have to overcome. There's a good batch size, and there are limits to how big your batch can be. OK. Do you have a single cluster running at a time, or do you spin up multiple clusters and run multiple training jobs constantly, all the time? And if that's the case, or even if not, does that level of change drive you to use something like Kubernetes or some other kind of infrastructure? You've mentioned Kubernetes and Slurm and some other things. Yeah, so right now the way we do it is that clusters are provisioned on demand by the researchers. We tend to have a couple of clusters per researcher, per project. That's really nice also because it helps a lot with experiment management. Babysitting experiments is a full-time job when you get closer to a deadline, and having these separate clusters for the separate workflows of different people helps with just the cognitive load of what's where, et cetera. And again, my team is fairly small; we're like 12 or 13
people. OK. So we do very large experiments, but we don't necessarily do many, many different experiments; we probably have four or five projects at the same time. OK, so no need for complex scheduling or monitoring or queuing or these kinds of things? It's going to get there, and we know it, so that's why we're preparing for that. I have more of an HPC background, so that's why I favor Slurm a bit, also because when we started having this discussion Kubernetes was not supporting GPUs. Now they do, and the only thing I'm a little bit afraid of is it adding levels of indirection, because, again, we care about speed and performance. The story in this talk that I was mentioning before, we could do that because (a) we were working tightly with the infrastructure engineering team, with AWS, with NVIDIA, actually talking to them directly, and (b) we actually knew what was going on under the hood, so we could pop the hood, look under it, and say: oh yeah, this is wrong, or this smells funny, can you check this? So if we add too many layers of indirection, which Kubernetes might be, I don't know, I'm not sure, I'm a little bit afraid that we lose control and we lose interpretability, in a sense, and our models are already hard to interpret. You mentioned experiment management in passing. Have you built any higher-level tooling or infrastructure to help research scientists do that, or is there something that you're using off the shelf, or is it, you know, Post-it notes and Excel spreadsheets? Yeah, so we did our fair share of Excel scheduling, of course; I do that a little bit. But we had an interesting journey where we initially used TensorBoard, but then TensorBoard didn't scale for us because of the data on disk, so it just didn't work. We switched to Visdom, but Visdom is a little bit too bare-bones. It's very flexible, but there were also other issues there. So we were really starting
to think about this, and at the same time we got in touch with a startup called W&B, Weights & Biases. W&B, Weights & Biases? Yep. Lukas was just creating this company, and they talked to us and OpenAI, asking: what do you guys need? We worked really tightly with them, and we're happy customers now. We have this really cool experiment dashboard, an experiment management system where we can do a lot of visualization of experiments, multi-user, multi-project. It scales really well, and so that's what we use today. So again, we're not optimizing for cost, we're optimizing for time, and because there's a lot of excitement around machine learning, there are a lot of opportunities to work with great partners. So that's our approach. When you first mentioned scale there, you were talking about the on-disk performance of TensorBoard. But later, when you're talking about scaling, how is it doing in terms of, I mean, you're not a huge organization, but is it scaling in terms of the number of experiments that you do? Right now we're probably fewer than 20 users, so that's the proviso; I can't speak to scaling in the number of users. But yeah, we do a lot of experiments. As you know, a lot of research is the spaghetti-plate strategy: you throw it at the wall and you see what sticks. So hyperparameter optimization and all these kinds of things mean that for a single researcher, especially when you have this nice infrastructure in terms of machines and experiments you can run, you're going to have a firehose of metrics you want to visualize, and that scales really well for us. And again, we're not in the business of making dashboards or these things, so we're really happy to partner on or buy whatever is not our core business, which is really about these deep learning models for driving. And does W&B do the hyperparameter
optimization for you? They have [inaudible]. Oh yeah, so we do our own stuff there. They have some services there, which we don't use. I'm still on the fence on whether hyperparameter optimization is something internal or something that we can partner on, because there are typical patterns and typical algorithms. I've been a big user of Hyperopt, so you can do Bayesian hyperparameter optimization and these kinds of things, and this is almost standardized, so you could imagine having a service for that instead. But some of these things actually have to hook in really deep, so it depends on the model and the research project you're doing. There's a blurred line there. But there are awesome general-purpose algorithms. I think the one that I've used the most recently and that I really like is Hyperband, which is kind of a bandit approach that uses the fact that our optimization is sequential, so you can restart from a checkpoint and continue, and these kinds of things. So yeah, I'm on the fence: some of the things in-house, some of the things black box. And [inaudible], is managing it something that you don't have to worry about? Aside from the distributed file system issues that we've talked about, a lot of the traditional enterprise data management, you know, data lakes, data warehouses, hooking into data stores: is it that you just need big disks to store stuff? Yeah, so we use S3 a lot. We used to use S3 directly, and basically what we've been doing is just like what happened with processors: we add layers of cache, hotter and hotter caches, which get smaller as they get hotter, and we have to asynchronously, preemptively fill those caches and these kinds of things. So it's all these caches, like the various S3 features, like Glacier and all these? Yeah, so you can see basically all these different
types of storage as these caches. I call them caches; it's a metaphorical cache, right? Right. With S3, we used to use S3 directly, like S3 to GPUs, and that obviously didn't scale. So we added these different storages and all those things, and then you have these prefetching queues that are literally filling the RAM with data next to your GPUs, and so you have this ultimate layer of cache. OK. And then on the back end, do you have to worry about inference and model serving? At this moment, the models we serve are in our robots. That's the big thing: the models we serve are the ones that are going to be in the path to actuation of the car. So there we have amazing driving technology teams that own parts of the stack: we have an object perception team, we have a SLAM team, we have a planning and controls team. These guys basically take our models, or make their own, and make them more efficient to fit them within the computational budget that they have, and that's how we serve models. So that's a fairly different model than, let's say, a web-based application. Yeah. And is that process of getting the models to fit, you know, compression or pruning or what have you, still a manual process to a large degree? To some extent there are these, and it's going to sound a little bit weird, there are some products for it, like TensorRT and these kinds of things, but it's still more of an art than a science. It doesn't always work: it works for certain types of networks out of the box, but for some it doesn't. And so do you have some tools or a bag of tricks for internal purposes that you can throw at this? That you should throw at this, definitely, and that our teams are using. But upstream, on the research side, in my team, what
we're doing is this: you can learn models that are compressible, or that are amenable to compression, by having some compressibility factor built in. You can also have small models, as I mentioned, and there are more and more research results showing that small models can generalize as well as big models; they just have to train for longer, or you have to change the learning algorithm. Another one of the big things we're trying to figure out is how far we can go with multitask learning, because if you can have a shared backbone and squeeze many different things onto it, that's awesome. One recent project that we did was around panoptic segmentation, which is basically taking semantic segmentation and instance segmentation, so Mask R-CNN and these kinds of things. That works really well, but it's extremely slow; I think Mask R-CNN runs at something like 150 milliseconds per image. So you have to reduce those models and merge them together, maybe with different heads, to make it efficient. We wrote a recent paper, which is going to be on arXiv soon, called TASCNet, for Things and Stuff Consistency Network, where we basically merge them together and enforce cross-task consistency. The main problem of multitask learning is that if you just sum the losses, they may contradict each other. It's a bit like when you're arriving at an intersection and one says turn left and the other says turn right: you don't know, so you go in the middle, and that's not a good idea. So imagine the gradients might be pushing in orthogonal directions. One of the key things we did is we augmented the objective with a consistency-encouraging term between the stuff classes, so road, sky, et cetera, on the semantic segmentation side, and the thing classes on the instance segmentation side. So merging these networks is one way to be more efficient, and yeah, that's another
very recent work that we've been doing. How have we done in terms of getting a lay of the land of your presentation? We went way beyond. Nice, awesome. Any parting thoughts or words? No, I think, you know, we're really excited to continue in this direction of large-scale deep learning in the cloud and to tackle these really, really challenging open research questions. We're continuing to grow very, very fast, and we're excited to be in that space, self-driving, robots, and deep learning. So I'm very happy to have been able to talk about it. Well, thanks so much, Adrien. It was awesome having you on the show. All right everyone, that's our show for today. For more information about today's guest, or to follow along with our AI Platforms Vol. 2 series, visit twimlai.com/aiplatforms2. Thanks once again to SigOpt for their sponsorship of this series and support of the show. To check out what they're up to and take advantage of their exclusive offer for TWiML listeners, visit twimlai.com/sigopt. As always, thanks so much for listening, and catch you next time. [Music]
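The data-loader change described in the conversation (workers that are started once and kept warm, producing indefinitely into a queue while the training loop consumes) can be illustrated with a minimal sketch. This is not TRI's implementation, which the episode does not show. The class name WarmPrefetcher, its parameters, and load_fn are hypothetical, and threads stand in for the worker processes mentioned in the talk purely to keep the example short and standard-library only.

```python
import itertools
import queue
import threading


class WarmPrefetcher:
    """Hypothetical sketch: producers are started once and reused for every epoch."""

    def __init__(self, load_fn, num_items, num_workers=2, capacity=8):
        self._load_fn = load_fn
        self._num_items = num_items
        # Bounded queue: producers block when it is full instead of filling RAM.
        self._queue = queue.Queue(maxsize=capacity)
        self._stop = threading.Event()
        self._workers = [
            threading.Thread(target=self._produce, args=(w, num_workers), daemon=True)
            for w in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def _produce(self, worker_id, num_workers):
        # Each worker cycles forever over its own shard of sample indices,
        # so no worker is torn down and recreated at an epoch boundary.
        shard = range(worker_id, self._num_items, num_workers)
        for index in itertools.cycle(shard):
            if self._stop.is_set():
                return
            item = self._load_fn(index)
            while not self._stop.is_set():
                try:
                    self._queue.put(item, timeout=0.1)
                    break
                except queue.Full:
                    continue  # queue is full: retry until consumed or stopped

    def take(self, n):
        """Consume the next n prefetched items (blocks until they are ready)."""
        return [self._queue.get() for _ in range(n)]

    def close(self):
        """Ask the producers to stop; they exit within roughly 0.1 seconds."""
        self._stop.set()


if __name__ == "__main__":
    prefetcher = WarmPrefetcher(load_fn=lambda i: i * i, num_items=6)
    for epoch in range(3):
        batch = prefetcher.take(6)  # same warm workers serve every epoch
        print(epoch, sorted(batch))
    prefetcher.close()
```

Because the producers never stop, an "epoch" here is simply the next num_items items taken from the queue, and the interleaving across workers is not deterministic, which echoes the race-condition caveat raised in the interview.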
Original Description
In this, the kickoff episode of AI Platforms Vol. 2, we're joined by Adrien Gaidon, Machine Learning Lead at Toyota Research Institute. Adrien and I caught up to discuss his team’s work on deploying distributed deep learning in the cloud, at scale. In our conversation, we discuss:
• The beginning and gradual scaling up of TRI's platform.
• Their distributed deep learning methods, including their use of stock PyTorch.
• Applying DevOps to their research infrastructure, and much more!
The complete show notes for this episode can be found at twimlai.com/talk/269.
Thanks to SigOpt for their continued support of the podcast, and their sponsorship of this episode! Check out their machine learning experimentation and optimization suite, and get a free trial at twimlai.com/sigopt.
Finally, visit twimlai.com/3bday to help us celebrate TWiML's 3rd Birthday!