Nicolas Koumchatzky — Machine Learning in Production for Self-Driving Cars
Key Takeaways
Nicolas Koumchatzky discusses machine learning in production for self-driving cars, covering topics such as deep learning, online learning, and model deployment, with a focus on Nvidia's production-grade machine learning platform, MagLev.
Full Transcript
Hi, I'm Lucas, and you're listening to Gradient Dissent. We started this program because we're super passionate about making machine learning work in the real world by any means necessary. One of the things we discovered in the process of building machine learning tools is that our users and customers have a lot of information in their heads that's not publicly available. A lot of the people we talk to ask us what other people are talking about, what other people are doing, and what the best practices are. So we want to make all the interesting conversations we're having, and all the interesting things we're learning, available to everyone out there. I hope you enjoy this.

Today our guest is Nicolas Koumchatzky, who is currently a director of AI infrastructure at Nvidia. Before that he ran Twitter Cortex and was one of the first people to put deep learning models into production at scale. Nicolas, thanks so much for taking the time to talk with us. You're an expert on deploying deep learning in the real world, and I would love to hear how things have changed since you've been doing it. I think you started doing this in 2016, or maybe even earlier, at Twitter. What were the challenges then, and what are the challenges you're seeing now in making these models actually work?

Thank you, Lucas. I started learning about deep learning in 2014, by the way, so I'm not one of the old school. But I got hooked pretty quickly. Then I started a small startup with five people or so, and we were acquired by Twitter. At Twitter we started the first deep learning team, basically. Twitter didn't have any deep learning knowledge at the time, so we were paired with software engineers there to productionize deep learning on some product
areas that could benefit from it.

What were the first areas where they felt they could get a benefit? Was it vision stuff?

Kind of a mix. Mostly vision, yes, and text. We started with two main projects. One of them was filtering bad image content. The other, which was more of a positive product feature, was deciding whether a user profile was safe to place ads on or not. This is a big deal for advertisers, because they wanted to be sure that the profiles their ads appear on are not toxic or insulting, or any of the kinds of accounts you don't want an ad next to. We were able to classify those profiles using the text, the images, and user features as well, which allowed us to put ads on profiles. As you can imagine, this generated quite a bit of revenue. So that was the beginning.

Was this shifting from an existing, more traditional model to a deep learning model, or was it a new deployment for a new problem?

In those cases there had been interest in those areas, but without deep learning it was almost impossible to perform at the required accuracy. For example, advertisers expect something like 99.9% accuracy. That was unachievable using just tabular features and decision trees. I think it would have been doable with enough effort, but much more complex.

These sound like applications you could run as a batch process in the background. It doesn't need to run live on user queries, or does it?

That's true, it doesn't have to. You're right, except potentially for filtering images: whenever a user posts an image, we want to make sure those images are hidden right away, for example for certain categories of users. One thing that required a lot of real-time processing
was catching very bad images. I'm not going to go into details, but we wanted to do that in real time, before they hit the platform. In that case we would have a budget of maybe a hundred milliseconds, or even less than that, to catch them.

This was 2016, so how did you get that working in real time in production?

Right, that was interesting. I'm sure you know the details of the typical frameworks back then. Basically there was Theano and Lua Torch.

And Caffe.

That's true, Caffe, yeah. I forgot about Caffe. We weren't using Caffe, although it would have been a pretty good solution in that case. We were using Torch. For training it's great; deploying it to production, however, is more difficult. So we took what we had and wrapped it up into Scala services. That was a lot of effort, to make sure it was working, stable, and so forth.

So you ran Torch in production?

Yeah, we were running Torch in production.

Wow.

Yeah, that was a lot of effort. It required so much engineering.

And you had 50 milliseconds to make a decision. That sounds like a real feat. Were you doing that yourselves, or were you handing it over to a different team?

No, we were doing it internally. We had to, because it required a lot of expertise about what could make the models go faster or slower, the batch size, all those things. So we had to do everything in house.

Were you retraining live too, or how did that work?

We did that, but later. In the first part we used deep learning only for images and text, mostly, to catch bad content, for example. But after a little while we started looking at other problems that were more fundamental, like ads placement,
timeline ranking, things like that, which are more tabular-based: using user features and item features to make the best prediction of whether a user is going to engage with some content. We used deep learning for that too, and we managed to get better results than with traditional techniques.

The reason I bring all of that up is that, for ads placement for example, it's very important to have access to the latest and greatest features, and also to do what we call online learning, which is learning continuously, or at very high frequency. Otherwise there is a quick decay of performance. The decay half-life is, I don't know, maybe something like one to three minutes, and then the model starts to decay. So we had to do online learning for that.

Wow. So you would retrain every minute?

There are multiple ways of doing it. One way is just to do online learning right away: keep training online. You could also freeze some of the layers and only retrain the last logistic regression, for example. That's the easiest one, actually. I don't know if you're familiar with the wide-and-deep architecture, for example?

Yeah.

You could keep retraining the memorization part, which is usually where the decay happens, and keep the other one constant. I think some companies do that. I think Google does that, for example; I'm not so sure, but they retrain regularly, maybe every five minutes. They take the existing model in prod, fine-tune it, redeploy, fine-tune, redeploy.

Do you ever go back and retrain it from scratch, or is it always just online?

We do, mostly if we want to add new features or change the model architecture, right?
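Editor's illustration: the "freeze some of the layers and only retrain the last logistic regression" option can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Twitter's implementation. The frozen layers are replaced by a hypothetical fixed random projection (`W_FROZEN`), the data stream is synthetic, and `OnlineLogisticHead` and `simulate_stream` are illustrative names. Each mini-batch is scored before the head is updated on it, which is one simple way to measure a model that never stops training.

```python
import numpy as np

# Hypothetical stand-in for the frozen deep layers: a fixed random projection.
# In a real system this would be the pretrained network below the last layer.
W_FROZEN = np.random.default_rng(0).normal(size=(8, 16)) / np.sqrt(8)


def frozen_features(x):
    """Frozen part of the model: never updated online."""
    return np.maximum(x @ W_FROZEN, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(p, y):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())


class OnlineLogisticHead:
    """Last-layer logistic regression, updated on every incoming mini-batch."""

    def __init__(self, dim, lr=0.3):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, h):
        return sigmoid(h @ self.w + self.b)

    def partial_fit(self, h, y):
        # One SGD step on the mean log loss of this mini-batch.
        err = self.predict_proba(h) - y
        self.w -= self.lr * (h.T @ err) / len(y)
        self.b -= self.lr * err.mean()


def simulate_stream(head, n_batches=200, batch_size=64, seed=1):
    """Score each batch first, then train on it; return the per-batch losses."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=W_FROZEN.shape[1])  # synthetic ground truth
    losses = []
    for _ in range(n_batches):
        h = frozen_features(rng.normal(size=(batch_size, W_FROZEN.shape[0])))
        y = (rng.random(batch_size) < sigmoid(h @ true_w)).astype(float)
        losses.append(log_loss(head.predict_proba(h), y))  # evaluate first
        head.partial_fit(h, y)                             # then update
    return losses


if __name__ == "__main__":
    losses = simulate_stream(OnlineLogisticHead(dim=16))
    print("first batch loss:", round(losses[0], 3))
    print("mean loss over last 20 batches:", round(float(np.mean(losses[-20:])), 3))
```

Only `w` and `b` change while the stream runs, so each update is cheap enough to apply at high frequency; retraining the frozen part would be a separate, slower job.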
Yeah. I guess, how do you even evaluate whether a new architecture is going to be better? It seems like that would be kind of tricky.

Yeah, we have to simulate the fact that it's online learning. In that case there has to be a time period where we say, OK, we stop training, and we look at everything that comes after; then we can evaluate while we keep learning. It's possible to simulate the situation, basically. It's more infrastructure.

Wow. This must have been an incredibly high amount of compute.

Yeah, pretty high back then. And interestingly, all CPU.

All CPU?

All CPU, because GPUs, well, I work at Nvidia now, but they were not that easy to use in that context. There was less tooling. Now it's changing with, I don't know if you've heard about RAPIDS, for example: basically data science accelerated on GPU. A lot of libraries are available now, but back then there was none of that. So we had to write the code on CPU and make it really, really fast.

Were there other pieces of infrastructure that you had to build to get this working in 2016?

2016. You mean besides the inference part?

Yeah.

For the training part, one of the challenges we had was that a lot of the machine learning practitioners were used to decision trees and a certain API and configuration. They were not familiar with Lua Torch; few people are familiar with Lua in general. So we had to hide this behind configurations. We built infrastructure to simplify their lives, such that they could copy-paste configurations and just specify their features, like: OK, these are the features I have, this is the table, these are the steps I want to run, training, validation, whatever. And then, basically,
it automatically saved a model. So we brought a lot of automation to the training phase, at the cost of flexibility at the beginning. Then it changed.

And how did it change?

Once the company started realizing the impact and importance of this at Twitter, they decided to also hire people who could understand it, and to really invest in education. That was one thing. At the same time we decided to centralize the machine learning platforms and basically move to TensorFlow. Why TensorFlow? Because back then PyTorch was still very new and unstable, not even 1.0 I think. TensorFlow also had an inference story, which PyTorch didn't have, and so forth. That's why so many recommender systems were using TensorFlow in those days, because it had that complete story. Anyway, we moved to TensorFlow pretty quickly after that, for training and inference.

I see.

For inference we used only part of it, just the C++ library part. The thing is, Twitter has its own data formats and serialization formats, so we had to work with that: for example, Thrift instead of [unclear].

Are there any other surprisingly challenging things from that time? Stuff that maybe academics, or people who don't work on these large-scale deployments, wouldn't know about? Any other tricky pieces?

You mean from Twitter?

Yeah, Twitter.

I think in some parts there was disbelief around deploying neural networks. I don't know if that's what you're asking exactly. I think it still exists in the medical field, for example, where people ask for interpretability, explainability, and so forth, even at the expense of better performance. But eventually, you know
It's so funny: in 2005 I worked at Yahoo, moving models from rule-based ranking systems to boosted trees, and they had all the exact same complaints. People were like, "Oh, these models are not explainable, they're impossible to deploy." It's exactly the same, but now it's the same complaints about moving away from decision trees.

Yeah, exactly. Well, the difference for infrastructure is that we replicated almost the same APIs as what Twitter had for decision trees, so it was a little bit easier. It was already kind of ML-ready. In your case I guess it was completely different, right?

Yeah. But it sounds like you had to add some components and abstract them away in the same way.

Oh yeah, we just abstracted everything away and made it look the same, basically. We gained adoption that way. That was fun.

So when did you move to Nvidia?

That was in 2018.

And tell me about the stuff that you've been working on at Nvidia.

It's quite different in terms of the application domain. I'm managing the team building the platform to develop autonomous vehicle software. In autonomous vehicle software I also include deep neural networks, all the validation required for them, and so on. It's a pretty big endeavor. The reason is that autonomous vehicles are such a large-scale effort: there are so many people working on it, and there are so many specific needs, that we have to build relatively custom infrastructure in order to be efficient, good at it, and competitive.

Do you mean custom infrastructure for self-driving cars, or custom infrastructure for every individual team working on self-driving cars?

It's the nature of developing models and what we call
perception, the ability to understand the world from the car's multiple sensors and so forth. Developing this requires a lot of customization of the cloud infrastructure, is what I'm saying. As an example, all machine learning teams use a workflow system in order to say, "I want to do this task, then this task, then another task," and so on. In the case of autonomous vehicles, the big difficulty is that the different steps are going to be in so many different languages and so many different libraries. One is going to be data preparation using Spark. One is going to be, "Now I want to run the actual software from the car on the target hardware," which is the actual embedded hardware, racked in the cloud. Then, "I want to run this using CUDA," and then, "I want to run a container." All of these things are so different that they require a workflow system that's agnostic to all of this and that can be deployed on heterogeneous hardware. I'm bypassing the details, but in some aspects we have to develop our own customized infrastructure.

Got it. I guess you're starting to talk about this, but what are the big components of the infrastructure that you built, and what are the big problems at each component?

Yeah. At the top level, where we interact with our users, we provide tools, SDKs, and libraries. At the bottom level we have core components that everybody can leverage. At the top level, we start with everything outside the car. When drivers collect data or test a new build of the software system, they take out the data or send it over Wi-Fi. Then it gets into the system and needs to be ingested. So that's the first step, ingestion. Ingestion is already
pretty complicated, because it's similar to Yahoo or Twitter, where you need to be write-heavy and then have somewhere to process the data once it's written and transform it into datasets that are more consumable by users. Those challenges are pretty massive, because we need to test for data quality, for example, we need to index the data, and we need to process it or transform it into something that's easier to consume downstream, and so forth. That's the first step.

The second step is to build the best datasets, and that's actually a big challenge. I'm sure you're familiar with this, but we view machine learning as Software 2.0, as Karpathy laid it out (I don't know if he was the first). Data is the source code of machine learning, so we need to be very careful about how we write our source code. In order to do that, we are developing tools to curate datasets: select the right frames or the right videos with the right filters, and make sure there's no overlap between training and validation. We have a lot of tooling for that.

So these are tools that help a user pick the data, or do they somehow pick it automatically?

Both, actually. We're also investing a lot in active learning. Since you were at Figure Eight, I'm sure you have a lot of experience there.

Yeah, but I'm always fascinated by it.

We published a blog post recently about exactly that. Autonomous vehicles lend themselves perfectly to active learning: massive amounts of data, but very costly human labeling. If you want to do 3D cuboid labeling, it's so costly, and yet there are thousands and thousands of hours of driving data available. So we really have to select the data that's going to be the most efficient and that's going to find the patterns that the DNN, the deep neural network, is
not able to find. To do that we use active learning. Active learning basically gives us uncertainty scores, and the frames or videos with the highest uncertainty are the ones we want to label in order to improve performance. We tried it, and we got 3x, actually 3x to 5x, higher improvements using active-learning-selected data versus manually curated data, not even random data.

Manually curated, so by humans? Humans guessing what data is going to be the best?

Yeah, exactly. The challenge was: let's find VRUs, vulnerable road users, at nighttime. It's a super challenging problem, because of course it's difficult for cars to see at night with a camera, and in general pedestrians and bicycles are the two categories that are so difficult to detect anyway. So the idea was to find these in the dataset. The first group did manual curation: a group of people looked through the videos to find images relevant to these classes. The other group just used the models: for these specific classes, find the frames with the highest uncertainty. So that one was completely automated, and we were able to find the frames that were very uncertain for pedestrians and bicycles; we chose maybe twenty thousand. The manual curation selected the same number, twenty thousand. What they usually do is swipe through videos, and when they find pedestrians or bicycles at night, they stop and select a few frames in that segment of video. Then we trained models on these two sub-datasets and looked at the validation performance, and the increase in performance was three times higher for active-learning-selected data.
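Editor's illustration: the selection step described above (score every frame by model uncertainty, then send the most uncertain frames to labeling) can be sketched as follows. The interview does not say how the uncertainty score is computed, so this sketch assumes predictive entropy over class probabilities as a stand-in. The function names, frame ids, and probabilities are all hypothetical.

```python
import numpy as np


def predictive_entropy(probs):
    """Entropy of each row of class probabilities (higher means more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def select_for_labeling(frame_ids, probs, budget):
    """Return the `budget` frame ids with the highest predictive entropy."""
    scores = predictive_entropy(np.asarray(probs, dtype=float))
    top = np.argsort(-scores, kind="stable")[:budget]
    return [frame_ids[i] for i in top]


if __name__ == "__main__":
    # Hypothetical model outputs for four unlabeled frames and three classes.
    frame_ids = ["frame_a", "frame_b", "frame_c", "frame_d"]
    probs = [
        [0.98, 0.01, 0.01],  # confident prediction
        [0.40, 0.35, 0.25],  # uncertain
        [0.70, 0.20, 0.10],  # moderately confident
        [0.34, 0.33, 0.33],  # close to uniform, most uncertain
    ]
    print(select_for_labeling(frame_ids, probs, budget=2))
    # prints ['frame_d', 'frame_b']
```

With a labeling budget of twenty thousand frames, as in the experiment described, `budget` would be 20000 and `probs` would come from running the current model over the unlabeled pool.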
Yeah, so it does work, and it can be completely automated if you think about it.

Well, that's really impressive. That's amazing. And that's in your blog post? We should definitely get a link to that.

Yeah. We were impressed too, really, because it was just an experiment, a research experiment. Now we're working to automate it, and to be able to automatically select data, retrain models, and improve performance. We could have a machine learning factory, right? With a human in the loop just for the annotations.

OK, what else? So there's this data collection...

Then there's labeling, which I'm sure you're very familiar with from Figure Eight. Autonomous vehicles come with some pretty massive challenges. First, the scale is massive: Nvidia has a thousand-plus labelers in India. We're doing it ourselves. The software to dispatch requests to those labelers and manage them is, as you probably know, quite complex, because it has to deal with human workforces and the way they behave: when they refuse tasks, when they make mistakes, and so forth, with quality assessment integrated in the loop. That's one. But the UI tools are also pretty tricky. For example, we sometimes need to be able to draw things like polygons [unclear] and make sure we can link lidar data with image data. So we need a mix of human labeling and automated computing to be able to link these two things, for example, or to build a new representation of the data that is then usable by those humans. Doing the two at the same time is pretty complex and difficult. So we built all that. That's step number three, basically: labeling the data.

Then step number four is about training. We've developed a lot of code to enable all our developers to train their models. One of the
biggest challenges we have is that once we train, we need to export the model and run inference on an embedded system, so we are compute-constrained in a way. We cannot deploy a thousand servers to crunch everything; we need to use a single chip to compute everything. To make that work, we use multitask training, for example, where we have one single model body that can predict multiple things: path detection, obstacle detection, lights, signs, intersections, and so forth. I think this is similar to what Tesla is doing; I know they gave a talk recently about that. Then there's a lot of optimization, such as pruning the models or INT8 quantization, or other techniques that we can use to further reduce the size of the model with equal performance.

So your tools do all of this? Is there stuff left for a perception team, your customer, to do? How do you think about that?

Yeah, we provide this as part of the core libraries. But of course sometimes they need to do something new. When they need to do that, they can add their own algorithm and so forth, and then we productionize it in the platform. What they're really focused on, on the perception side, is not really those features. I mean, they are looking into them, and that's very important, but they're also looking a lot into new types of predictions. For example, they were detecting bicycles; now they want to predict more fine-grained things, so they're going to have more classes. They do a lot of those things that we don't have to care about. We just provide the infrastructure.

So you provide core infrastructure to do multitask learning and quantization, but
then the customer provides the different types of classifications that they would want?

Yeah, exactly. But of course there's some overlap between the two, and we help each other.

Do you handle some of the newer stuff, like trying to figure out intentions, or trying to map out the underlying dynamics of a person, where their arms and head are and stuff like that? Is that within your scope?

Not my team specifically, but the perception team is definitely looking into things like that. That's usually more on the research side, a bit more advanced, but yes, definitely.

OK. And does this work with different ML frameworks, or is it at a lower level? How does that work?

We do work with the ML frameworks, just because they provide so much value, for training specifically. We use TensorFlow a lot, and PyTorch a little bit too; I think that's mostly historical. For deployment, though, we use TensorRT, which is Nvidia's deep learning inference library. What's great is that it's really optimized for Nvidia hardware, of course. It's also optimized for inference, so you can do some optimization of the graph, for example, and we do that using TensorRT. We get pretty big performance gains with that.

Cool. Wait, is that the whole thing? So you've got data collection...

No, that's not all. There's evaluation: the validation of the model. Let's say you train one model that does some kind of detection. What you really want is to understand, if a modification of that model changes its predictions, how that is going to impact the overall system. That's a very tricky question that requires a pretty fine-grained understanding of the impact. Let's say we have this perception system that's a mix of post-processing, Kalman filters, neural networks, and so on and so forth;
they're all mixed in pretty complex ways. What we want is to have multiple levels of KPIs, and pretty large sets of KPIs, to understand what's happening in this system. That's step number one: for example, false positives, false negatives, and so on. The next level is at the perception API level: how many mistakes do I make per hour, for example, in the detection of a car? And I also want to understand, even further than that, how I drive the car, which involves simulation. So we want to be able to run simulation jobs with this new perception system to understand how the system behaves. We want to do all of this, so we have a system to evaluate all of these things at scale, together, on the same infrastructure: same data structures, same tools, same kind of output data, same analytics library, and so forth. The output of this is all these KPIs, plus what we call events. A false positive is an event, for example; you can define an event as anything.

Once we have all of this, the AV developers can look at all this information, and this is the next step, which we call debugging. It's also Software 2.0-based: debugging the output of a predictor. We look at the output of the predictor, we can look at the KPIs plus all the events, and then zoom in on those events and take a very fine-grained look at them, comparing the prediction with the ground truth, for example, or seeing whether something is missing when there is a lens flare. So we can go very deep, then come back up high, and then make a diagnosis about what's going wrong in the system. This diagnosis is how we improve the system. It tells us, for example, "I need more data in Japan at night," and then we can go back to the curation step, which is building better
datasets. So this is the feedback loop that goes from debugging to the curation step, and it helps us improve our perception system.

So your user could basically request, "Give me more bicyclists in the snow" or something, and then have the curation step go out and look for more of that, or weight that more?

Yeah, exactly. Based on what the curation can do, which could be geographical conditions, or maybe temperature if we have access to that type of sensor. But yeah, definitely.

That's amazing. I'm just trying to put myself in your shoes: how do you approach making such a sophisticated system on behalf of customers? Do you build your own perception systems just to try your own software? How do you think about that?

At the bottom of this end-to-end workflow we have core components, which are our data platform and our workflow management system. Those two things power everything: being able to write ETLs, to register datasets, to perform queries on data, and to make sure that all of these things are traceable end to end, which is a major requirement for the autonomous vehicle industry, so that if there is a problem in the future, we can go back in time and understand everything that happened. Anyway, all those things at the bottom power the top layer, and they're pretty beefy and made for scale.

Sorry, so the first thing is data storage?

It's the data platform. The data platform and the workflow management system.

Gotcha. And the data platform, is it just keeping track of where all the data is?

No, it's a bit more than that. It's all the infrastructure required in order to store
structured and unstructured data. Structured data could be anything, like simple floating-point continuous values, and the raw data is all the sensor recordings. So we have all of this, and we can organize it. The second step is that we want to be able to query all of this at scale, so we use Spark SQL, and Spark in general, to enable us to do that. So that's what the data platform provides, all those pieces.

Workflow management is more about the ability to schedule those complex compute and data access tasks and stitch them together. We know we can organize data in a certain way, and we know we can access a cluster, but then we need to make sure that, like I explained, we can execute those graphs of tasks. Sometimes that requires a lot of scale, for example when we do evaluation on thousands of hours of data. So we need a workflow system that enables us to do all of this.

Interesting. One thing I didn't hear you mention that a lot of people talk about is synthetic data. Is that interesting to you?

It is for us. We have a simulation team. I think I mentioned it, for testing.

For testing, yeah. I wasn't sure if those were totally synthetic simulations or...

Well, we can do both. In open loop, which means no control and planning in the loop, so no actual driving, we can replay existing data. That's really good, because then we can measure on real data. But for closed loop, which is really driving in a world, we need simulated data. This is where Nvidia shines, because we can of course generate simulated worlds, like in video games. Even more than that, we have the ability to generate all kinds of sensor data for the car: also lidar data, radar data, and also CAN, IMU, all those things that are car-specific. We can generate all of this. We have a special box that we call Constellation, which has this simulation generator on one side and what we call the ECU, the embedded system, on the other side, which can process all those sensor inputs in the same box. So it basically does the exact processing of the simulated data. We can use it for testing, and we can also use it for collecting data and training on data that just doesn't exist in the real world, for example. So it's very helpful for bootstrapping perception efforts, bootstrapping new neural networks.

Right. It sounds like you have almost a complete end-to-end solution. Could I come to you with a car and some sensors and get a system that could make an autonomous vehicle for me?

Yeah, that's exactly it. Yes, you can, except that it's difficult to change sensors, as you can imagine, because if we use a different sensor we have to re-collect data, revalidate, and retrain or fine-tune models, and so forth. But assuming the sensors are similar to what we have [unclear], we can redo that work, yes.

So I guess you're the perfect person to ask: what do you think is left to do? I live in San Francisco, so I do see autonomous vehicles driving around a fair amount. But what pieces do you think are left to work on to make it a real thing that I would use every day?

So you don't use it every day, is what you're saying? [unclear] I'm in the industry.

Oh, I see.

Do you have a Tesla, like a Model 3 or anything?

No. I mean, I've played with them, and I think
they're like very, very impressive. So I guess maybe that's a good point you're making. But I guess, again, what do you think are the next steps with these systems? What are you thinking of focusing on? Honestly, first, I think this is gonna be pervasive, and I think in the future everyone's gonna have autonomous vehicle functionality. That's number one. Yeah. And I think the vision goes even further than that, which is that cars will become software-defined, or are well on the way to becoming software-defined, and, you know, people are going to see a centralized computer with a really nice UI/UX, right, and they're gonna be able to buy new software, potentially, to upgrade their car. And this is already what's happening with Tesla. Yeah. And I think it's really one of the reasons why their capitalization is so high, the valuation is so high. Yeah. But the other carmakers are also looking into that model, you know, and are interested in it, and I think this is the future of the industry. So for us, I think we need to be ready for this world at NVIDIA. So that means having a programmable platform, an open platform, right, because we want to enable all those carmakers or tier ones to build those systems together on the same chip, on the centralized computer. We don't want to exclude them, basically, from our hardware. We want to enable people to write software on our chip: infotainment software, self-driving software, and so forth. Right. And now, self-driving is so difficult that we can provide it for them, you know, as a given application, for big carmakers or even smaller ones, and just develop it as an application. And then where we're going with that, I think, is a matter of performance and improving the control, planning, and the entire perception system. I think we're all still at the beginning of it, and we're gonna be
able to do better and better and better over time by building a lot of automation first, and potentially adding machine learning in areas that don't have machine learning yet, such as predicting the planning paths, for example. Right, uh-huh. I think things like that. Yeah, and anyway, that's pretty much it. I guess what I'm hearing is you think there's sort of iterative improvement in a bunch of different things, and then applying machine learning to planning is the big one, sort of the next steps in making these systems work better. I'm curious, what do you think is the stuff that people are really wrestling with right now to make autonomous vehicles really work? I think the hardest thing is urban areas right now. So being able to drive in urban areas, in New York City for example, is really, really hard, and that's the next frontier, and it requires all sorts of new signals coming from the car, you know, for example intersections and lights, lack of lanes, right, things like that, that can be very tricky, or unknown kinds of vehicles such as garbage collection, stuff like that. Right. So all of this is still a little bit newer, and all the self-driving providers started with easier areas such as highways, except for some of the Level 5 ones, like Lyft, you know, the ones that are trying to already leapfrog that. But that's a big challenge. Yeah, that's the next one. Do you think your approach at NVIDIA is significantly different from, like, Tesla or Lyft? How do you think about that? Yeah, I mean, all these companies are targeting different things, so as a result there are some differences. So Lyft is targeting Level 5, they want to have fully autonomous vehicles. Tesla is building cars, so they don't need to build a platform that's usable by other people, for example. Ah, right. On our side, we're building a platform, and we make money with our hardware. We can
also make money with our software, but our software has to be usable by everyone else, so we have to build it in a way that supports that, right. So this is one of the constraints we have. As a result, for example, the platform we're building, the core infrastructure, is designed in a way that it can be ported to others, like carmakers for example, or anyone, right, people developing self-driving. Do you think your platform, I mean, it's interesting, because all the pieces that you mentioned of your platform, I think, are super relevant to healthcare applications, almost any kind of deep learning application. How do you think about it? Would you ever expand your platform to other applications? Yeah, I think that's possible. Some of these pieces are not required per se, and sometimes the scale we aim for is not required as well, for healthcare, for example. Yep. However, usually what we build is more like a superset of those tools, and we push the frontier a little bit further. Now, some things are a little bit tied to autonomous vehicles, but the entire end-to-end workflow seems very applicable, you're right. The tools themselves are sometimes customized to the data we have, so yes, we could extend them, it would just require some work, basically. Yeah. I'm curious, this is kind of a specific question, but I've been thinking about this lately: how important do you think the hyperparameter search piece is, like the neural architecture search you were talking about? Is that really essential or is that a nice-to-have? So neural architecture search is still something we're exploring. I think it can be important because we can really reduce the compute footprint for us. So, for example, we can constrain the search space around neural architectures to something that's going to
perform really well on target hardware in terms of latency. Right, right. Because NVIDIA has some hardware accelerators that are specific, and so we can make sure that we can get this and find the architecture, and so forth. However, hyperparameter search is something that we have available, but the advantage of using it versus the compute it requires is often not super interesting for developers. So we do it sometimes, but it's not really a big competitive advantage, I would say, for us. Yeah. Actually, I mean, your platform sounds amazing and it solves a whole slew of problems. Is there a piece of it that you're especially proud of, that really stands out to you as best in class? I really like the active learning part and everything that goes around that. So one other thing we are doing is what we call targeted learning, which is the ability to take perception bugs, so like, oh, I'm not able to detect, you know, trucks in that position, whatever, and then use that to sample a dataset that is then going to be used for training and to fix the perception bug. Uh-huh, right. And doing that is similar to active learning, but it's like conditional active learning. So I'm really proud of these two things because I really love the automation of it all, right. Like, we could just go on vacation and be like, okay, now just let the system work: we have customers sending their bugs, and automatically we just fix them, you know. Cool. Well, this is so fascinating, actually. You know, even if we weren't recording this for something, I mean, I love talking about this. Exactly, yes. It's great to meet you, thanks so much. That's cool. Well, thanks a lot. All right, that was such a great conversation. Thank you, Lucas and Nicolas. I'm gonna add a link to Nicolas's Twitter in the show notes below, and I would highly
recommend that you guys check him out. Also, if you'd like to continue the conversation, we do have a very active Slack community with over a thousand machine learning engineers, and I would love to see you guys on there. Finally, before we go, I would love to talk to you guys about something that I'm super excited about. So lately we've been working with a lot of self-driving car companies at Weights and Biases, and that means that we've been building native support specifically for self-driving machine learning models. So now, with just a few lines of code, you can do object detection with 2D and 3D bounding boxes within Weights and Biases. You can also do semantic segmentation, so you can compare your model's predictions with the true labels inside your dataset. And finally, my favorite: you can now log point clouds within Weights and Biases. So that means you can now use point clouds to understand your scene, with custom annotation layers. You might use this with something like a dataset of lidar points. So, for example, let's pull out a self-driving car dataset composed of lidar points: you could plop that into Weights and Biases and draw nice little 3D bounding boxes around your cars, people, and other objects within your scene. It's a great time. I'm gonna leave some links in the show notes below so you can try out point clouds, semantic segmentation, and also object detection. It's a really fun time, whether you're working on self-driving professionally or just for fun. I would love for you guys to try it and tell us what you think. Finally, you can also use Weights and Biases to run sweeps, to tune your hyperparameters in a very organized way. This means that you can just give us a list of hyperparameters that you would like to search through, and also a search strategy, and then we will go through and train all these different models and find you the best one in a very organized way, which is very low effort on your part. I'll also leave some links down below for you to try out our
sweeps. That's all for today. We'll see you in the next episode. Have a nice day.
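The sweep idea mentioned at the end of the episode (a list of hyperparameters plus a search strategy, then train each candidate and keep the best) can be sketched in plain Python. This is only an illustration under stated assumptions: `run_sweep`, `toy_validation_loss`, and the search space below are hypothetical names invented for this example, the "training" is a toy function, and none of it is the Weights and Biases API.

```python
import itertools
import random


def run_sweep(search_space, train_fn, strategy="grid", num_trials=10, seed=0):
    """Evaluate hyperparameter configs; return (best_config, best_score, history).

    search_space maps a hyperparameter name to a list of candidate values.
    train_fn takes a config dict and returns a loss (lower is better).
    """
    names = sorted(search_space)
    if strategy == "grid":
        # Exhaustively try every combination of candidate values.
        configs = [
            dict(zip(names, values))
            for values in itertools.product(*(search_space[n] for n in names))
        ]
    elif strategy == "random":
        # Sample num_trials configurations independently at random.
        rng = random.Random(seed)
        configs = [
            {n: rng.choice(search_space[n]) for n in names}
            for _ in range(num_trials)
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    history = [(config, train_fn(config)) for config in configs]
    best_config, best_score = min(history, key=lambda item: item[1])
    return best_config, best_score, history


def toy_validation_loss(config):
    # Hypothetical stand-in for a real training run; its minimum is at
    # learning_rate=0.01 and batch_size=64.
    lr_term = (config["learning_rate"] - 0.01) ** 2
    batch_term = ((config["batch_size"] - 64) / 64) ** 2
    return lr_term + batch_term


search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}
best_config, best_score, history = run_sweep(
    search_space, toy_validation_loss, strategy="grid"
)
print(best_config, best_score)
```

Grid search is exhaustive and predictable for small spaces; the random strategy trades coverage for a fixed trial budget, which matters when each trial is a real training run.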
Original Description
👨🏻💻Today our guest is Nicolas Koumchatzky.
Nicolas Koumchatzky is the Director of AI infrastructure at NVIDIA, where he's responsible for MagLev, the production-grade machine learning platform by NVIDIA. His team supports diverse ML use cases: autonomous vehicles, medical imaging, super resolution, predictive analytics, cyber security, robotics. He started as a Quant in Paris, then joined Madbits, a startup specialized in using deep learning for content understanding. When Madbits was acquired by Twitter in 2014, he joined as a deep learning expert and led a few projects in Cortex, including a real-time live video classification product for Periscope. In 2016, he focused on building a scalable AI platform for the company. In early 2017, he became the lead for the Cortex team. He joined NVIDIA in 2018.
🐦Follow Nicolas on twitter: https://twitter.com/nkoumchatzky
🛠Maglev: https://blogs.nvidia.com/blog/2018/09/13/how-maglev-speeds-autonomous-vehicles-to-superhuman-levels-of-safety/
✍️Scalable Active Learning for Autonomous Driving: https://medium.com/nvidia-ai/scalable-active-learning-for-autonomous-driving-a-practical-implementation-and-a-b-test-4d315ed04b5f
✍️Active Learning – Finding the right self-driving training data doesn’t have to take a swarm of human labelers: https://blogs.nvidia.com/blog/2020/01/16/what-is-active-learning/
Topics covered:
0:00 intro
0:42 Nicolas intro
0:52 how has deep learning shifted since 2016?
11:52 Surprisingly challenging things at the time
13:15 moving to NVIDIA
15:36 components of the infrastructure at NVIDIA, active learning
28:53 How do you approach making sophisticated systems on behalf of customers?
31:24 Synthetic Data
33:55 What is there left to do for the progress of autonomous vehicles?
38:02 Difference in approach between NVIDIA, Tesla, Lyft
41:11 More on Active Learning
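The targeted learning idea from the conversation (take a reported perception bug, then sample similar unlabeled data to train on) might be sketched as follows. This is a minimal illustration and not NVIDIA's implementation: it assumes every frame already has an embedding vector from some perception model, and `select_for_labeling`, the frame ids, and the vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_for_labeling(failure_embeddings, pool, budget):
    """Pick up to `budget` unlabeled frames that most resemble known failures.

    failure_embeddings: embeddings of frames where the model failed
        (the reported "perception bugs").
    pool: dict mapping frame id -> embedding for unlabeled frames.
    """
    if not failure_embeddings:
        raise ValueError("need at least one failure example to condition on")
    scored = []
    for frame_id, embedding in pool.items():
        # Score a frame by its similarity to the closest known failure.
        score = max(cosine_similarity(embedding, f) for f in failure_embeddings)
        scored.append((score, frame_id))
    # Highest similarity first; ties are broken by frame id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [frame_id for _, frame_id in scored[:budget]]


# Hypothetical 2-D embeddings, purely for illustration.
failures = [[1.0, 0.0], [0.9, 0.1]]
pool = {
    "frame_a": [1.0, 0.05],
    "frame_b": [0.0, 1.0],
    "frame_c": [0.8, 0.2],
    "frame_d": [-1.0, 0.0],
}
selected = select_for_labeling(failures, pool, budget=2)
print(selected)
```

Conditioning the selection on failure examples is what separates this from plain uncertainty-based active learning: the labeling budget goes to data that looks like the reported bug rather than to whatever the model is least sure about overall.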
Weights and Biases makes developer tools for machine learning: record and visualize every detail of your research, collaborate easily, advance the