Machine Learning Engineering for Production (MLOps)

DeepLearningAI · Beginner · 📰 AI News & Updates · 5y ago

Key Takeaways

A panel discussion on Machine Learning Engineering for Production (MLOps): what the term means, what MLOps looks like at companies from Google to small startups, and which skills it requires, with a focus on closing the gap between a proof of concept and a production-ready machine learning system and on adapting models to changing data environments.

Full Transcript

Welcome. My name is Ryan Keenan and I'm the Director of Product at DeepLearning.AI. We really appreciate you taking some time out to join us for this event. We've got people all over the world joining us right now.

What we want to talk about today is that the field of artificial intelligence has seen incredible developments in recent years. The performance of machine learning models and the relative accessibility of powerful computers mean that nowadays almost anyone with a little bit of coding skill and access to the internet can piece together a really powerful implementation of at least some sort of proof-of-concept solution, ranging from computer vision to natural language processing and much more. That said, the gap between any proof-of-concept solution that you might put together and a production-ready machine learning system is substantial, and that's what we're here to talk about today.

Machine learning operations, or MLOps for short, is an emerging field within the AI space focused on closing that proof-of-concept-to-production gap. MLOps brings together the field of machine learning (data and modeling pipelines) with the field of modern software engineering (everything from DevOps to deployment). So today we have with us a panel of MLOps experts to discuss what the most important aspects of production machine learning are and what MLOps looks like at companies today, from industry giants like Google to small startups.

We organized this panel to celebrate the launch of the third course in our new specialization, titled Machine Learning Engineering for Production (MLOps). We've created this set of courses to help learners gain practical knowledge and skills in this exciting new field. You can learn more about the Machine Learning Engineering for Production (MLOps) Specialization at our website, deeplearning.ai, or by following the link in the description below. On the specialization landing page you can also find a link to
our Discourse community, where you can connect with other learners and mentors, and where we'll be hosting an AMA session soon with one of the course instructors.

So with that, I'd like to welcome our panelists for today's discussion. We have with us Rajat Monga, co-founder of a stealth startup and former TensorFlow lead at Google. We also have Chip Huyen, adjunct lecturer at Stanford University. We have Robert Crowe, TensorFlow developer engineer at Google and one of the instructors of our specialization; Laurence Moroney, who's an AI advocacy lead at Google and also one of the instructors of this and other specializations at DeepLearning.AI; and our very own Andrew Ng, founder of DeepLearning.AI. It's a great pleasure to welcome you all to this event today.

For the next 40 minutes or so I'll be asking some questions to our panelists, and while I'll be directing my questions to a specific person among you, I'd like to encourage you all to jump in and offer your input on the conversation. We'd like to make it a free-flowing discussion. We'll start off with a little bit more introduction from each of our panelists, so let's begin with you, Rajat. Could you tell us a little bit more about yourself and your thoughts on MLOps to get us started?

Yeah, definitely. As you mentioned, I spent a long time at Google, most recently helping build TensorFlow and then run it for a long time. It's great to see you, Laurence and Robert; I guess we spent a long time together. Over the last year or so I left Google, and I'm working on a startup that I'll reveal in a short while. But going back to the topic of MLOps, something that I spent a long time on, with TensorFlow and otherwise: the way I look at it builds on some things that you alluded to, Ryan, where we've had this modeling where you're defining a proof of concept. You want to see, is it going to add value to whatever you're trying to do? Perhaps you see, okay, yay, it's giving you some
predictions, and they seem valuable. Now, how do I use it? Going from that point all the way to using this model in an application, every bit that you need is part of MLOps, I would say. There are obviously lots of different things that we'll go into today, but just at a high level, that's what I wanted to say.

Thanks very much, Rajat. Chip, let's hear a little bit more from you in the form of an introduction and your thoughts on MLOps.

Hey, so my name is Chip. I'm teaching Machine Learning Systems Design at Stanford, and I'm also part of a very small team that is building infrastructure for real-time machine learning, as in online predictions and continual learning. My background in MLOps actually starts from the first ML course I took, with Andrew, so he was the reason why I got stuck in all of this. Not stuck, I mean, it's a good one. I'm a bit nervous right now because everyone here is someone I have been looking up to for a long time, so I'm very excited and a bit nervous as well.

I started with more of a research background, and then I moved to applied research at NVIDIA, where we built tools for companies to experiment with new models. But then I realized that, of course, there are a lot of problems in doing research, but there are also a lot of smart people working on them, and I think an under-addressed problem is how you bring these models into production. So that's why I joined a startup, Snorkel, to work on it, and now I'm focusing entirely on that. I also believe that training is a small part of the problem; the problem is retraining. Once you have the model out there, how do you keep on updating it? Because, as you may have heard, with things like data drift and concept drift the model's performance degrades in real life. So how do you address that problem? That's what
I'm focusing on, and I think it's part of MLOps as well.

Yeah, absolutely. Thanks, Chip. Robert, could we get a little introduction from you?

Hi, I'm Robert Crowe. I'm a TensorFlow developer engineer at Google; I work with Laurence and used to work with Rajat. So, MLOps. First of all, I'm not crazy about the term MLOps, and there's a lot of disagreement: you talk to a lot of different people and everybody has a different definition of what it is. But essentially, for me, it's the idea of taking a model and really creating a product or service out of it, and all the issues that come up when you try to do that: drift, as Chip mentioned; the problem of getting labeled data, not just for your first data set but for continuing data sets as you retrain your model; things like privacy and fairness and serving different customer sets well; things like resource optimization. All that stuff comes up in a production ML setting that you don't have when you're doing research or academia or what have you. So for me, that's the focus of MLOps: making it possible to create and sustain a product or service responsibly.

Thanks, Robert, and thanks for getting us started with the debate, because I hope we can have more debate about just what this field is and what the most important pieces are. But Laurence, please [inaudible]. [Laughter] Laurence, let me give you a moment to introduce yourself and disagree with Robert.

So yeah, I'm Laurence. I lead AI advocacy here at Google, and I'm actually really excited about the fact that we're having these conversations now, and that we're having them to try and figure out what exactly it is we mean by model ops or MLOps, or is it a function of DevOps, or what exactly is it? Because if you just think back a couple of years, all of our focus was on how do we build a model: how do we tune hyperparameters, what type of algorithm
should we choose, what type of model architecture or neural architecture should we be working with? But now the conversations we're having are: how do we use this effectively in production? How do we think about things like the serving infrastructure, scaling, process management, and all that kind of thing? To me, just the fact that we're having these conversations now shows that as an industry we're taking great strides forward in bringing ML and AI to the general populace and to the general application developer. Whether it's a branch of DevOps or whether it's something in and of itself is, I think, the kind of thing that we'll discover as we go along on this journey. So yeah, it's just great that we're doing this now.

Great. Yeah, glad to be here.

Well, Andrew, over to you to wrap up our introductions.

Well, listening to everyone's comments, I'm reminded of when I was once leading a speech recognition team. The machine learning engineers on the speech recognition team had a great result: very accurate speech recognition on the test set, even better than human-level performance. So they went to the business product owners and said, look at this great speech system we've built; I don't know, give me a raise, celebrate my accomplishments. And the business product owner said, well, we tested it and it doesn't work; this sucks; look at all these users' horrible mis-transcriptions. And then my machine learning team said, no, we did well on the test set, therefore logically it must work. And it was difficult to move the conversation forward.

I think over the last decade, thanks to the rise of deep learning and other things, we collectively have become much better than before at doing well on the holdout test set, which is fantastic; celebrate that. And I think that, for all of you watching this, if you want your fantastic machine learning work to make it into
valuable production systems as well, I think MLOps is this nascent field that we're all trying to invent collectively, and that I think will help everyone through the entire life cycle of a machine learning project: from scoping, to collecting and managing the data, to training the model and improving the data and improving the model, to deployment, monitoring, managing concept and data drift, and model maintenance. I think MLOps is an exciting, nascent discipline for solving that entire life cycle of a machine learning project. And one thing I'm excited about is that when I teach online, on DeepLearning.AI and Coursera, a lot of the things we teach are tried-and-true concepts that are widely agreed on. I think MLOps and machine learning in production is very much on the cutting edge, so I find that exciting.

Absolutely, great. Well, that gets us started off in the right direction, where we wanted to begin with this discussion of MLOps in the first place: there is debate. MLOps is short for machine learning operations, or, as Laurence mentioned, sometimes you could see it as model operations, though that seems to be something differentiated. Anyhow, MLOps is really a nascent field, so it seems worth talking a bit more about just what this field is, perhaps having some debate and disagreement over what it is, but also asking to what extent the current roles of data scientist or ML engineer involve MLOps. So we'll go around the room here, and Robert, since you kicked us off in the debate, I'll give you the first say on what it all is and what it all means.

Well, one thing I try to emphasize when I'm trying to explain this to people (I like the term production ML) is the fact that the world changes, and your model, when it's trained, is a snapshot in time of what the world was when you collected your training data. Depending on what field you're in, that
could be fine for maybe quite a while. But just like human beings, if you don't adapt to change, you don't do as well: things change around you and you're not prepared to deal with it. It's the same thing with a model. In some domains (markets are a good example), if you're trying to use a model to make any sort of market prediction, markets change within hours. So that means your model needs to learn to adapt to those changes, and that usually means collecting new labeled data and retraining your model. So understanding that, and the whole domain... I think I've lost everybody... the domain knowledge around that is incredibly important. Are we still on?

We're still on. I think we had a little blip there, but thanks, Robert. Rajat, let's hear from you.

Yeah, so I think that makes a lot of sense, what you said, Robert, around data not being static. What we're doing is not static, so the models, which are a representation of the data, do have to change as the data is changing, as the market or the world around us is changing. So that's partly a data-centric view. Let me provide a different, more software-centric view. In some sense, if you think about models, they're also just functions. Traditionally with software we've hard-coded those functions; what used to be heuristics or whatever is being replaced with predictive models today. Now, instead of writing that software and hard-coding it, saying this is what it is (and obviously that's going to be a lot harder to change or learn or fix), we're saying, okay, let's just learn that function from the data and use that instead. If you take that software-centric view a bit, then this is just a function in the overall program that you're building, and suddenly you start to see (not to mix it up with DevOps) that there are a lot of things we can learn from there. If you think of all the software
engineering principles that you've thought of that we just don't have in ML yet, I think there's room to really learn a lot from there and take that here and apply it. Of course, that's not the end of it. I don't think we can just apply those blindly, exactly for the reasons you gave, Robert, which is that this function is not static; it has to change over time. For certain kinds of things, where you have static images and you just want to identify flowers, yeah, we're not going to get too many new species of flowers in the next year or two. But on the other hand, in a business where you're relying on customer data or anything like that, things change all the time, and you want those models to be updated all the time.

Absolutely, thanks very much. Chip, I'd like to get your perspective on this.

Yeah, I think this is a very interesting discussion, and I agree a lot with what Robert and Rajat said. What I heard from both is that MLOps is peculiar because machine learning is not just code. You can't just finish your code, do a lot of testing in dev, then deploy it and expect DevOps to take care of it, since they can monitor system performance, uptime, and all the classic DevOps metrics. A machine learning model is part code, part data, and data changes in real time. So maybe we don't have a new type of flower to predict, but there are a lot of problems where you have new classes all the time. I was working with an e-commerce company, and they wanted models to categorize their products, and they have new products coming out all the time. It's changing not once every week or every few weeks; it can be a few times a day. So it's a very challenging problem, and you can't just hand off models to DevOps people and hope that
things turn out for the best. You want to monitor not just the system metrics; you also want to monitor how the model is doing, because every time the data environment changes, the model performance changes. So now, although people could be trained to monitor systems and data, we mentioned already things like concept drift, and DevOps people might not be equipped to deal with that. So now we have to bring in machine learning people to deal with things in production, and maybe machine learning people are not equipped to handle all the DevOps tools. So now we have multiple teams of different domain experts trying to work on the same workflow. How do we help them talk to each other and communicate effectively, when everyone has different lingo and different expertise? So MLOps is also about bringing people from different disciplines, with different ideas, to work together to solve a common problem.

Thanks, Chip. Yeah, so you're mentioning something that seems like it's core in people's minds when they're thinking about this. Maybe they were thinking about wanting to be a data scientist or a machine learning engineer, and then how does that tie in with this whole world of MLOps? Laurence, I'd like to get your perspective, and also on what it means for someone who's on a team and how they actually interact with the whole pipeline.

Yeah, so one of the things that I'd just like to highlight is something I think Rajat was saying. To me, one of the more exciting things about ML, and being able to use ML in production, is that you design your systems fundamentally differently than you would older production systems, which were entirely code based. In particular, being data driven gives you the opportunity to give you a
um, sorry, is my audio working okay? I just saw a message pop up.

It's a little rough, yeah, but we can hear you okay.

What I was just saying is that by having ML-based systems, what we can do is frequently update our models and frequently update our functionality in ways that would be much more difficult to do in traditional software release systems. With that ability to do frequent updates, we can serve our customers better, and that gives feature engineers and ML engineers different workflows for how they can, like I say, serve their customers better. We have one motto at Google, which is that if you focus on the user, all else will follow. Being able to do ML in production, and having an MLOps system behind that so that you can focus on your user, gives you new opportunities and new skill sets for your ML engineers and others to follow.

Absolutely, thanks very much, Laurence. And Andrew, I'm going to... I'm sorry. Sure, no problem. Andrew, I'd like to get your perspective on all this.

Yeah, great discussion. I want to add one additional observation to the perspective, which is the importance of engineering the data and the lack of tools for doing that systematically. For example, Chip mentioned working on e-commerce. One e-commerce problem I was chatting with friends about the other day was: how do you decide whether to label products as hazardous or dangerous? I have a two-year-old daughter, so she's inventing new ways to kill herself every day, which I'm not going to let her do. And it turns out that when you get people to label products that you buy on an e-commerce site with "is this hazardous to children?", there actually isn't universal agreement. It's really difficult to judge what is hazardous and what isn't. So one thing I see in machine learning is that previously we used to have people like us, or like any of you
watching this, hack around with the data set for six months or nine months, and maybe you would kind of figure out a way to label it, or maybe throw up your hands and say, you know what, I don't know what hazardous is; let's just get three or five people to label everything, take an average, and hope for the best. I think those processes maybe work if you have a giant labeling farm or something, but even that doesn't work that well. I feel like the nascent MLOps teams can help with data preparation as well, in terms of deciding what data you want to collect (do I need more products labeled in this category or that category?) and what the standards are for labeling the data. I think there is a big gap today in tools to engineer the data so that when you feed it to the code you get the performance you want, and I think that is important in addition to the training and model iteration, as well as the post-deployment monitoring and maintenance.

Thanks, Andrew. So now we have some of the data-centric perspective, the software-centric perspective, and some other ideas from everyone. I'd like to take a moment to think about what this actually looks like in practice. You've all worked at a variety of different places, and I can only imagine that this MLOps pipeline or MLOps infrastructure looks quite different at a place like Google than it would at some small startup. So first, I think it'd be really interesting for people to hear a little bit about what MLOps looks like today at Google. Let's start with you, Laurence.

Okay, hopefully you're hearing me okay now. So how would it look at Google? The easy answer is: it depends. In many ways, because MLOps is such an extensive thing, there are so many different options for designing an ML-based system that it really depends on the system you're actually building. So one of the things that
we want to look at, at Google, is to say: okay, we have this little box which is our ML code, but one of the things at Google is, how do we scale that? How do we make sure that we can focus on billions of users instead of thousands of users? A lot of the serving infrastructure that has to be in place for scaling calls for a similar type of skills to those traditional ops folks have. Then it comes to monitoring. Of course it's very important for us to make sure that our systems are up, that we have whatever it is, five digits, six digits of uptime. So we want to be able to build a decent monitoring infrastructure to make sure that our models are running inference at the required parameters and at the required speed. But because there are so many of these different dependencies (serving infrastructure, monitoring, scalability, process management, machine resource management, data verification), and so many people and so many moving parts to keep working together, we want to make sure that we have flexibility between these, standards-based interfaces between these, and open systems as much as possible between these. We generally design our infrastructure that way, and out of all of that came TFX. A lot of the things that we've learned about building systems for MLOps we've really funneled into TFX. I know Robert's the real expert in that, so maybe I'll hand it over to him here.

Well, thanks, Laurence. I'm thinking about the small companies that I've worked in and how to contrast that against Google. Part of the situation at Google is historical, because we were doing large-scale ML really before tools were available in the outside world. So we invented a lot of things, like TensorFlow and, before that, DistBelief, in order to
accomplish the goals that the business had. There are still a lot of very powerful but very bespoke tools used in our MLOps infrastructure within Google, and gradually we're taking those and moving them into open source. Kubernetes, for example, really came out of the framework that was originally developed inside Google to manage containerized applications, and, as Laurence said, both TensorFlow and TFX came out of work that was done because of an internal need. I see this a lot at different companies: because they needed something and couldn't pull something off the shelf, they invented it, and it's caused a lot of problems for a lot of people because it's now something they have to maintain. So they're really anxious to adopt industry-wide, available things that are supported by communities, things like TensorFlow and TFX. That helps them with their burden, and it moves things forward as a community and not just for their little team. I don't know, did I answer the question? I was kind of rambling there.

Yeah, I think that gives some interesting perspective: people are inventing tools to solve their own problems because oftentimes the tools don't all exist. Rajat, you're at a small stealth startup (I'm assuming it's small). Are you inventing your own tools? What does it look like there?

So, yes and no. Having seen the Google side of things, and seen the craziness that ensues when you try to build everything from scratch, no, I don't want to build things from scratch where possible. That said, having been really involved with TensorFlow and TensorFlow Extended, and thinking about where it's going and how it helps people, I clearly want to leverage what's out there, and there's a lot of value in being able to use these standardized tools where we can. So
as we grow, I think we'll use more of those. Where we're starting out, what I'm a firm believer in is: really get that end-to-end thing working, and then try to improve on it. So what does that mean? For us, the first thing was, okay, let's prove it out; let's build the model. That's what you need to do before you can build a pipeline around it. Of course, even to build a model we had to do a bunch of data processing; it wasn't like you could get to the model without doing the data processing and getting things in order. But to start with, each of those was more bespoke. The data processing was a custom, handwritten thing, because we had a small data set that we could just process, and that was fine. Then we had to start scaling it out and running it across multiple machines. Now, do we scale it out using, say, Kubernetes, which came up here? Do we use Spark? Those are the kinds of things that come up. And then, once you have the model and want to deploy it, how do you start to put those together? As we pick each of these pieces, of course, for each piece individually we want to use as much of the standard stuff as we can. In the long run, I think being able to use something like TensorFlow Extended is the right way for us to go, connected with the right kind of orchestration system, perhaps Airflow or whatever, so we can run all of these and make all of that work. But to start with, for a startup, I don't think it's necessarily the right thing to pick all the pieces together and start with the biggest thing. You want to pick and choose, and add things as you go along, but try not to build them from scratch. There is a lot of stuff out there today that we can leverage, and at our end we are definitely doing a lot of that.

Thanks. Go ahead, Andrew.

Yeah, I want to say something that may simultaneously annoy both Robert and Laurence, so we'll see. Robert mentioned
DistBelief and then TFX. Frankly, I remember way back working with Rajat and Jeff Dean and others on DistBelief, and candidly, I think we made a ton of terrible decisions. You can blame me for all the bad decisions made back then, and give Rajat and Jeff the credit for all the good decisions. I mean, it was too CPU-centric, with these giant, crazy C++ linear algebra and matrix multiplication things we were trying to implement on CPUs. There are lots of bad ideas you can blame me for. And then later Rajat led the development of TensorFlow, and that next version was much better. Even though TFX is the shiny thing today (sorry, DistBelief was a precursor to TensorFlow), part of me wonders whether even TFX today is like the DistBelief of deep learning 10 years ago: it is the best thing out there, but there's something even much better to be invented.

Late last night I was starting a Google Doc where I was writing down, frankly, a bunch of research ideas to share with one of my friends on new ways to engineer the data, because literally yesterday afternoon I was struggling while looking at some labeled image data. The labels were junky, and there were about 5,000 images, and it was a real hassle to browse, figure out what was going on, and relabel the data. I was annoyed, so last night I was thinking, okay, maybe you can develop a learning algorithm to make this much better. So I wrote up a Google Doc with three ideas to share with one of my friends; he hasn't gotten back to me yet. So I feel like there's so much stuff to be invented. Again, ten years ago, when the community collectively shifted to deep learning, I did not understand at that time how many tens of thousands of novel inventions and research
papers and so on would be needed. And then there came DistBelief and TensorFlow and other frameworks that laid the foundation. Today, as we think about MLOps and data-centric AI, I think there are easily some tens of thousands of ideas to be invented, and new frameworks. And actually, I know Chip's been busy inventing these as well, along with some mutual friends like Chris Ré and Alex Ratner. There are a lot of people just nibbling at this. I think it'll be a big movement.

Yeah, I think that's a great segue, actually. Chip, I wanted to hear more from you. Before Stanford you were at Snorkel, which is of course a tool for some of this MLOps data pipeline work, and might be part of the collection of tools that people are using at their companies. You've also written a post that captures what all the tools out there are and what some of the trends are. I'd really love to get your perspective on some of these things: are the tools really out there now, or is much of it yet to be invented? Where do you think things are?

So before we get there, I just want to take a step back. I think the requirement for tools right now actually depends a lot on the company size, the use case, and also the maturity. I tend to think of MLOps adoption along two axes. One is the size of the company, from very small, agile startups to large companies, and I think adoption looks something like this: very big companies like Google, Netflix, and Alibaba have incredible infrastructure engineers, they build incredible tools, they move very fast, and they have state-of-the-art MLOps tools. Then you have smaller startups who are very agile and eager to adopt new tools. But there are a lot of companies in the middle who are very interested but are bogged
down by either the lack of really good engineers, because really good engineers like to join Google and so on, or by legacy systems. A lot of companies want to switch to real-time machine learning, but when the system is set up to do batch jobs, they can't easily switch to streaming. So legacy systems are one thing. The other axis is adoption maturity: the longer a company has been adopting machine learning, the more effective and sophisticated it is. In the beginning, you just want to go from no machine learning to machine learning, so you just want some tools to help you get there. But once you have a couple of things in production, you want to make the most out of them, so you worry about retraining, monitoring, and squeezing out the last drop of performance. These companies require different tools. Sorry, I'm totally off track here, and I'm taking all the time. But yes, the trends depend on that. Because of such a wide range of requirements, it's really hard to have a standard pipeline that works for every company. But I do hope that in the next five years or so, when everyone has reached the same maturity level, we can have a more standardized pipeline.

Yeah, I wanted to follow up on that, also because you're teaching this at Stanford, so you're telling students how to think about all this. In the article you also point out that while modeling and achieving the best accuracy with your model was the bread and butter of research in academia some years ago, it's evolved to more interest within research
regarding MLOps or ML infrastructure. Could you say more about that, and what do you tell your students?

So actually, when you're teaching the course, it's very hard to come up with what to teach students. It's not that there aren't things to teach. It's that currently a lot of MLOps is very tool-focused. People have a pipeline and say, oh yeah, you use this tool for modeling, this tool for monitoring, this tool for bringing things to production. But the problem with tools is that they evolve over time. You have new packages coming out all the time and new versions coming out all the time. If you teach students tools, their knowledge is likely to be outdated as soon as the tools update. I learned that lesson because the first course I taught was with TensorFlow, and the first version was in TensorFlow 1.0, which was graph-based. And then, Rajat, I blame you, you switched to 2.0 and eager execution, and I was like, oh my god, I need to update all of my materials. It's just so much work. So I wanted to teach more about best practices, at a more philosophical level. But then students say, you can't really learn how to do MLOps by just looking at philosophy, at best-practice principles. So it's very hard to strike the right balance, and of course it's still evolving. I think it might also be a challenge for students or engineers trying to get into MLOps or ML in production: you don't want to just learn about tools, but you can't just learn at a high level either. So I'm very curious to see the courses in the MLOps specialization by DeepLearning.AI, to see how you struck the right balance.

Yeah, I think that it was definitely something that
took some evolution along the way. But you also started us thinking: apart from courses, apart from theory, what is it that people should be doing if they're trying to prepare for this field? I think a lot of people watching today would like to prepare themselves to be a good candidate for a role in the MLOps world. Rajat, could you extend that?

Yeah, so I think Chip had some really great points there on how to think about it. One of the things she said was that it's hard to teach and learn at a philosophical level, saying, okay, this is how things should be in general. That's great, but how do I apply it? How do I think about it in practice? Coming from a more software-centric view, my perspective has always been: if we believe that this is the right way to do, let's say, data labeling, or this is the right way to build models, or this is the right way to deploy models, can we build software to simplify that? That's where the tools come in. That's where TensorFlow came in, and that's where TensorFlow Extended came in, and all of that. How do we codify that so it's easier for the next person, so they don't just read a set of principles, "here are three things to read and do," but actually have software to do those? And yes, none of this is static. Things change over time. TensorFlow went from version one, which had certain things, to version two, which evolved and learned from that and applied a whole bunch of new things as well. I'm sure at some point we'll see TensorFlow 3 or something new as well, and we should continue to iterate and live with that. At the same time, for people to be able to leverage this and do more ML, I think they need good, solid tools. Comparing again to software, where we've evolved over the last 40 years, clearly
ML hasn't been around that long. ML as a field has been around that long, but not as something deployed in products. So there's a long way for us to go to really catch up to a lot of those practices and learn from them.

Absolutely. And so Robert, you're the instructor for much of the Machine Learning Engineering for Production (MLOps) specialization, the third course of which we're putting out today. Laurence, you're also one of our instructors. You two have some perspectives as well on what else people taking these courses should be doing. Laurence, what are your thoughts? It looked like you were about to say something.

Yeah, sure. Hopefully you can see and hear me now. I followed some good ops principles and had a backup ready. I think one of the things that's really interesting, going back to what Andrew, Chip, and Rajat said, is that something like TFX is at version 1.0 now, and we're dealing with 1.0 stuff. If we think back to other technologies, does anybody remember Windows 1.0 or Java 1.0? Look how far they came. We can look back and laugh at how naive they seemed back then, but we needed to get through that to be where we are today. I think in the MLOps space, the ModelOps space, we're at 1.0 with a lot of these products. So to understand what it is that we need to learn, we also have to understand that many of the concepts we're learning now we may be throwing away in 18 months or two years, as we rapidly iterate towards version 2.0, 3.0, and 4.0 of MLOps systems. To me, that's why I'm particularly excited that DeepLearning.AI and Google are working together to create courses like this, and that Chip is doing what she's doing at Stanford, because we need to get a
critical mass of people with these skills in place, so that we can learn from our mistakes, learn what's working and what's not working, and see the opportunities that startups and other people in the industry are going to be able to take and run with to build an improved system for everybody. So I think it's a particularly exciting time to be working on this. Like I said, for those of us who were in the industry when Java 1.0 or Windows 1.0 came out, look how much we've grown. I would love, five or ten years from now, to be looking back at MLOps and ModelOps and saying, look how much we've grown.

Yeah, I think it's not like any of us regretted learning deep learning or TensorFlow 1 when it was immature. I think people should learn TFX, and then it'll get better, and we'll all evolve with the tools.

Yeah, the more people using it, the more requirements are going to be driven, and the better we'll be able to build whatever product, not just TFX but anything.

Yeah, I very much agree with that. One of the differences for TFX is that it's been used so heavily inside Google for so long that it is a lot more bulletproof and well thought out than something coming from a different company that is just brand new. So there's that going on. But to get back to the original question, what should people be learning if they want to be in MLOps? It's a difficult skill set and a rare skill set. You really need to understand both the ML side and the software engineering side to be really effective, and those people are hard to find. I can tell you, I've tried to hire, and usually you have one or the other. You have people, especially coming from a mathematical or statistics background, who have a very strong theoretical
understanding of ML and maybe know something like R or Python well enough to put models together, but in a production setting it's very difficult for them to create production-level code and systems. So it's tough. You kind of need both sides of things, and it's a lot.

And I feel like there are no jobs, or very few jobs, called "MLOps" right now. You won't see the term in a job description. But don't let that fool you. Lots of companies are trying to hire people who know how to build and deploy machine learning systems. So a job interview will ask how you'd deploy a machine learning system, and that's basically an MLOps question, even though the word MLOps doesn't appear in the job description. So I think this is an important, very useful skill for people to be learning today.

I know Chip just open-sourced her book on ML interviews. I'd love to know: have you had any experience, or have you seen anything while researching that book, around the type of skills people are looking for in MLOps or ModelOps?

So I think it really depends on the company size and where they are in their adoption maturity. What I'm trying to get my students to practice is: don't focus too much on techniques, and don't just go after the latest buzzwords. Try to be problem-focused. What's the problem you're trying to solve? Then find whatever you can to solve that problem. Instead of looking at the latest terms or tools in MLOps, why not just do a project? Try to deploy a simple model on a phone, and in the process of doing that you'll suddenly realize there are so many problems, and also so many tools that a lot of my students can't appreciate until they have run into the problems that require those tools. So
so yeah, I keep encouraging my students, and people who read my book on interviews, to do a lot of projects and get involved. A personal project is going to be very small and might not be at the scale required by a lot of companies, but that's where you have to start. Then try to get more involved, and get internships with companies that will allow you to try it out at a different scale.

You know, I think what Chip alluded to a couple of times is actually a very interesting point. I find that MLOps practices are quite different based on the scale of the dataset and maybe company size. I remember I once built a face detection and recognition system with over 300 million images, which used a certain set of processes, whereas at Landing AI, which is working on an MLOps platform for computer vision, we often work with hundreds of images, so the tools and techniques are very different. The other gulf is structured versus unstructured data. With unstructured data like images and audio, humans can examine it and label it, whereas for tabular or structured data, human judgment is harder to use. It's hard to look at a list of transactions and figure out what the person was really intending when they did whatever. So I find that these gaps make for big differences in practices. These days, because of my work at Landing AI, I spend a lot of time thinking about how to innovate in computer vision and how to help all of us collectively build and deploy computer vision systems 10 times faster. But I think there's actually a huge space and a very large family of techniques still to be invented.

All right, thanks everyone. I want to make sure that we spend a little time getting to questions that have come in from the audience. So what
we'll do now is shift to answering individual questions. I'll ask one person to give a perspective and a brief answer, and then we'll move on to the next one, so that we can get to some number of questions. First off, we want to start with a couple of questions from learners who are in the current set of courses from DeepLearning.AI, the Machine Learning Engineering for Production (MLOps) specialization. The first question is from Pablo Drummond, and it's about experiment tracking. He says: "I understood that the MLOps pipeline needs to be data-centric, not model-centric. But when it comes to experiment tracking, like models that might not have been deployed, do we really need an application or framework? Don't you think that we need to address this requirement as a part of the MLOps infrastructure?" Does somebody want to take that one?

I can try a little bit, with the bit of experience that I've had with these things. Primarily in my case it was with building models for use on mobile, to deploy them to an Android or iOS device. We have an infrastructure at Google, a product (sorry, I hope this doesn't sound like a product pitch) called Firebase. One of the really nice things we were able to do with Firebase was deploy multiple models based on different datasets or different neural architectures, and then have an A/B testing infrastructure that would say, hey, this cohort gets this model, this cohort gets that model, this cohort gets this one, and then tie that to analytics so that we could see the model performance across the different cohorts: how they did with the model, did they get better inferences, did they get faster inferences, and those kinds of things. My only experience in that space is really doing it like that, and I found it was a really powerful way of getting feedback from multiple users so that we could
fine-tune models: the neural architecture used in the model, the datasets used in the model, and those kinds of things. So as well as data-centric MLOps, we could also have model-centric MLOps like that, with a smart way of deploying models to different users. Sorry, Andrew, I think you were going to say something.

Yeah, thanks so much, and thank you, Pablo, for the question. Some of us have had this experience where we run the experiment, it finally does well on the test set, and then we talk to a colleague who had worked on it and say, thanks for sending me that dataset, where should I get this dataset from? He goes, oh yes, it's on my laptop. And then he says, oh, but I built that dataset from one my other friend had emailed me, and it's on their laptop. And where's their laptop? Oh shoot, the laptop got stolen, or they accidentally deleted the file, and so this great result is never replicable again. I remember when I used to do experiment tracking, I would use Vim to edit a text file on my desktop, so that was local to me. Then we upgraded to doing experiment tracking by filling in rows of a Google spreadsheet, which could at least be shared. But then I'd forget some detail, forget to fill in the learning rate in some column of the spreadsheet, and it became non-replicable again, or we'd end up going back to redo the hyperparameters to figure out what exactly we did to get this result. So I feel like in principle experiment tracking can be done in a manual way, but I do think that tools for experiment tracking, which keep track of the model as well as the data and the data lineage and provenance (how on earth did all the data go through all this to result in the model?), make it more possible to figure out how we got the model and also how to improve it.

Thanks,
Andrew. We have another question from someone. This is fr
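
The experiment-tracking answer above contrasts a local text file and a shared spreadsheet with tooling that records the model settings together with the data and its lineage. As a rough illustration only, here is a minimal sketch of that idea in plain Python. Everything in it is an assumption made for the example: the function names (`dataset_fingerprint`, `log_run`, `load_runs`), the JSON Lines log file, and the `derived_from` field are hypothetical and are not taken from the panel or from any specific tracking tool.

```python
import hashlib
import json
import time
from pathlib import Path


def dataset_fingerprint(data_path):
    """Return a SHA-256 hex digest of the dataset file's bytes."""
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, params, metrics, data_path, derived_from=None):
    """Append one experiment record to a JSON Lines file and return it.

    `derived_from` is a hypothetical free-text lineage note, for example
    the path or hash of the dataset this one was built from.
    """
    record = {
        "timestamp": time.time(),
        "params": params,            # e.g. {"learning_rate": 0.01}
        "metrics": metrics,          # e.g. {"test_accuracy": 0.93}
        "data_path": str(data_path),
        "data_sha256": dataset_fingerprint(data_path),
        "derived_from": derived_from,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(log_path):
    """Read every logged run back as a list of dicts (empty if no log yet)."""
    path = Path(log_path)
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Hashing the dataset file means a later run can at least detect that the data changed, even if the file was copied or renamed. A real tracking system would also need to store the data itself somewhere durable, which this sketch does not attempt.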

Original Description

Welcome to our event celebrating the launch of Machine Learning Engineering for Production (MLOps) Specialization featuring AI leaders in MLOps.

Topics we plan to cover:
- To what extent does the role of Data Scientist or MLE involve MLOps?
- How is MLOps actually implemented in an industry setting? Is there some kind of a framework people use?
- Is MLOps suitable for early-stage startups or only teams with enough resources as the big tech companies do?
- The latest trends on MLOps and how will the future of it evolve.
- What do you see as the biggest challenges for MLOps adoption?
- Apart from taking courses, what are some of the other resources or activities you might recommend to learners interested in gaining practical experience with MLOps?

Speakers:
- Andrew Ng, Founder, DeepLearning.AI
- Robert Crowe, TensorFlow Developer Engineer, Google
- Laurence Moroney, AI Advocate, Google
- Chip Huyen, Adjunct Lecturer, Stanford University
- Rajat Monga, co-founder, Stealth Startup; Former lead of TensorFlow, Google
- Event moderator: Ryan Keenan, Director of Product, DeepLearning.AI

Let us know what you think of the event by filling out a quick survey here: https://bit.ly/3janNgZ
To learn more about DeepLearning.AI and sign up for future events: https://www.deeplearning.ai/events/
To sign up for Machine Learning Engineering for Production (MLOps): https://bit.ly/3j1DEhB



This video teaches the basics of Machine Learning Engineering for Production (MLOps), including its role in industry settings, implementation, and required skills, with a focus on making machine learning systems production-ready and adapting models to changing data environments. The video covers various tools and techniques used in MLOps, such as TensorFlow, Kubernetes, and Firebase.

Key Takeaways

Collect new labeled data
Retrain models
Apply software engineering principles to MLOps
Update models regularly
Learn from DevOps principles
Build model
Run data processing on multiple machines
Deploy model using orchestration systems
Fine-tune models using feedback from multiple users

💡 MLOps is a distinct discipline that encompasses the entire life cycle of machine learning projects, requiring a balance between best practices and tool knowledge.
