Fireside Chat #6: Operationalizing ML -- Patterns and Pain Points from MLOps Practitioners
Key Takeaways
The video discusses operationalizing machine learning, covering patterns and pain points from MLOps practitioners, with a focus on the production ML lifecycle, key properties of ML workflow and infrastructure, and strategies for sustaining model performance. Tools and techniques mentioned include ML engineering, MLOps, and data-centric AI.
Full Transcript
Hugo Bowne-Anderson here from Outerbounds, incredibly excited to be here today for our sixth fireside chat, with Shreya Shankar from UC Berkeley, about operationalizing machine learning, talking specifically about patterns and pain points from MLOps practitioners, based on an incredibly exciting paper Shreya and colleagues have published recently. We're going to take a couple of minutes to let everyone in. While other people are joining, I would love it if you introduced yourself in the chat on YouTube: let us know who you are, where you work, what you're interested in, where you're calling in from, and why operationalizing or productionizing machine learning is important to you. That would be super cool, and we can tailor some of the conversation to your specific roles and verticals as well. So we'll take maybe one or two minutes to let more people in and then we'll get started.

Hugo Bowne-Anderson here from Outerbounds, just welcoming you all to our October fireside chat with Shreya Shankar from UC Berkeley, talking about patterns and pain points from MLOps practitioners (try to say that 10 times in a row really fast). I'm very excited to be here today to speak with Shreya, particularly to talk about ideas in her and her colleagues' recent paper, which is a qualitative interview study of what actually happens on the ground in productionizing and operationalizing machine learning. We'll get started in a couple of minutes, but if you wouldn't mind introducing yourself in the chat, that would be fantastic. Let us know what your interest in machine learning and MLOps is, where you work, and where you're calling in from. I'm dialing in from Sydney, Australia; it's a beautiful early spring day here, and Shreya, who's about to join us, is in Berkeley. So I'm interested in where you all are, but we'll get started in a minute or two,
so if you'd like to introduce yourself in the chat, that would be great.

All right everybody, we're going to get started very soon. Hugo Bowne-Anderson here from Outerbounds. Today I'm here with Shreya Shankar to talk about patterns and pain points from MLOps practitioners. Before we get started, I thought I'd let you know that, if you're interested, you can sign up for our next fireside chat, which is in a month: how to build an enterprise ML platform from scratch (bitly, ml hyphen INF). I'm really excited; I actually spoke with Russell the other day. Russell is at realtor.com; he's worked as a data scientist, software engineer, ML engineer, platform engineer, and machine learning team lead, and he also built enterprise machine learning platforms from scratch at Op City and then at realtor.com. So we'll really be getting into the nuts and bolts of what that looks like from the platform engineering side, and then an overview of how these teams work together, which I'm incredibly excited for. So if that looks of interest (it's definitely of interest to me, clearly), it would be great if you could sign up.

Oh, we've got Even Oldridge here, who just said hi in the chat. Hi, Even! Even works for NVIDIA, building open source software for building, deploying, and maintaining recommenders. Very exciting to have you here, Even. If other people want to introduce yourselves in the chat, that would be fantastic. But without further ado, let's get started. Shreya, I think people have heard enough from me, and people are here to hear all of your thoughts, so if you wouldn't mind sharing your video, I'll do the same. Hi, Shreya!

Cool. Hey, thanks for having me. I'm excited to be here.

Such a pleasure to have you here. I thought maybe I'd give a bit more context: I'm going to introduce you and then we'll get started, and feel free to correct anything I get incorrect. You're a computer scientist currently doing your PhD in databases at UC
Berkeley, but before that you didn't go straight to grad school: you were in industry as the first ML engineer at Viaduct, and you've done research at Google Brain and software engineering at Facebook as well, so you have a breadth of experience. I suppose, to set the scene and find out a bit more about you: why did you go back to grad school after all of this industry experience?

I felt like I got into machine learning with the promise of training models and serving predictions to end users who could benefit from them, but what ended up happening in my job time and time again as a machine learning engineer is that I was building systems and doing engineering. I was monitoring data quality; I was reasoning about how we could keep models up to date, how we could make sure we can react to bugs with little downtime, or even find them. There's a whole plethora of problems in that space that weren't necessarily the problems I wanted to work on in the first place. So I decided to go back to grad school just to study them: can we solve them? I would love to be able to go and be an ML engineer and focus on the cool parts of the machine learning life cycle. That's what I'm studying and how I'm trying to get there.

Fantastic. I think that provides nice context for what we're here to talk about. I'd like to know more about the motivation for why you wrote this paper in particular, but I'll set the scene by mentioning a couple of sentences from your abstract (apologies for reading your words to you, but they're for everyone else as well). We (being you) conducted semi-structured ethnographic interviews with 18 machine learning engineers working across many applications, including chatbots, autonomous vehicles, and finance. A lot of the project, I think, involved pattern recognition for what's actually happening on the ground in the space, thinking through the process of
operationalizing ML and what this actually involves, and recognizing patterns, pain points, and the variables that govern success, which we'll get to. But why this, and why now?

Great questions. The actual story of what happened was that a professor from a different school asked us to write some sort of repository of awesome MLOps resources, and we were like, okay, that's a little bit weird, because we're academics and it doesn't make sense to write this GitHub repo, but let's do it anyway. So we were writing our draft and we thought it would be nice to have something to back up why we should care about provenance, or why we should care about debugging speed. These were gut feelings that we had, but we wanted to back them up with actual research, so we decided to do an interview study. Going into it, I didn't really know how challenging it would be. My advisors warned me, "Hey, interview studies take a year plus." I'd come from industry, where, when we do user studies, we just talk to a few people, send out a Google Form, and get some results in a month. But the academic way of doing things is incredibly rigorous; I did not know this. You go through multiple rounds of finessing the questions, you go through and annotate all of your transcripts in fine detail, you make all these word clouds, you connect all of your annotations together, and you have to write a paper that goes in for peer review. Had I known all of this earlier on, maybe I wouldn't have done it, but I really wanted to do it. I really wanted to get some statistics or some anecdotal evidence to back up what I was saying. And along the way, when I was doing this correctly, I was super surprised to find out some of the pain points that I didn't know of, some of the myths that I had going into the study, and some other things I learned from it; all sorts of things. So all in
all, I'm super happy that I did it. Would I do another one right now? Probably not. But separately, it's a super exciting time to be doing such an interview study, because I think a lot of people are trying to build MLOps tools and are looking for some guidance on what to build, how to build it, and how to reach their customers, and they don't have the resources that we do at an academic institution to rigorously identify pain points across different company sizes and verticals.

That's amazing, and that's a lot of the value I get from it. Working for a company and on an open source project, which is a set of tools, I find value there. I do think there's a huge amount of value for practitioners as well, to recognize what types of practices actually occur. And speaking of the academic side: you may recall, and people who are viewing may know, that I used to work in academic research in biophysics, cell biology, and systems biology, so moving to industry was the opposite. I actually remember that in one of my first meetings at the first company I worked for, somebody was like, "Look at this exponential growth," and I was like, "Is that actually exponential? Could you use a log axis?" and they were like, "What are you talking about, man?" Something else we'll get to, which I think is fascinating about the paper, is how we're in such early, nascent stages, and so much of what people do is currently under-scientific for a discipline which is attempting to do science in industry. So many of the practices are nascent and evolving, so I think that'll be a really nice thing to hit on.

I've got someone asking, "Is there a link to Shreya's paper?" That's a great question. The answer is yes, there is a link, and I'm actually just going to find it now in order to paste it in the chat. While I'm doing that, Shreya, I'd be
interested to know: maybe you could give us a rundown of the type of people you interviewed for this study?

Yeah. We interviewed people who identified with the job title "machine learning engineer," and the requirement was that they had put a model in production, where "in production" means somebody's pager goes off if there's a bug. So predictions are getting served to end users, and these end users are presumably using them for something, otherwise the alerts wouldn't go off, and the ML engineers are sitting in that loop, in that process of shipping models for that task. That's the specification we had. We sourced people from different-sized companies (small, medium, large) and also different sectors: autonomous vehicles, finance, banking, recommender systems at social media companies. Honestly, I don't remember them all, but it's in one of the tables in the paper. We really wanted to cover a breadth of applications as well as company sizes. The sampling bias here is that everybody identified with the engineer title, so they had put something in production and had some engineering expertise. I think some different results might come from interviewing data scientists, data analysts, and people who might not be sitting as close to the engineering side of things.

Yeah, that's fantastic. I want to point out (and this isn't to be a contrarian or to play devil's advocate, but because it's actually a very confusing term) that this is one very useful definition of what it means to productionize machine learning models: serving an endpoint and these types of things. Whereas you could imagine there are broader definitions: let's say you use an ML model to produce a deck that informs a pretty serious, executive-level, multi-million-dollar business decision. That's operationalized in some sense, but it isn't what we're talking about here.

Exactly. I think the key here is that these predictions are
sustained; they're not a one-and-done.

Yeah, fantastic. I'll also mention that people are saying things in the YouTube chat, which is great. If you have any questions that come up, please do ask them and we'll try to get to them. We're also having an async AMA over the next week or so, so I'm going to post a link to our community Slack. If you want to join that, we'll be having an async AMA in the AMA guests channel, and any questions we don't get to today we'll try to get to there. But definitely ask questions in the chat. So I'm interested in what you discovered, at a basic level of pattern recognition, about what the main tasks people do in the production machine learning life cycle are. What did you find out?

I think a good answer to this is what the textbook says and then what we found. The textbook will tell you that for machine learning, step one is you collect data, step two is you train a model, step three is you evaluate that model on a holdout data set to make sure there's no overfitting, and step four is you deploy. What we found is that we can still categorize the work into four steps. Maybe the data collection part is similar, except that it's more of a loop: every week or so, we want to collect new data and make sure there's some QA on that data. But the last three steps are totally different. The second step, which I called model training before, is actually experimentation in general, whether it be training new models, trying to source new data, or adding new features. There are a lot of ways you can think about improving a model, and a lot of the participants actually preferred to look into finding new data that gave new signal, or making features fresher instead of the stale features they had before. So that's step two. Stage three in the process is what we call evaluation and deployment. So evaluation
is not a one-and-done thing. What happens is that evaluation is done maybe on a holdout data set at first, and then the model is deployed to a small fraction of users, and when the model shows a little promise there, it's increasingly deployed to more and more users as we learn more about what it can do, what it can't do, what failure modes exist, and how we go and patch problems, until we've gotten to the whole population. So the key takeaway is that evaluation is not a one-time thing: it is a loop of evaluation and deployment, a multi-stage deployment. And then step four, we found, was this overall monitoring and response stage: when you do have these models in production, what is their live performance? If you see the performance dropping, what are the bugs, where are the bugs, and how do we respond to them quickly, whether that be actually trying to do root cause analysis or simply retraining the model? There is a stage around making sure that there's little downtime for these services. So we have those four stages shown in the first figure in the paper, and it was interesting to several of us authors that they don't match the textbook. I think that's a narrative that we want people to take away.

Absolutely. And let me ask: are these different steps (I mean, there's an iterative loop happening there) coupled in some way? Because you could imagine monitoring and validation aren't always separable, right?

Yeah, totally. I'll say that monitoring is done across all of the stages: people monitor their training jobs, and people monitor data collection. A lot of the time there are human-in-the-loop processes to collect and label data, to verify some quality, and, whenever you see a failure on the ground, to go and collect examples that look like that failure so you can go back and augment your validation sets. So in that sense there's monitoring all over the place. But the one stage
that we found was super iterative in itself was evaluation and deployment. Evaluation data sets never stay the same; they're always changing, they're always growing, especially in tasks or domains where failures have such a high cost. Autonomous vehicles are a great example where a failure has a really high cost: when we observe one, we need to make sure that we have no more failures like that. So how do we invest effort into our holdout data sets so that whenever you evaluate new models in the future, they're also robust to the problem?

Yeah. And I suppose, drilling down a bit more into data collection (and I think this is one thing you're getting at): in textbook data validation you're given a holdout set, as opposed to actively getting that data, validating it, making sure it's the right data, and then using it as a validation set. There are feedback delays there, all of these things.

Totally, totally. I feel like I was never taught this in a machine learning class; nobody taught me how to actually evaluate a model. It's not like I need a big checklist. I just want to know, if I were to be a machine learning engineer for a month, not just one time, what do I do about my validation set? Do I keep it the same? Do I grow it? When do I add to it? What do I add to it? I think these in themselves are super interesting problems for people to consider.

Yeah, very much so. Something I'm really excited about is that you've identified three key properties of ML workflows and infrastructure that dictate how successful deployments are. What are they, and why?

So we have these three Vs. They're not related to the big data three Vs; we didn't even know about those. Joe was telling Rolando and me, "Hey, there are the big data three Vs. Are you sure you want to make three Vs?" and we were like, "Oh shoot, we shouldn't make three Vs." But on second thought, we were like, I think it's cool if we say that we came up with these three Vs
independently, because it highlights the synergies between machine learning and traditional data. These three Vs are velocity, validating early, and versioning. Why did we come up with these three Vs? We wanted some way to explain best practices and pain points. We were looking for patterns in what our interviewees said, and we asked such open-ended questions (they're in the appendix), like "Tell me about a bug you had last week, something that caused you ..." Things like that are so open-ended; they're so hard to extract patterns from. So it helped us to come up with these variables. Take velocity: people kept mentioning that they needed to iterate quickly on experiments because they had a large frontier of ideas to try, and they wanted to see something that would give a production lift. So we wrote down "velocity, velocity, velocity," and when we were looking through our codes at the end of it, we were like, oh, when people are doing experimentation, they care about velocity. It really resonated with us when we started thinking about MLOps tools. What makes an MLOps tool successful? Experiment tracking is a nice space because it really 10xes your experimentation velocity: now I don't have to copy and paste into Google Sheets and back. Maybe that works if I'm the only person working on my model, but the moment multiple people are working on an ML pipeline or model system, it's super nice to centralize all of the experimentation we do so we can share the knowledge we've gained. So we had velocity for that.

For validating early: a lot of people complained that at their organization too many bad models made it to production, so validation was happening too late, or that models were validated way too early and they couldn't get anything to production. For example, in an autonomous vehicle company, the cost of deploying a bad model is so high, so
they incorporated all these checks; they made evaluation take much longer, they decreased the velocity, and engineers were grumpy. But at the end of the day, I think there's a quote in the paper that says we'd much rather gate the velocity if it means that we don't get failures on the road. So again, different tasks have different priorities, and I think that's also why people keep talking about how machine learning is not even generalizable, how it's so different for different tasks. When you think about it through the lens of these Vs, it makes total sense: different tasks just have different priorities. Some people prioritize velocity over validation, if it's a recsys problem or something where the stakes aren't so bad if there's a failure, for sure. So in that sense, we really liked this framework for evaluating tools and evaluating what people cared about. And as people who like to build tools ourselves, there are some tool ideas I've had where now I can confidently say, oh, this is really not a 10x improvement in people's workflows: it doesn't really help their velocity, it doesn't validate better, and it doesn't help people manage any more versions, so why bother? I really like that way of thinking about it.

Fantastic. It's so great that you've identified these three key concepts. I want to get into the trade-offs, tensions, and synergies between them, but before that: velocity and validation, I think we have a strong sense of what those mean now. I think practitioners probably know enough about versioning, but I think a lot of people who haven't delved deep into this know about versioning code but aren't sure about versioning models and versioning data and these types of things. So what are we actually talking about when we're talking about versioning?

So versioning is across the entire workflow, and the
way that I like to think about it is: when you have higher velocity, when you are iterating through different ideas quickly, you now have different versions of things to manage, to keep track of in your head, and to keep track of in the system. Whenever you are deploying models, every time you retrain a model, that's a new version. Every time you get a bug report and decide to implement a new rule to prevent future bugs, you are effectively maintaining a separate version of that model. A great example: people love to use off-the-shelf language models, but when they're incorporated into products, like a customer support chatbot, sometimes these language models will hallucinate things that are obviously wrong. There's this great anecdote where some end user asks, "What time is the store open?" and the language model will hallucinate some time; it will not be the right time, but it will say some time. So what you can simply do is put in a rule that says: if there's a time in the output, don't return it, or go return the actual time, or something like that. This is great in that it prevents future bugs, but what does it do? It adds a new version of your model for a particular input. Maintaining this versioning bloat is hard to do, especially as versions only grow over time. People are not removing things from production; nobody is saying, "Let me delete this rule that I used to have before," and nobody wants to touch anything that works. So how do you maintain this over time?

Fascinating. I think one other question is: when we classically think about versioning code, it's human readable, right? Whereas with versioning models, what are we even thinking about? What are we interacting with? Run IDs and task IDs and that type of stuff, and maybe dashboards resulting from them?

Yeah, and I'll argue that run IDs are not the way to think about versioning cognitively. I'm not going to
look at a list of 10 UUIDs and know the difference between them. At least from talking to the interviewees, it makes sense to think about versions in terms of: this model version handles this edge case; this model version handles this other case; this is a simple model version for when our main production model fails and we need to roll back to something simple; this was last week's model version. Again, you could have all the models in the world, but if you don't have a way to look at one and understand exactly what it does in your brain, then it's really hard. I think this is a mistake a lot of model registries make, which is to show all the UUIDs in bold letters. I don't know what to do with that information; maybe somebody else knows. But that's exactly what you said: with code versions, we can look at diffs, and we know what the difference is. What does that diff mean to us in the data and ML space?

Yeah, and I love that you mentioned rolling back. I do want to get back to a broader vision of the three Vs, but we may as well drill down a bit more now. I'm interested in the different types of models: if something goes wrong, do people have something they can always roll back to? You mentioned this idea of shadow models; there are also challenger models. What are all these different things that people are working with when they're doing all of this experimentation and then deployment?

Yeah. Depending on the size of the company or how critical the ML pipeline is to the business, there can be so many; there can be thousands of versions of the same model. To give you an example of a really, really large company: every data scientist is running a live experiment or an A/B test, one or two at least, so there are that many versions of the model going on. At a lot of these companies, and not just big companies,
there's a backup model that people have. I think there's one quote in the paper that's like, "We will switch to a less economically viable model if it means that there will be no downtime." You just cannot afford to return nothing, and you can't afford to return absolute garbage. A lot of times what happens with the performance is that suddenly it might tank, and it might tank to something that's worse than the baseline simple rules. At that point, you want to have the baseline rules there so you have something, not nothing. And the more mature a company gets, and the more customers and end users it has (Facebook cannot afford to go down), the more it will have many, many models to make sure that there's always something to roll back to for almost all of those prediction systems. But again, that versioning bloat is unbelievably hard to manage.

And what's a shadow model?

Some applications can do this and some cannot, but the idea is to have a model making predictions on live data as the live data is coming in, but you don't surface those predictions to users. This doesn't work very well when you require user feedback, like in some recommender systems, where there's a loop in which the human influences the algorithm and vice versa in such a short period of time. But I've worked on cases where it does: predictive maintenance is a good example. I'm predicting when parts of vehicles, or parts of a computer, or any equipment might break; I could make predictions in the background and then, a few weeks later, check how those predictions did. A nice aspect of shadow mode is (maybe I'll phrase it this way) that a lot of times when you develop models, you might have some assumptions about the data that don't quite hold live, and feedback
delays are one of them. When you are deploying models in this shadow mode, you minimize that assumption gap: you're in the prod environment, and you have access to what the prod model will have. So that's good, but it is only applicable in certain tasks.

Yeah, great. You mention in the paper, and this is something we hear all the time, that 90% of models don't make it to production, or something like that. That's one of the most cited figures, whether or not we know it's true; it's like "80% of data science is preparing data," one of those things. It's believable. But you make a point (no, you make it explicitly): this is often stated as a negative. So let me ask you: if ninety percent of models don't make it to production, is that a good thing or a bad thing?

I don't think it matters. What more mature ML companies and teams will say is that as long as the experimenter can go from idea to results as quickly as possible, it doesn't matter if there are thousands of failed artifacts in the process. If I'm running 10,000 experiments in parallel (which I would never do, because I cannot keep track of that, but somebody probably would), if I'm running 10,000 different tuning jobs in parallel and 9,999 of them fail but one of them succeeds, that's good. That's great.

Absolutely, yeah. I think we forget that this is what science is about. Max Planck, or whoever it was, said that science progresses one funeral at a time, something like that, and that's so true. Science works through invalidating hypotheses and then promoting the ones that we're most certain of, quote unquote, however you want to quantify that, if you even do quantify it, which a lot of the time we don't.

Totally, totally. I think that's also to the point that velocity is super important. And then how do you validate early, so you don't waste
time on projects that will fail? I think that's the thing to care about. If there's something that is destined to fail, how can we estimate the likelihood that something will succeed or fail pretty early on, and then ditch the project if it's going to fail?

Totally, totally. So this is great; now we're getting back to the three most important aspects that determine the success of operationalizing and deploying ML: velocity, validation, and versioning. I'm going to actually quote you: high velocity means creating many versions; in other words, having high velocity means drowning in a sea of versions of experiments. So I'm envisaging some Pareto front where there's a tension and trade-off between velocity and versioning. But then you also mention there are synergies between velocity and validating early: if ideas can be invalidated in earlier stages of deployment, then overall velocity is increased. One more thing you mention is that creating similar development and production environments exposes a tension between velocity and validating: development cycles are more experimental and move faster than production cycles; however, if the development environment is significantly different from prod, it's hard to validate ideas early. So there's some sort of Pareto triangle, with synergies as well. I've mentioned a few, but can you speak a bit more to the relationships, correlations, and causations between these three incredibly important things?

Yeah. I think the nice anecdote there is on Jupyter notebooks. For a long time we had all been seeing some people so excited about Jupyter notebooks and some people who absolutely hate Jupyter notebooks. Everybody has strong opinions; everybody wants to give a monologue on their opinions on Jupyter notebooks. For me, it's been so many years of hearing this over and over again, why
people love it or hate it, and I wanted to know why it was so polarizing. It was very satisfying to me to frame it as: where do people lie on the velocity and validation spectrum? Some people want to move fast and break things, in Facebook speak, and they're okay with that if they can fix it. Some people do not want to move fast; they want to make sure that there are no buggy models and that everybody can review each other's work. Yes, that hinders velocity. And it's really hard, right? Some people want Jupyter notebooks because they can go fast. Some people don't want data scientists to go too fast, because then maybe certain scientific principles are disregarded, or maybe things are irreproducible. I don't really know. But it was nice to frame it this way, because there's no right answer. It's: where do you personally lie on the spectrum? And when you run a company or a team, what is the ethos that you want to create around where you want to lie? That was interesting to me.
Amazing. This is incredibly useful, this framework. So what I'm hearing is that, in this framework of velocity, validation, and versioning, people who prefer Jupyter notebooks are essentially prioritizing velocity, whereas people who are strongly opinionated against Jupyter notebooks are prioritizing validation and versioning, or mostly validation? How do you think about it?
Mostly validation. I think of it as validation more because the question is how you make sure the development and production environments are as similar as possible. Whenever there's a discrepancy, there is a chance for bugs. So you can remove the need to validate a lot when promoting from dev to prod if there is no real environment change from one to the other. One great example is that sometimes people
will iterate locally and then deploy to the prod service in the cloud. That's a huge environment mismatch, so you need to do some sort of big validation. I don't think people have solved this problem of making sure there aren't bugs in this mismatch of environments. But again, why is it so separate? Completely different hardware, not even in the same cloud. It's crazy. I think this also exposed to me that I am now feeling somewhat opinionated. But there's no correct opinion, right? It's: what is the opinion you want to hold and prescribe for the team that you are running?
Yeah, absolutely. And a question around notebooks, for example, is that if they don't have the affordances we'd like them to have, maybe as tool builders we build those into them as well, right? It's about how we want to build the future as well.
Yeah, yeah.
We have a really interesting comment from Evan Aldridge, who is at NVIDIA building Merlin, software for building, deploying, and maintaining recommenders. Evan wrote that, with respect to models failing or experiments failing, it's key whether it's failing because you're invalidating a hypothesis or whether there's an engineering failure that prevents the model from ever making it to prod. And that reminded me of something I found fascinating in your paper. In machine learning we think about hyperparameter tuning all the time, but you made it clear that a lot of the time you can get bigger wins from other parts of the pipeline, as opposed to just spending GPUs on hyperparameter tuning over some massive grid, right?
Totally, totally. To Evan's point about the engineering: from so many of the participants, and even papers, it is not news that engineering bugs are largely the cause of production model failures. Like, a data pipeline didn't run, or now a lot
of columns are null, or you ingested a dataset that you usually ingest but somebody swapped two columns. It happens all the time, right? Silent failures make their way to garbage predictions. That also gets to the data validation pain point that I talked about, the Goldilocks thing in the paper: data doesn't need to be a hundred percent pristine and perfect going into machine learning models, but it shouldn't be zero; it shouldn't be total garbage. So what is the sweet spot, and how do you identify it? This is of course different for different tasks. I'm writing a paper on automatic data validation right now, and it's even different at different time periods. The people at the company I'm consulting for, who are writing the paper together with me, are saying: around Thanksgiving and the holidays, we need to make sure that we don't have failures. We don't care if there are more false positive alerts; there just can't be failures. And now I'm just like, oh my God, how are you supposed to build an automatic data validation system for all sorts of tasks and all sorts of datasets that's tunable depending on who is on call, literally, and what time they're on call?
Yeah, fascinating. So we have another question from Srinand, which I'll get to, because it's actually around tooling and evaluating tooling, and I want to spend a bit of time at the end talking about tools. So just to foreshadow that: we'll get to your question; it's a fascinating question. You mentioned the role of data in these models, and you hit on this in your paper, but I'm interested if you could speak to what you've seen with respect to the differences between the data-centric machine learning paradigm and the model-centric one. This comes back to your point about what's taught in textbooks and what's taught online, on
like all the competitions we do and that type of stuff.
Yeah. So I think the courses now, I don't know about textbooks, are talking about data-centric AI, but the premise is flawed. It's the same premise as model-centric AI, in which, by hook or by crook, we will edit model hyperparameters until we get something that works on a small validation set. People apply the same ethos when it comes to data-centric AI: by hook or by crook, we will add three or four examples, or remove six of these labels, or clean twelve of these labels, and we will get one percent or five percent better performance on a validation set. This is the same ethos. Maybe this is easier to do in the data-centric sense than in the model-centric sense, but we found in the interview study that this is not at all the way to get lasting gains. You want to get a win that lasts beyond the initial validation. We talked about this in the section around experimentation, where you want to find ideas that lead to huge gains in the first offline validation, because there are diminishing returns in successive stages of deployment. In the offline validation stage, if you get a 15 percent boost (that's kind of high), or say five percent, then in the third stage of deployment later on it's only going to be like half a percent. Once you account for these diminishing returns as you go down, that changes the way you think about your experiments: what can I do to bring long-term gains? It's not about editing the view of data that I'm training my model on. I want to add new signal; I want to go find a new dataset that will add new signal to the model. I want to fix engineering problems. I want to add data validation so I don't retrain on corrupted data. These are the big wins that give you the long-term boost. I hesitate to preach about data-centric AI and the way
that it's taught. Yeah, I don't love it, but I'm curious what other people think.
Yeah, great. So I now want to jump into pain points; you talk about a lot of them. I think sharing stories of suffering is incredibly important for all of us, civilizationally, but particularly in such a nascent field. What are the biggest pain points you and your colleagues have noticed faced by people deploying machine learning models to production?
Yeah. The ones we talk about in the paper were echoed by all the interviewees. One is the mismatch between dev and prod environments; a lot of bugs can be traced back to this, like label delays, data leakage, all sorts of things. Another one is data validation: there are way too many alerts that engineers get when they're on call monitoring models. If we put CloudWatch alerts on every single feature for a model, the probability that at least one of them goes off is insanely high, and as a result people will say things in the interview study like, "We ignore 90 percent of alerts." So that's another pain point.
So these are false positives, right?
Yeah, false positive alerts, where I get an alert that a feature has changed when the resulting model performance is still kind of the same.
This is incredibly important, because we talk a lot about the technical stuff and we don't always talk enough about the human. I think you maybe even mentioned the word trauma associated with high false positive rates. It's really important to identify trauma in a discipline, right? If people can't do their jobs and are getting spooked constantly because of this...
Totally. "Spooked" is a good word. People were saying how they dread going on call. Some teams with larger models will require two people on call at the same time, like a primary and a backup, because it's just too much: too much responsibility, too much of an onslaught of information.
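The remark about per-feature alerts can be illustrated with a little arithmetic. This is only a rough sketch under assumptions that are not from the conversation: every feature alert is treated as firing independently, with the same false-positive rate per monitoring window, and the function name `prob_at_least_one_alert` is hypothetical.

```python
def prob_at_least_one_alert(num_features: int, per_alert_rate: float) -> float:
    """Probability that at least one of `num_features` alerts fires.

    Assumes (for illustration only) that alerts are independent and share
    the same per-window false-positive rate `per_alert_rate`.
    """
    if num_features < 0:
        raise ValueError("num_features must be non-negative")
    if not 0.0 <= per_alert_rate <= 1.0:
        raise ValueError("per_alert_rate must be in [0, 1]")
    # P(no alert fires) = (1 - p)^n, so P(at least one fires) = 1 - (1 - p)^n.
    return 1.0 - (1.0 - per_alert_rate) ** num_features


if __name__ == "__main__":
    # Even a 1% per-feature rate adds up quickly as features are added.
    for n in (10, 100, 1000):
        print(n, round(prob_at_least_one_alert(n, 0.01), 4))
```

Under these assumptions, the chance of being paged grows quickly with the number of monitored features, which is consistent with the alert fatigue described above. Real feature alerts are usually correlated, so the exact numbers would differ.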
Recently I was talking to someone who was saying, "Oh, in the SRE world we also get way too many alerts; we have so many CloudWatch alerts." And I was like, yes, I know, I've gotten both kinds of alerts. The difference between the SRE alerts and the ML alerts is that over time you can develop a sense of which SRE alerts to ignore, but you cannot do this in the ML setting. The important features are always changing. If I move to a new ML pipeline, everything is different; I have to relearn what the data looks like, where we are getting the most signal, what the most important features are. What if it's not a tabular model? What if it's an image thing? All of my old knowledge kind of goes out the window. You can't really anticipate which alerts you can ignore and which you have to listen to. And then there's no fix, right? Sometimes some pods are down and they won't come back up, and I know how to fix that. But if customers are complaining about low-quality ML predictions, I don't even know where to go. I could be a veteran ML engineer and I wouldn't know where to go. And that's another pain point: taming the long tail of ML bugs, I think is what we said in the paper. People will spend so long slicing and dicing and trying to figure out where a bug lies, only for it to never happen again. Then some other thing comes up of similar magnitude, with the same debugging pattern and the same time to debug, but it's a completely different bug. And it doesn't get better. So how do we think about making this experience better for people? My personal interests are around this: there are always going to be humans building and maintaining these ML models, so how do we make their experience better? That's how I approach the pain point thing.
What's the role of subject matter experts and domain expertise in building and operationalizing models, from your...
Like, when I ask, "What do you look for when you want to try a new experiment? How do you get an experiment idea?", a lot of them ping subject matter experts on Slack, or they talk to people. This goes back to validating early: get as much signal as early in the workflow as possible that this experiment will be a good experiment. There are no wins around "I came up with this idea myself," right? A lot of times companies will have researchers or data scientists who are very familiar with the data or the customers or what people want, and it's really helpful to be on the same page. I talked to an interviewee at a mid-sized company who said that their personal productivity went up so much when they got a sense for what the product metrics were that were being evaluated, like click-through rate. And sometimes it wasn't just click-through rate; it was how long somebody spent browsing all of the recommendations. These are two completely different product metrics, but we think about the same ML metrics for recommender systems, right? If you have a good sense of the product metric that the team is being evaluated on, it's easier to come up with experiments that could really win in the end. So yeah, the role of subject matter experts is very important, but it's not just SMEs; it's broader than that. It's people who are familiar with the users of the system, and people who are familiar with the data. As ML engineers, we cannot do it all; we cannot know everything. We have to leverage information.
We can't even figure out dependencies, so how can we be expected to... I mean, of course that's tongue-in-cheek, but you're absolutely right. I wasn't sure we'd get to this, but I'm so glad you mentioned the emphasis on making sure that we're evaluating our models with respect to product metrics, which tie
it into creating value for whatever organization we're at. Making sure that all of these things are matched as much as possible, and that we continually evaluate them, is incredibly important. Something we've been dancing around is software engineering: classical, traditional software engineering and machine learning engineering, same or different?
Oh. If I say something I'll offend somebody, no matter what.
We've already talked about notebooks, so we've clearly offended both sides. But we have to offend everyone if we're going to offend someone; that's the rule.
Okay. So I guess this is a nice preview for my NormConf talk, which is going to be on how all of my machine learning problems are data management problems. So maybe that's how I feel about engineering. I don't think it's software engineering per se that is really the skill set an ML engineer needs to have if they want to be a 10x engineer, to use a stupid term. I think it is an understanding of: what is a table, what is a relation, and what is a view? This is a nice one; a lot of people don't even know about views, and that's fine. A view is: I run some query on a dataset and I store that as a view, which will either be materialized before I query the view or materialized as I query the view. So there's a question of when I materialize the view. This is the same problem in machine learning, if you think of a machine learning model as a view over the underlying training data: when I train the model, that's when the view is materialized. So all of the problems around view staleness are the same thing as model staleness. We don't want to compute the view on wrong data; we don't want to train the model on incorrect data. These are all problems that we've been talking about time and time again in databases that are showing up in the ML world. So in that sense, I think ML engineering really, really
is just recast data problems.
Yeah, fantastic. And for anybody who doesn't know, NormConf is a free conference that's happening online; I think if you look up NormConf you'll find it. Correct me if I'm wrong, but the premise there is that a lot of people talk about things that aren't particularly relevant, that a lot of the conversation is live action role-playing, so it's about what actually happens on the ground, essentially. That's my take on it.
I feel lucky that Vicki is letting me give this talk, because I was supposed to give a practical sort of talk. A lot of people don't even know how to reason about machine learning, but my argument is that you definitely know how to reason about machine learning if you know how to reason about data.
That's beautiful; I'm definitely going to quote you on that. So I want to get into tooling. Of course, we talk far too much about tooling in this space, and there's the old joke that we have far too many tools and not enough at the same time. But I'm interested in what you discovered around tools, and maybe we could frame it around Srinand's question, which I find fascinating. I think about this far too much; I lose sleep over it. "There are a lot of tools present for these problems being mentioned: versioning, experiment tracking, etc. Do ML practitioners truly try to evaluate, learn, and adopt these tools?"
Great question. If an ML engineer is truly trying, to the best of their ability, to evaluate, learn, and adopt the tools, that means they're not doing their job. They're responsible for pipelines in production, they're responsible for responding to bugs, and they're fighting fires 24/7. If they don't know 100 percent that this tool will help them fight fires, it is
an absolute waste of time to try it, especially because there are a lot of tools out there. I think a lot of people think, "Oh, ML engineers are lazy; they don't even want to try my tool." It's like, no: I'm not confident that your tool will 10x my velocity, my validation processes, or my ability to manage versions. And 10x, not just slightly, because it takes time to integrate things; everything is glued to everything else. At least that's how I reason about tools as a tool builder and an ML engineer; I've been on both sides. So as an engineer, could I definitely spend more time trying tools? Yes. Will I not get promoted? For sure. So I don't know what else there is to say about that.
And just to clarify, since we've mentioned this several times: when we're talking about 10x, are we now talking in terms of the velocity of actually deploying models, the time taken to deploy and then iterate, essentially?
Yeah, that's one of them. If the tool really increases my experimentation velocity, awesome; for a lot of people that's a time sink. If the tool makes sure that I never train a model on data that's corrupt because of engineering bugs, that's amazing; so much of my time as a practitioner is wasted figuring out what engineering bug caused my ML model regression. I rarely feel like the tools are really designed to alleviate my pain points in that way. I feel like a lot of modern tools are like, "Look, here's something that you can glue to your existing jumbled ball of tools that may or may not provide you with anything." I don't know.
So I think this is a nice note to wrap up on: what are the biggest opportunities for future MLOps, for lack of a better term, for MLOps tools, tool builders, and research?
Yeah. Read the paper for all of the takeaways. As for the ones that I am particularly excited about research-wise, because I need to get a PhD:
I'm interested in an interface for a hybrid rule-based plus AI system. A lot of times we deploy models and we tack rules onto them, and these rules are often glued on. I would like a framework where rules are first-class citizens: I can define these rules the same way that I define models, and compose them super easily. I think that's one very interesting way of thinking about these evolved ML systems, but that's maybe more researchy. More practically, I think there's a lot of low-hanging fruit in the data validation space. I know there are a ton of ML monitoring companies out there, but they don't seem to be solving this false positive alert fatigue problem. Think about it this way: how do I make sure that there aren't that many false positives, so that I can recall 90 percent of failures while maximizing precision? And how do I do this in a model-agnostic way? I'm writing a paper on this, so that essentially will be solved. I'm trying to think about others. I think there are so many things you can think about around validation, like helping people construct dynamic validation sets. A lot of companies will have these folders of validation data that they just add to, because analysts or labelers are adding to it. But how do we make it easier to manage and version these as the datasets grow? I think there are so many things we can do; those are just a few.
Absolutely. And data has a half-life, right? There's some decay by which its relevance decreases, and that's one of the many reasons.
Yeah. All of the problems are data management problems.
Absolutely. So Evan from NVIDIA has another great question: any advice for tool builders to design tools that provide that 10x velocity? And he makes clear that they try to build those principles into Merlin, for example, by thinking about not just the speedups but the workflows of the engineers.
This is a great question. In some of my
collaborations with non-technical folks in the lab who are building ML tools, we're seeing that there's some slowdown when people are moving between different tools. For example, people will write code, and then they'll write in their Google Doc or in their spreadsheet, and it takes time to context switch between these things, all in the name of experimentation. So how do you remove that context switching? A one-stop shop from when you have the idea to when you get the result, I think that's very interesting.
Yeah, absolutely. So it's time to wrap up. I'm enjoying this so much, but we'll have opportunities to chat more in future. I think it'd be nice to wrap up with something not necessarily about this work, although if it's relevant it'd be nice to know about that. My question is: what are you really excited about thinking about next? What are you working on currently? What gets you jumping out of bed in the morning, excited for the day, in the ML space?
As I'm wrapping up this automatic data validation work I mentioned, I'm excited about interfaces for hybrid non-ML and ML systems. I think every system is eventually going to be a combination of the two. And interfaces are an interesting question. A lot of people talk about how prompting is supposedly a query language. It's absolutely not a query language; there are no semantics to it. But why is it that these language models don't have any practical use? They're not being used in businesses very much, with the exception of a few. We need some way to provide semi-formal guarantees on semantics, so I'm super excited to think about that. And then once that's semi-formalized in a way, what does it mean to run queries, to optimize queries? I love query optimization; I'm a database person. So I think there are all sorts of interesting
technical problems there that I'm very excited to think about.
Fantastic. We've just got a comment: Evan says, "Thanks so much, always great to hear from Shreya," which is nice. For those interested, we're going to wrap up now, but we'll have an async AMA on our Slack, so if you want to join, Shreya and I will answer questions over the next week or so. Shreya, thank you so much, not only for your wonderful paper and work but for joining me today. It was such a fun conversation and I really appreciate it.
Thanks so much, Hugo. I enjoyed being here.
Awesome.
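The "model as a view over the training data" analogy from the software engineering discussion can be made concrete with a toy example. This is a minimal sketch, not anything shown in the talk: `TrainingTable` and `ModelView` are hypothetical names, the "model" is just the mean of the rows, and staleness is detected with a simple version counter.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class TrainingTable:
    """A toy base table; `version` increases on every write."""
    rows: List[float] = field(default_factory=list)
    version: int = 0

    def append(self, value: float) -> None:
        self.rows.append(value)
        self.version += 1


@dataclass
class ModelView:
    """A 'model' treated as a materialized view over a TrainingTable."""
    table: TrainingTable
    trained_at_version: Optional[int] = None
    prediction: Optional[float] = None

    def materialize(self) -> None:
        # "Training" is materializing the view: here, a trivial mean predictor.
        if not self.table.rows:
            raise ValueError("cannot train on an empty table")
        self.prediction = mean(self.table.rows)
        self.trained_at_version = self.table.version

    def is_stale(self) -> bool:
        # The view is stale if the table changed since the last materialization.
        return self.trained_at_version != self.table.version


if __name__ == "__main__":
    table = TrainingTable()
    table.append(1.0)
    table.append(3.0)
    model = ModelView(table)
    model.materialize()
    table.append(8.0)  # new data arrives after training
    print("stale:", model.is_stale())
```

Retraining on every write corresponds to eagerly refreshing the view, while retraining only when `is_stale()` is checked at prediction time corresponds to a lazy refresh. Which is preferable depends on the cost of training and how much staleness is tolerable.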
Original Description
Shreya Shankar is a computer scientist doing her PhD in databases at UC Berkeley. She was the first ML engineer at Viaduct, did research at Google Brain, and did software engineering at Facebook. In this fireside chat, Shreya joins Hugo Bowne-Anderson, Outerbounds’ Head of Developer Relations, to discuss her team’s recent paper Operationalizing Machine Learning: An Interview Study, and what they discovered about the common practices & challenges across organizations & applications in ML engineering.
After attending, you’ll know about
- The main tasks that people do in the Production ML lifecycle;
- Key properties of ML workflow and infrastructure that dictate how successful deployments will be;
- The biggest pain points faced by people deploying ML models to production;
- What strategies ML engineers employ to sustain model performance once deployed;
- What the biggest opportunities for future MLOps tools and research are;
And much more! The fireside chat will be followed by an AMA with Shreya and Hugo at slack.outerbounds.co.
00:00 Prelude
04:42 The fireside chat begins!
07:20 Why it's important to talk now about patterns and pain points from MLOps practitioners
14:10 The main tasks in the production ML lifecycle
19:29 The 3 factors that determine the success of ML projects
27:30 Models to roll back to when things go wrong, shadow models, and challenger models
31:11 "90% of models don't make it to prod" can be a good thing!
33:16 Trade-offs and synergies between Velocity, Validation, and Versioning in machine learning
36:20 What using notebooks actually prioritizes
41:45 Is the premise of data-centric AI flawed?
47:57 The role of subject matter experts and domain expertise in ML
50:33 Software engineering versus machine learning engineering
55:40 What even is a 10x ML engineer?
56:52 The biggest opportunities for MLOps tool builders and research
Find out more about how we think about MLOps, OSS, and human-centric data science tools here: https://outerbounds.