Our 1st MLOps Meetup // Luke Marsden // MLOps Meetup #1
Key Takeaways
The video discusses MLOps, its importance in machine learning, and how it can help data scientists work remotely, with a focus on the intersection of software engineering, DevOps, and machine learning, and featuring tools like Git, GitHub, and Terraform.
Full Transcript
looks like we've got sort of critical mass of people showing up for this first inaugural ml ops community online meetup so thank you all so much for coming and really appreciate it my name is Luke I'm the founder and CEO of science and we're kicking off this ml ops community in collaboration with our friends at Bristol which as many of you may know is a meetup group that's that's normally based in South West of England around Bristol but I think we probably have guests from all over the world today because that's the wonder of an online event I just wanted to to quickly introduce my colleagues who use no space as you can see in front of you as well so we have Chris Demetrios and Mark here so these guys are here too to help get things started and they're available with if you have any questions on the chat please please just shout apparently you people can only see my face but okay that's fine and there are three other people on the call whose faces I can see but yeah great so I wanted to start this event really by I just sort of acknowledging that this is a strange time with global pandemic happening and that we probably most people are watching this from home so I just wanted to acknowledge that it's that it's a weird time and and and that hopefully we can sort of come together as a community and the way that we wanted to to contribute was was with this this Emma Watts community that we're kicking off today the idea is to share ideas and best practices around DevOps for machine learning it's intended to be an open and and friendly space so everyone is welcome and really everyone here you are the initial community so this is really your thing as much or more than it is that al thing so please do get involved I think after 40 minutes when we finish the talk we're going to open it up so that everyone can can speak and that we can have an open Q&A and discussion I hope everyone's okay with that bear in mind that this session is also being recorded and we'll put it on YouTube later so if you don't want to be recorded after the 40 minute mark then just keep your your camera off and and and your microphone muted and feel free to use the chat but yeah so this is this is really our channel as a community if you have ideas for for talks or ways to use the the weekly timeslot that we have here then please speak up either in the at the end of this session after 40 minutes or or in the slack or on the forum on em a lot stock community so yeah please don't be a stranger and I also just wanted to acknowledge bridge tech I already mentioned bridge tech but a huge thanks to to Nick and Thomas at bridge tech who helped us get the word out about this Meetup and they are sort of founding partners for this community effort so I'm going to jump straight in and I'm going to share my screen and go through the first talk of the evening or morning depending on where in the world you are and so hopefully everyone should be able to see my screen and I'm just going to assume that it's working because zoom just normally work and I'll also pull up the chat so that I can see people people's questions or comments and as we get through so the talk I wanted to present today is really an introduction to what is inner lives obviously if you came to this meetup you probably have some idea and you probably have an interest in it but I wanted to sort of start start this start these sessions with with an introduction and I also wanted to put a little spin on it which is how will ml ops properly done actually helped me work from home because it's it's now becoming the new normal that especially for well it's becoming the new normal that everybody in the world is working from home which is kind of crazy kind of interesting and in particular if you're a data scientist or someone managing data scientists then been working from home and working remotely and doing asynchronous collaboration can be particularly challenging because of some of the complexities so I'll also talk about that and then I'm going to do a live demo so please could everyone pray to the demo gods I'm going to attempt a live demo of some of our collaboration tooling and then hopefully we we have a good chunk of time afterwards for Q&A so to jump straight in then what is ml ops ml ops is the intersection of these three disciplines DevOps software engineering and machine learning and so I'm going to talk for a short time about each of those disciplines in turn and and then with I'm going to talk about how the intersection of them is kind of interesting and complicated so if we look at how software engineering has been done I'm not going to spend too long on this but the basic point here is that software engineering has changed a lot in in the time that it's been even a discipline we used to all log into big mainframes we used to email patch files around to each other and the way that that collaboration happened for software engineers used to very much be around sort of manually integrating changes to source code this was prior to version control prior to things like subversion and for the longest time the Linux kernel for example was developed by people just emailing patch files around and maintaining their own local copy of the source tree and that was painful for obvious reasons and it took a long time to integrate things and then subversion came along and we we got to the point where we had version control and you could check in and check out things and that was a great improvement and then the additional sort of improvement beyond that is this notion of distributed version control where you can sort of do social coding on github and and that's kind of in my mind from a collaboration perspective where where we've got to from a software engineering perspective so the next piece is DevOps and so if software engineering is how we create software than DevOps is how we deploy it and operate it and DevOps has also undergone some significant changes in in the last few decades and years I am personally responsible for being one of the people who used to edit code live on the server I don't do that anymore but but that used to be the way that a lot of things were done people also used to build binaries and email them around and so people used to for example send wire files email of our file to assist admin and get them to deploy the latest version of the software on the server and and obviously that had problems and then this thing happened that was called DevOps and DevOps is about breaking down silos between dev teams and operators and and and so it was kind of an a change in the way that teams are organized and it's also a change in the way that the software is deployed and managed so now the people who write the software are responsible for shipping the software and operating it and in in most well form DevOps teams and continuous integration is a thing so software it gets continuously - Matic be tested whenever changes are pushed to version control there's a little stir being a big move away from running your own hardware towards public cloud and more recently there's been a big move towards using immutable infrastructure like docker and communities and even this notion of 'get ops which is that you should also store the configuration for how your software is operated in version control as well as the source code for it and and then source control becomes your source of truth for the software that or the infrastructure that you're operating and that has some some obvious benefits and then finally the third discipline that's intersecting here is machine learning and machine learning is basically how we use data and maths train models so you make predictions about the world and I won't go through the whole history of machine learning here but just to say that sort of there was this thing that's called the AI winter which is where [Music] there was a lot of hype around AI kind of the first time and I think it was in the 70s people in the audience may well correct me on that and there was a lot of hype around AI and it turned out that the both the algorithms but also the hardware wasn't it wasn't good enough it's a little bit like the comm bubble in a sense that the internet speeds back in 2000 weren't good enough and the hardware wasn't good enough to the connectivity wasn't good enough to support all the hype that was going on about the about the web and this is AI will to do this is a huge peak in high penis around machine learning and AI that that then kind of collapsed when people realized that it wasn't actually going to be able to deliver on all the hype and we've kind of gone around that cycle once already and now we're in another hype cycle around AI and machine learning and now this time around it does seem like it is it is set to seriously change the world really and to change the way that and it's it's starting to get adopted very seriously inside lots of large enterprises which is which is interesting and I also just wanted to have a nod here to the picture in this slide it comes from a local Bristol company called Groff core and they make some very beautiful pictures that are representations of sort of visual representations of how neural networks actually work which which are pretty cool and so basically deep learning is now becoming and machine learning in machine learning and deep learning is becoming computationally feasible and there's this huge rush that all the enterprises around the world are in they're kind of locked in an arms race to to get value out of AI and so we end up in this place where machine learning is effectively emerging from research and it's being production eyes door it's like companies are attempting to production eyes it and this means that ml is emerging from research and that these disciplines of software development DevOps and machine learning needs to converge so that people can build models and they can deploy them and they can have the same level of rigor and reliability that software has in DevOps and and so this convergence of these things will be called and is already called ml ops and that's why we've named our community after this new emergent tone now so I personally believe that AI and machine learning has the potential to reinvent the global economy and hopefully in a good way because I think it has the has the potential to eliminate a great deal of boredom that is currently suffered by humans in in doing very tedious jobs it's kind of outside of the scope of this talk to talk about the political changes that are going to be necessary to to make that non-disruptive but anyway um so I think we can all agree that AI has the potential for for good and as a discipline what we found is that it's extremely immature so it's kind of like the Wild West out there and we've seen damaging levels of chaos and pain as companies go from what we call playing with AI to actually trying to use AI to deliver business value which is often referred to as operationalizing ai and so some of the things that we've seen that cause these pain points are things like models being blocked before they can even get to deployment so so we spoken with with many companies that are trying to do AI in production and one of the things that is a very common theme is that we just can't we can develop the models on our laptops but we but we just can't get them deployed into production and of course that sucks because that then the time to value is very very high if it takes a long time or if it's not even possible at all to deploy models even if it takes three or six months to deploy a model then then by the time you've got them on and deployed the data was already stale and and often AI projects in in companies get branded as a failure just because they can't around the door another problem that we've seen is the models aren't monitored and there's actually item number six on this list and so we had one company tell us that they deployed a model to production and then the model went wrong and they lost an unmeasurable amount of money as in they actually couldn't figure out how much money they'd lost because the model had gone wrong because they've had monitoring it and there are unique challenges around monitoring models that are different and monitoring regular software Microsoft's is that I'll talk about in a minute and some other problems that we've seen that people end up wasting a lot of time so if there are multiple data scientists or machine learning engineers collaborating on trying to trying to work on a model together then a lot of time is wasted just trying to set your development environments up so that they are similar enough to that your work is that the work that someone does is compatible with your environment or spending a lot of time copying datasets around between different machines it's also kind of ties into the third point here which is a lot of inefficient collaborations so we've heard people saying things like oh we just keep our data science teams small and we keep them in the same room so so they can collaborate by talking and and obviously in the current climate that is particularly challenging because you don't want to put people in the same room anymore and and even using paper notebooks to keep track of their work that they're doing and the results that they're keeping track of and and that relates to the fourth point which is manual tracking we've seen an awful lot of companies that they put the code like the some reference to the version of the code and the data and the premises that they used when they're training a model along with whatever like met metrics like accuracy score for that model they either put that in a wiki or on paper or in a or in a shared spreadsheet we've seen an awful lot of shared spreadsheets kicking around and there's another problem with those with those spreadsheets is that there's a lack of reproducibility and provenance and what this means is that you can't go back from a model that's running in production - well what data was it trained on and where what parameters were used like who even created this model do they even still work at this company like being able to keep track of the work that went into creating a model causes lots of complexity when when when you're trying to scale up the team or when the number of models that you have increases but the key point here is that if you can't deploy your models into production at all to begin with then you're not going to get to the late problems of a lack of reproducibility and provenance and so actually the most fundamental thing we need to solve in ml ops is making it easy to get models into production in the first place um and so this is kind of oddly familiar because we've been here before and if you look at how software engineering was in the 90s it was very siloed not everything was in version control there was no continuous delivery and it took a long time to ship software when it was being manually integrated and when there were these long like QA and release cycles and now software ships in visit in minutes in some times in some cases seconds modern tech companies that are shipping software will often say when they onboard a new employee well on your first day your job is to make a life change the products a small one but you push it through the continuous delivery pipeline and get it out we make dozens of changes every hour in some cases to different micro services but but AI is kind of still stuck in this pre DevOps world of collaboration and the large part of that is because the tooling isn't up to scratch so we have this kind of manifesto so this is what we care about and why we get out of bed in the morning is because we we're working on trying to solve this ml ops manifesto and the manifesto is in the form of four tests so you can kind of apply these tests to your own ml ops pipelines and you can form an opinion about how mature you are against the different the different requirements here so the first requirement is that your model training and deployment pipelines have to be reproducible well that means a good test for this is if I can come along nine months later if someone else can come along nine months later and retrain a model that was trained by somebody else who without even talking to them whether let's say an old version it tends to flow on an old dataset with on hardware that is sufficiently equivalent that they can retrain the models within a few percentage points then you've got a reproducible ml ops pipeline and if nine months later you can't because you upgraded the version it tends to flow on your development machines and the date has gone somewhere and you don't know where the data's gone the date has changed in your production database then you've failed the reproducibility test and if you fail the reproducibility test then you're in trouble from a governance and compliance perspective in some industries the second test is is your ml ops pipeline accountable and we talk about accountability from the same perspective that we hold humans accountable for their decision-making process and one of the ways in which you do that is you say on what basis did you make your decision and the on what basis question with machine learning as a minimum requirement not even going into the whole area of explain ability but as a minimum requirement you have to be able to say what version of the data was the model trained on so you need to be able to track the model back to the provenance of where that model came from what data it was trained on by whom and and so on the next point and it's especially pertinent at the moment is this collaboration requirement so it has to be possible to do asynchronous collaboration and this is something that software devops has got sorted and ml ops doesn't yet mostly and this means that I need to be able to if for example if if my colleague Chris is working on a model I need to be able to make a fork of that model and I need to be able to make changes to it without treading on Chris's toes so we both need to be able to collaborate asynchronously and and get useful work done now this has kind of influenced the design of what we're building to a large extent because because we believe very much in the sort of github request style of collaboration the data scientists are familiar with and and there are some challenges in in making that possible for 4ml and then finally the model development process has to be continuous and so there are a couple of things that I mean by this the first one is that the development process must be automatic the deployment process sorry must be automatic so it must be possible to automatically deploy a model into a staging environment or production environment without manually emailing Jupiters of notebooks or or tensorflow files serialize test flow models around because as soon as you start doing things manually then it introduces this possibility for the human error and the other piece is that you have to be able to statistically monitor your models and this is interesting because monitoring models is specifically is quite different to monitoring regular software that you might deploy it as micro services and the reason for that is that when you monitor software you can monitor things like latencies and error rates but when you monitor Mike Rosario when you monitor models machine learning models they can be giving you perfectly normal latencies and perfectly normal error rates and the model can have gone completely haywire and the reason for that is basically if you already knew the right answer for what the model was predicting then you wouldn't need the model in other words the production data is unlabeled and so this means that it's challenging to understand the behavior room of your model once it's running in production so an example might be that I might have deployed a model for four autonomous vehicles that classify road signs and so you might have a bunch of cars driving around with models running on Hardware in the cars and sensors cameras basically on the cars that are looking around for the road signs and if you already knew what road sign the sensor was looking at then you wouldn't need the model right but at the same time it means that it's hard to understand the behavior of the model of production and there are some solutions to this including looking at the statistical distribution of the classifications the model is making if it's a classifier and then you can say well if the actual distribution of classifications drifts very significantly from my expected distribution like the distribution that I used in training in the training set then maybe Paige a human like fire and alert and get a human to look at what's going on because either you deployed a bad model in which case well you need to know about it so that so that you can roll back and so that you can figure out what went wrong with the new deployed model or the world changed and especially with things like computer vision it's it's often surprising like how the models actually distinguish features in the data and and you can get stupid things like the the computer vision model might never have classified any or never it never be trained on any stop signs in the snow and for some reason it can't classify stop signs in the snow so suddenly it snows over a large part of the country and then your stop sign classifier stops working and obviously you're in trouble so you need to have that statistical monitoring and so those are the requirements and so I'm going to talk about how how we can try and address those requirements using ml ops tools and and just before I do that I want to talk a little bit about how the the software life cycle sorry the ML the ML ops life cycle is just fundamentally more complicated than the software development life cycle so when you're doing software development and life is easy it kind of it's easier anyway because all you need to do is you need to think about versioning specific versions of your code because as long as your doctor izing things then you ought to only need to know exactly which version of the code was used plus the doc file and dependencies in order to reproducibly build the same or effectively the same deploy the last fact and so you can build you can create code you can test that code in CI you can deploy that code into production and then your monitoring system can tell you how well it's performing maybe you need to upgrade a database maybe you need to go and optimize a certain code path and you can go around this loop as quickly as you can that's basically what we're doing with tables but with machine learning in ml ops you've got this fundamentally more complicated thing going on which is that you've got data coming into the system which is a major form of entropy and you've got code that's that's being written to train the models and you've got parameters which are being fed in to the training the models that model runs the training model run the model training runs sorry and all of these things combined and before you even train the model you've also got the data runs which is when you're doing feature engineering or your filtering or splitting the data and so you're executing a certain version of a piece of code against a certain input dataset and creating an intermediate dataset that's what a data run is or you're you're running a certain version of a piece of code against an input data set like a test training and validation set and then you're creating a model as an output and so when you're doing ml when you're developing machine learning models you are doing these data runs and these model runs whether you're keeping track of them or not these might be the cells that you run in your Jupiter notebook or the Python scripts that you run on your laptop and these runs are happening and you just haven't given a name to them and so what we've realized is that it's really important to introduce this notion of run as a kind of a building block it's a fundamental object in the system when you're doing Emma Lots it's this version of this piece of code ran in this environment with this input data and these parameters and it created this output and one of those types of outputs like I said is models and then the model itself the serialize model file is the deployable artifacts it has some metrics associated with it like how well it performed against the validations data and then it's that model that you're deploying into production and then you're monitoring in the way that I described that monitoring models is different to monitoring microservices and then you're monitoring might actually cause you to go back around all sorts of different loops through this diagram so that's kind of the the flow that we're looking at here so I'm a little short on time so I'm going to skip over the lifecycle quickly the lifecycle is really just a visual representation of how the diagram I just showed you is cyclic and so you do data engineering you need to keep track of the data runs when you're doing things like feature engineering and then you train a model which means you take a certain version of the data and you you train the model and outs of the model development process come serialized models typically docker images with serialize models baked into them and then those serialize models get run in production or staging and then production and then there are these kind of two feedback loops there's the statistical monitoring a feedback loop which is the first one where you can say show me the behavior of my model in production in real time based on like the distribution of classifications it's making and then there's also this slow feedback loop which is that you retrain the models with new data as you get new data coming in and of course the model making decisions can change the world and so can also impact on the data that is being recorded in the database so so that's kind of the the ML ops life cycle and and hopefully I've adequately answered the first part of the talk title now which is what is ml ops my opinion is that ml Ops is not just about operating models it's actually about the entire lifecycle of doing data engineering training models and then getting models into production and it's about being able to implement this process in in your company or in your organization with the same level of rigor that sort of best-in-class DevOps and software engineering teams manage and so let's kind of take a step back there and say well how can how can properly implemented ml ops helped me in in the current global situation that we have where we're now all suddenly working from home and in brackets if I'm a data scientist or or a manager of data scientists well the answer to that question comes down to the collaboration piece that I mentioned earlier so that's kind of do a deep dive on collaboration and so there are two way there are two fundamental modes of collaborating between different people doing doing work and there's synchronous and there's asynchronous and synchronous collaboration is when people are sitting in a room together and they're interrupting each other when they have a question and when they need to get something done and in particular with machine learning it might even be that they're sharing an environment so they they might even be pair programming sharing a text editor so they take it in turns to use the keyboard they would be working on exactly the same data set in exactly the same environment and then you're kind of time slicing it well hopefully pair programming is is effective and okay so people can't make changes to this thing at the same time but we just take it in terms or there's an asynchronous approach which is to say that different people should be able to do work on different copies of things and problem with an asynchronous approach of course is that you then need to be able to cope with conflict it's like merge conflicts in in get um which takes me on to the second point here which is how how does software DevOps seem to do it well with github basically and tools that are like github like git lab and bitbucket and and all of the tooling around that and the the way that that works as I'm sure most of the people in the audience know here is that you can fork someone else's project effectively or if you're or you can make a branch from master you can make some changes in your branch and while you're making the changes in your branch you're not trading on anyone elses toes when you then propose those changes back to the master branch that's when you can pull in new changes from the master branch and integrate them into your branch and that's when you have to deal with merge conflicts and then you can propose a version of the change which is up-to-date with respects the master branch that's basically it's commonly known as git flow and it's been very successfully used in pretty much every team on the planet people have tweaks to that approach with multiple master branches and so on but it's all fundamentally the same idea of asynchronous collaboration and then with DevOps team's adopting things like git ops where your source of truth for what's running in production is also in a git repo well you can use tools like terraform or your commercial control your kubernetes yeah yeah Milles and then happy days you can use the same collaboration approach pull requests when when you're deciding to scale up the cluster or deploy a new database and doing the other things that DevOps teams do so so how can you so what are the challenges are playing asynchronous collaboration to machine learning well they are numerous and the first problem is that it's the jupiter notebooks don't version control very well and a lot of data scientists use jupiter notebooks another challenge is the data versioning and data sharing is is difficult in in a machine learning context sorry in in a collaboration context because you can't very easily put your data and get there is a project called git LFS but it has some significant restrictions and and so what we find is that teams normally just don't bother the data versioning and they just rename files or rename folders and have folder names with like underscore final and underscore final final v2 and all these funny little strings that refers the things that have been done and then they have to share those folders around and it becomes quite messy the other thing is metric and parameter tracking you didn't have to do that when you're doing software but you do have to do it with machine learning you have to keep track of which parameters you used in which accuracy score you got because models aren't either sort of green or red they're not either working or broken they're kind of somewhere in the middle and the metrics like the accuracy score will tell you how how good a model is against a certain test set and so you need to keep track of that and you could put that in the in the get commit message but then you have to have human remembering to do it people are bad at remembering to do things or you can use tools that help you with experiment tracking and there are challenges with with using a combination of local development environments you might have a GPU or an IP you in the machine in your desk or you might be using machines in the cloud and with gifts it's quite easy to switch from a local machine to machine in the cloud or a machine in your datacenter but doing it effectively and with machine learning where you've also got data that you need to move around and you've got to keep track of metrics and parameters and maybe you can't even really use get with your dupes of notebooks and still do effective collaboration it makes it all a lot more challenging with respect kind of moving around where you're where you're doing the work and so there are some tools that that help to solve these problems for machine learning and it's actually a very exciting space and there's lots of new innovation happening around this and so obviously I'm from dot science I'd love it if you started using dot science but I also wanted to give a knowledge meant to the fact that there are lots of other tools out there and so ml flow is quite strong in the experiment track experiment tracking space weights and biases is very strong in like comparing relationships between metrics and hyper parameters dvc is a promising project in terms of doing data version control as is a project called pachyderm and then in terms of dipping and merging groups and notebooks there's an open-source project called MB dime in fact many of these many of these projects are open source and and so what we've tried to do with dot science is bring the capabilities from these kinds of capabilities into a single tool and so I'm going to attempt in the next five to ten minutes a very quick live demo of the collaborations tool that allows you to collaborate on Jupiter notebooks and do data versioning and sharing keep track of metrics and hyper parameters and to move around where you're doing your work and automatically synchronize the data and so on for you so what I what I started doing earlier was was kicking off these runs and so let's see we have a couple of different projects here and we've got a couple of different users as well so what we've got here on the left-hand side is this user called luke that's me and I have a collaborator over here with the yellow browser whose name is Fred um and Luke and Fred have been doing some work on a project and so you can see that the indoor science you can you can see the the work that they've been doing and you can compare the runs that they've been doing on this project and so you can see that sort of in the mists of time five days ago Luke did this run which used one epoch and the STD optimizer and got an accuracy score 58% and then Fred did this pretty awesome run that got 92% accuracy and the luke tried another combination of parameters and got an accuracy score of 84 percent and in this view this is kind of Fred's view of the world so Fred is operating on a fork of Luke's project so let's go and look at the the history of the runs in the project and so we can see here well back in the mists of time somebody converted a sign names file from CSV to JSON so it looks like this is a road science project so this classifier is is going to be used to classify road signs and so what we've what we built here is it shows the ability to keep track of the provenance of data so this is this is an example of a data run so we have this certain version of this code right in this certain version of the simple data and it wrote out a certain version of an intermediate data set for classes or JSON file from Simon's dot CSV you can see who did it you can see which docker image was used so that kind of pins down the environment and you can also go in and look at the files and so you can see it was exactly this Joosten notebook was used to to convert these sign names from CSVs to json and so what we've introduced is this notion of the dot science library and in the dot science Python library you can read in you can do things idea so I'll start which says this is a new run and I'm gonna do something here that is constitutes a separate run and execute some code that reads some input and creates an output basically and then I'm going to mark certain files as input files and other files as output files and then I'm going to DDS not publish and every time you run dear stuff publish what we do is we take a lightweight filesystem snapshot and that lightweight filesystem snapshot contains metadata about contains both the actual data that was used so it versions the data set that's there on your file system in your workspace and it also records the relationship between certain input files and output files and so we can see okay we converted sign names it looks like we also downloaded some data so we downloaded some data from the internet here we don't train this test flow model and then the tensor flow model has the most interesting provenance graph that we've seen so far which is that it reads in a certain version of the dataset and it writes out a model and this model the serialized model is the thing that can then be deployed into production and so from in here we can deploy a model into production and we've already got a few running we've got I know we don't have any of these models running actually but we have models that that are built and ready to go so from that those files on disk we can then build a model into into a doctor image and deploy it to the kubernetes cluster but I'm not going to show you the deployment in the monitoring stuff we can probably talk about deployment and monitoring you know in a later meetup if if people are interested and instead we wanted to talk about collaborations so I tried something earlier which was that so we've already had a pull request in this project previously which was there Fred said hey she'd get better at see if you if you try these these parameters and so you can see that the diff of exactly both the model files that changed when that pull request was made but also the the parameters that were changed and then you can also so let's let's see it looks like Luke and Fred have made conflicting changes to this model because since sort of the last common snapshot here the last common run they both made new runs which change the same thing so those the most recent run here uses the ADA max optimizer and got 84% and this one went back to using this GT optimizer for one epoch for whatever reason so if we go here this is just like the get approach which is that it or the github approach which is if you want to make the pull request you first have to get your your local branch up-to-date with respect to master and in this case we're going to think of Luke's version as master and so before Fred is able to upstream these changes he's going to have to go ahead and update the project from from master because the fork is two commits ahead and two commits behind and so let's see I just need to click this button to update from origin and it's going to copy a bunch of files into into my fork and so it's kind of copying Luke's latest version and at that point it will be able to show me the diff from from one version to the next exactly what's changed and it will also be able to provide a three-way merge algorithm that allows me to say oh I want to take this change but not that change and and so on so so that's that's what we're working on it seems like this is going to spin forever so so obviously we didn't praise the demo gods hard enough but you can kind of get the idea of things that we're doing and hopefully hopefully this hopefully this was useful so I'm going to open the floor up now to two questions and I think if we experiment with the zoom UI here we should be able to we should be able to invite everyone into the conversation to to ask questions with their voice if they want to and I see that we've also got some questions in the in the chat so I'm going to stop sharing my screen and yeah super look this is Marc here and now will lead had a couple questions in the chat both about ml ops and then also a question about dot signs if you want to take a look at those I think Luke the way it's set you have control to unmute everyone cool so I think I can't do that all in one go but I can invite I can promote everybody to panelists so I'm going to try and invite everyone to be a panelist and then if you want to kind of come into the room as it were and ask questions then feel free and then hopefully in the next ten minutes that I don't know about everyone else I can go a few minutes over then people are welcome to unmute but yeah Chris can Chris and Mark can you see the the more button let's see the attendees that says promote panelists I don't have that video okay you're just gonna go through and pick that button for everybody well look well maybe I should do that what lead looks like you've been kind of promoted as it were if you want to put your questions pull them out of chat and put them into the real world and I think glucan speak them that way you'll have to admit to if you're there there you go I see some awesome backgrounds Luke want to make start with the leads message there in the chat I think that question is how about ml operations provisioning infrastructure platform frameworks in a consistent repeatable way where does that fit in ml out yes I'm nearly done clicking the promote to panelist and as you know mark I'm bad at multitasking so I'm not even going to try okay cool so everyone's a panelist now so welcome to the room we sorry Willie says it's not sure if the mic is working so so I'm just going to go through these questions in the chat but feel free folks to unmute and get involved here so how about ml operations provisioning infrastructure platform frameworks in a consistent repeatable way where does that fit into Emma logs I think that's definitely part of the job of an Emma Watson at form is to enable the provisioning of infrastructure so so one of the things that we've done for example is we've made a terraform repo that you can use to spin up the science deployment and then from that all science deployment it then hooks into the underlying cloud provider to allow our data scientists to self-serve compute from the underlying cloud provider so they can provision a machine with the GPU if they need to for period of time while they do some training and then they can switch it back to a CPU and we've also made it so that those those machines shut themselves down quickly within within a couple of hours of becoming idle to save on costs and then a related point is that it's kind of a multi cloud point is that it needs to be possible to to attach compute from multiple clouds and also from on-prem and so I definitely see it as part of the responsibility of an ml ops platform to to include that that component basically the necessaries of storage and network virtualization that you need to be able to to attach compute in a flexible way from from all over the place from a machine sitting onto your desk or or a machine in the cloud I hope that answered your question I lead so there's also a comment here if I understand correctly does science as a multi-tenant multi project the teams in mind and has some sort of into in data science in mind yes that's correct Eduardo is your mic working you wanna ask your question on voice Hey hello hey how's a guy I can hear you yeah let me try to see can you see me no yes greetings hey hey I'm convert light soul every question in fact I share in the shot but so I found it very interesting you use the word of the term collaboration tool why why you choose that term because I you as I understand you are trying to achieve or at least to define and why when I think collaboration I think communication hmm so yeah acting like a communication flow yeah definitely and I think that comes down to what we're basically trying to do is mimic github but make it work for machine learning and that means you have to keep track of data and you need to keep track of models and you need to keep track of metrics as well as keeping track of versions of code and it's by packaging them together as runs is what enables that to happen but in terms of the communication yes it's about not having to jump on a slack call or a zoom call and interrupt someone if you want to collaborate with them instead it should be possible to do things I synchronously in the same way that the software engineers are familiar with as doing pull requests and working on branches and then you can have your with your pull requests reviewed asynchronously by someone who's awake when you're asleep if you're working in different time zones mm-hm so cool yeah any other questions or comments from the from the group here yeah we might finish on time then cool so yeah I just wanted to to wrap up then and say thank you so much for coming I think we're planning on running these every week we I think we have our next speaker for next Wednesday confirmed already he is a guy called Charles Radcliffe who is formerly the head of AI for fidelity international so he'll bring sort of an enterprise perspective fidelity as I'm sure you're aware is a pretty big organization in the financial world and fidelity international in London is this way he's hailing from and he's going to talk about best-in-class AI governance in financial services and yeah if folks have other suggestions for talks please shout out in the slack channel or post message on the forum and yeah thank you all very much for coming to the first ever ml ops community Meetup thanks so much appreciate it thanks again everybody out there for joining the conversation and I hope to see you next week awesome and yeah we're going to follow up by the way with an email with a link to the presentation in it and yeah if you're not already on our slack Channel please come and hang out on slack awesome thank you everyone
Original Description
The 1st MLOps.community meetup on 3.18.2020 featuring Luke Marsden from Dotscience.
What is MLOps and how can it help me work remotely? The first episode of our weekly MLOps community virtual meetup with CEO and founder of the MLOps platform dotscience Luke Marsden talk to us about the current state of Machine Learning, what some of the main difficulties are at this stage when developing models, how the machine learning lifecycle differs from traditional software development and a deep dive of collaboration for data science teams in a fully remote world.
MLOps is the intersection of three disciplines: software engineering, DevOps and machine learning. MLOps refers to the entire end-to-end lifecycle of getting models from lab to live where they can start delivering value.
What do software engineers and DevOps need to learn about machine learning to ensure that it can be integrated into their dev & deployment pipelines? What do data scientists and ML engineers need to learn about DevOps, model deployment and monitoring to ensure they can effectively deploy their work without racking up tonnes of technical debt? And now that working from home is fast becoming the new normal, how can MLOps help my team stay efficient when asynchronous collaboration is needed, something our software engineering and DevOps friends have already mastered?
MLOps is a complex discipline due to the many more moving parts involved than regular software DevOps, in this inaugural MLOps.community meetup we'll explore and navigate this new space together and give you a guide on how to avoid the most common pitfalls and challenges getting AI into production and collaborating effectively with your team – even when you're distributed.
--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, Feature Store, Machine Learning M
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from MLOps.community · MLOps.community · 1 of 60
← Previous
Next →
▶
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Our 1st MLOps Meetup // Luke Marsden // MLOps Meetup #1
MLOps.community
Remote Collaboration as a Data Scientist
MLOps.community
MLOps Manifesto with Luke Marsden from Dotscience
MLOps.community
MLOps lifecycle description
MLOps.community
What Does Best in Class AI/ML Governance Look Like in Fin Services? // Charles Radclyffe // MLOps #2
MLOps.community
Life purpose and too many spreadsheets
MLOps.community
Explainability, Black boxes and EU white paper on reproducibility
MLOps.community
Hierarchy of Machine Learning Needs // Phil Winder // MLOps Meetup #3
MLOps.community
Automatically Retrain Machine Learning Models? Are best practices worth it?
MLOps.community
Building an MLOps Team? Key ideas to keep in mind
MLOps.community
Hierarchy of MLOps Needs
MLOps.community
Bare necessities for getting an ML model into production
MLOps.community
MLOps and Monitoring
MLOps.community
How Phil Winder got into Data Science and Software Engineering
MLOps.community
Provenance and Reproducibility in Machine Learning; what is it and why you need it?
MLOps.community
Friction Between Data Scientists and Software Engineers
MLOps.community
MLOps Problems in different size companies
MLOps.community
ML tooling in large companies
MLOps.community
ML Platforms - The build vs buy question
MLOps.community
ML Services Gateway at SurveyMonkey
MLOps.community
Message buses, Async and sync architecture
MLOps.community
MLOps #4: Shubhi Jain - Building an ML Platform @SurveyMonkey
MLOps.community
Hybrid Data Science Teams @SurveyMonkey
MLOps.community
How do you handle ML version control at SurveyMonkey
MLOps.community
Doing ML with Personal Information
MLOps.community
Evolution of the ML feature store @SurveyMonkey
MLOps.community
Developing a Machine Learning Feature Store
MLOps.community
Auto retrain ML models is not the question
MLOps.community
3 key parts to Machine Learning monitoring
MLOps.community
MLOps Meetup #6: Mid-Scale Production Feature Engineering with Dr. Venkata Pingali
MLOps.community
MLOps meetup #5 High Stakes ML: Active Failures, Latent Factors with Flavio Clesio
MLOps.community
MLOps: Airflow Pros and Cons
MLOps.community
Specific challenges in Machine Learning
MLOps.community
Current State Of Machine Learning
MLOps.community
Humans in the Loop are a defining factor in Machine Learning
MLOps.community
Learning from real life Machine Learning failures
MLOps.community
Survivorship Bias in machine learning tutorials
MLOps.community
Swiss Cheese model in Machine Learning
MLOps.community
Resume driven development in Machine learning & software engineering
MLOps.community
Who has the highest standards in ML?
MLOps.community
Venkata Pingali of Scribble Data Thoughts on the Current State of Machine Learning
MLOps.community
Dependable data and being able to Trust in your Data with Venkata Pengali of Scribble Data
MLOps.community
Speed, Trust, Evolution and Scale in MLOps
MLOps.community
More difficult transition for data scientists to become ML engineers
MLOps.community
How many models in prod til I need a dedicated ML platform?
MLOps.community
Deeper thinking from data scientists around platform blackholes
MLOps.community
Checkpointing, metadata, and confidence in your data
MLOps.community
Adjacent usecases and multistep feature engineering
MLOps.community
Standardization of Machine Learning tools like in Software Engineering with Venkata Pingali
MLOps.community
Reproducability flaws in end to end Machine Learning debugging
MLOps.community
3rd wave of data scientists
MLOps.community
MLOps meetup #7 Alex Spanos // TrueLayer 's MLOps Pipeline
MLOps.community
MLOps Meetup #8 Optimizing Your ML Workflow with Kubeflow 1.0
MLOps.community
Are Kubeflow and Airflow complementary?
MLOps.community
Why Kubeflow gained so much traction=open community
MLOps.community
Who decides the dirrection of Kubeflow
MLOps.community
What do Kubeflow and Arrikto do and how do they work together?
MLOps.community
Versioning your ML steps with Kubeflow
MLOps.community
Machine Learning Lifecycles//Perception vs Reality
MLOps.community
Kubeflow vs SageMaker in Machine Learning
MLOps.community
More on: ML Pipelines
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
DevOps Took 10 Years to Mature.
Medium · DevOps
Praesto: A Kubernetes Operator for Node-Local ML Model Caching with CSI
Medium · DevOps
Beyond `ollama run`: Production-Ready DeepSeek R1 Deployment with vLLM and Nginx
Dev.to · Shannon Dias
MCP Health Check: Building Production Monitoring for Your MCP Server — What I Learned After 84 Production Outages
Dev.to AI
🎓
Tutor Explanation
DeepCamp AI