Testing and Deployment of Deep Learning Models with Josh Tobin (2019)

Weights & Biases · Beginner ·📐 ML Fundamentals ·6y ago


Key Takeaways

The video discusses testing and deployment of deep learning models, covering topics such as continuous integration, containerization, and model monitoring, with tools like Docker, TensorFlow, and TF Serving.

Full Transcript

Just a quick overview of what to expect from today. First, I'm going to do a very, very quick recap of the lecture that you all watched from Sergey on testing and deployment. It's going to go by very fast, so if you did not do your homework, I'm not sure this will bring you back up to speed, but that's the intention. Then I want to check in briefly on projects and see how everyone is feeling about where they're at. We have a week left in the class, so I want to get a sense of how close to being done everyone feels. And then we have two amazing guest speakers to conclude this part of the class. Are there any questions before I get started?

Great. So, just to briefly review the lecture that you watched from Sergey: he started by covering some concepts of testing and deployment, namely how to think about the entire structure of your machine learning project and how different tests fit into that. Then he covered the idea of the ML Test Score, which is a rubric from Google for how to think about how production-ready your machine learning codebase is. Then he talked about some of the infrastructure and tooling around this part of machine learning projects: continuous integration and testing, Docker in some depth, some ideas for deploying to the web, monitoring production prediction systems once they've been deployed, and a little bit on how to think about deploying not to the web but to hardware or to mobile.

I'll talk through a couple of the key slides from the talk. The first thing I really like is this overview of machine learning systems. You have your training system, and your training system is combined with your training and validation data to create a prediction system, which is then served into production. The key concept here is the different types of tests that you
might have for the different stages of this codebase. You have tests on your training system: for example, if you push some update to your code, is it breaking your ability to achieve a certain score on your training set? These are longer tests that take a long time to run, maybe on the order of a day. Then on your prediction system you have validation tests, which test for regressions in the model itself: if you push an update to your model, you want to make sure that it's still performing as well as it did before. There are also functionality tests, which are quicker tests that make sure you perform well on really important examples or edge cases. Finally, once you've deployed the system into production, you want to monitor it: make sure that it doesn't go down, that you don't have data shifts, and that you don't have more errors than you're expecting.

This is a slide that covers some of what was talked about in the ML Test Score. It lists some of the different types of tests that you might want to have for your dataset, your model, your infrastructure, and then monitoring in production. I won't go through all of these now, but they're things to think about as you're writing tests for your codebase for the project.

Then, diving into testing and continuous integration, there are a few concepts here. There are unit and integration tests: testing individual parts of your codebase to make sure they continue to function when you change your code, and possibly testing your entire system. Then there's the concept of continuous integration. All this means is that every time you push new code to your repo (say, before you deploy a new model into production, or in some organizations even before you merge that code into master), you want to run some
tests to make sure that the code has not broken what you were able to do before. There are a bunch of software-as-a-service tools for continuous integration. I think none of these are specific to machine learning, but they're some of the tools that you might experiment with. Another core idea that Sergey talked about was containerization, which is a way of managing the dependencies of your code when you run it in a continuous integration or deployment setting. So that's what was covered in testing.

For deployment, there are a few concepts. The first is REST, which is a general API style for HTTP systems. One way to think about deploying machine learning systems is to treat them as a black box that's called by a web server. You have a bunch of different options for deploying machine learning code into production. You can put the code onto a virtual machine or into something like a Docker container, and then scale up to more users by adding instances to your system or via orchestration. The last concept that Sergey talked about, which he really likes and which I think is really exciting, is serverless functions, where you don't have to manage your own infrastructure at all.

The takeaways here were: if you're doing inference on a CPU, you can get away with scaling up by launching more and more servers, or by going serverless, and you don't really have to do anything too crazy. Sergey's dream, which I think would be really cool, is deploying Docker as easily as deploying Lambda. But the next best thing is either using Lambda and dealing with the fact that the form you have to get your model into is much trickier, or using Docker. Which side of the trade-off you land on depends on what you need for your model and what your priorities are. If
you're doing GPU inference, then this becomes trickier, and there are more specialized tools like TF Serving that you should look into.

All right, so that's deploying on the web. What about deploying to hardware? The core challenge here is that your cell phone does not have the same amount of processing power that you can get on a server, so you often have to use a bunch of tricks, and Sergey talked about some of them, to reduce the size of your network and maybe quantize the weights. Another challenge is that the frameworks people use on mobile are less full-featured, so you might need to choose your model architecture specifically to be one that can run on mobile. There are a few options for doing this. In TensorFlow there's TensorFlow Lite and TensorFlow Mobile, and there are also a few that are more specific to different hardware platforms: Apple has a platform, Google has a platform, and then there's this Fritz option that claims to work well with both.

Okay, great. That was the lightning five-minute overview of Sergey's 90-minute lecture. I'm curious, were there any questions about the lecture concepts he covered that you would like to talk about? Yeah.

[Audience question, partly inaudible, about the benefit of putting each component in its own Docker container rather than containerizing the whole thing.]

Yeah, I think if you have different components of a larger system that need to interact with each other, then it can be helpful to isolate each of them so they have a very small surface area for their API, and you know what to expect from the different components of the system interacting with each other. It could make it easier to test, and it could make it easier for multiple people on a larger team to work on different components together and have a spec that they need to meet for each other. Other questions? Yeah.
[Audience question, partly inaudible, about which part tends to be overlooked.]

Yeah. I think for us at OpenAI, I would say, and I'm curious what Peter thinks about this as well, I feel like training is the one that was overlooked for a long time. We would really push one part of the codebase forward, and then we would go back to some model that we got working months ago and find out that we could no longer train that model. So I think that's one thing that's really easy to overlook. Other questions on this?

[Audience question, partly inaudible, asking whether this means pushing another model forward and then trying to go back to an earlier one.]

Yeah. So say you have two or three tasks that you're working on as a team, and you have a monorepo that you're using for all of them. You solve the first two, your model works really great, and you have some weights for those models that you're happy with. Then most of the team goes and works on the third component. One challenge there is that they could push on that third component for two or three months and make some breaking changes to the training part of the pipeline for the first two components without really realizing it. Even if you're deploying those first two components into production, you still might not notice, because the pre-trained model might still be working, but your ability to actually get the loss down on that model might have disappeared.

And I think, more generally, outside of the OpenAI context, something that's less of a problem for us but that a lot of people complain about when I talk to them is production monitoring. It's really, really easy to have data drift, or to have some sort of weird input go into your pipeline that you weren't accounting
for, and to just break things without realizing that they're breaking. So I think a lot of teams put a lot of effort into figuring out how to do that really well. Yeah.

[Audience question, partly inaudible: if the model was trained on one type of images and users are sending images from another source, what sort of metrics can be used to detect that?]

Yeah, I think Sergey talked about a couple of them in the lecture, and this is something that I have done less of personally, but I think you just want to look at stuff like the statistics of the inputs and outputs. So, just the values of the pixels, right? You could imagine doing trickier things too. You could look at the output distributions of your model: maybe most of the time your model produces pretty confident predictions, and if the confidence of your model starts degrading, that might be an early warning that the data distribution is shifting. But I think input and output statistics, so over the last n images, what have been the average values of things, and over the last n classifications, what's the distribution of classes that we've seen, that seems to be the most common thing that people do, from talking to people.

[Audience follow-up question, inaudible.]

Yeah, I think so. If you have a character-level model, let's say, then you could just look at the distribution of characters, and if that shifts pretty wildly, then you know that maybe you're getting a weird type of input into your model. You should just make sure that the distribution isn't vastly different from what you trained on. Yeah.

[Audience question, partly inaudible, about whether it is best practice to keep the weights somewhere outside of the function when deploying to Lambda.]

Yeah, and someone else who has done more of this should chime in, but I
think you definitely want to have your weights inside of Lambda; otherwise you're going to have to call out to something else, and it's going to be really slow. So there's one bag of tricks, which is all these model compression and quantization type things: how to actually reduce the number of parameters and the size of the parameter matrices in your model. Aside from that, it's just about minimizing the dependencies that you have. Any comments on that? Any other tips from people who have deployed to Lambda?

[Music] [Audience comment, partly inaudible, about getting rid of features that are not needed.]

Yeah, you can use a more stripped-down version of TensorFlow. I think that's a really good thing to do as well. Yeah.

[Music] [Audience discussion, largely inaudible.] [Music]

So you should look into some of the tools that Sergey talked about around deploying. Let me see which slide this was. Yeah, so things like TF Serving and Clipper, if you need to deploy a model where you need to do inference on GPU. Is that getting at your question?

[Audience question, partly inaudible, about whether there is some place where you can press a button and have your model automatically deployed.]

Yeah, it's a great question. It seems like something like that should exist. I don't know. I'm curious, actually: Lukas, have you seen anything like that?

[Response, largely inaudible] ... every company I've worked with has a horror story [inaudible]. You might not realize this, maybe you haven't done it, but you really do end up with a lot of models [inaudible], and before you know it you actually have a [inaudible] critical model and
it's really where you end up, right? And then anybody changes one of those, and you can mess up the [inaudible]. This was years ago, [inaudible] Yahoo search [inaudible]. There was one point where [inaudible] a relevance model [inaudible], and it was in a country where the language was rare enough that there wasn't a lot of QA [inaudible].

Yeah. Another thing that people have talked a lot about that seems to really help here is: even if you are pretty confident that your model is going to work well, don't just deploy it into production to start serving predictions right away. What a lot of companies do is deploy it alongside the existing model that they know works okay, and then keep track of all that stuff for a while, or maybe even serve predictions to 1% of their customers, or 10%, or something like that. Only over time, as you build more confidence that the model is going to work in the real world, do you make it your main production model.

[Audience comment, partly inaudible, mentioning conforming to the TF Estimator API.]

Cool. Any other thoughts on testing and deployment? How many of you feel like you're going to be able to actually write some unit tests or regression tests around your codebase for your project before next week? One, two. That's pretty good.

[Audience question, inaudible.] What's that? I don't know, we have a lot. What do you think, Peter?

We just started measuring our coverage a few weeks back. We're at about 70%, so on the order of hundreds at this point, probably.

Yeah. The other thing I'll say about this is that I think
when I'm writing research code, I usually start by not writing any tests. But I think one mistake people make, when transitioning from research code to code that's going to be used by a lot of people or deployed into production, is waiting too long to start writing tests. The longer the time between when you introduce a bug and when it gets caught, the harder it's going to be to go back and find where that bug was and fix it. This is especially true if there are multiple people working on the codebase.

[Audience question, inaudible.]

Yeah, both, actually. In some cases we would initialize models with random weights and test them in various ways. We also load existing models and test those. I think it used to be the case that our unit tests were really integration tests pretending to be unit tests, and they would take a really long time. For example, they would actually train a little bit, which is really not a good idea, because it adds up: if you have hundreds of unit tests, then you have to wait an hour for all of them to run. So you have to be pretty careful about that. But ultimately, we're now much more in a state of testing modules and testing that the networks are behaving in the way we expect.

Great. Okay, the next thing I want to do is briefly check in on projects. We have a week left, and the intention is for everyone to present their projects next week: what you tried to do, where you're at, and what some of the challenges were. So I just want to get a quick sense: how many of you feel like you have a model for your project that's producing reasonably good results, so that there will be a nice graph or
some nice outputs to talk about next week? Okay, a relatively small number. For those of you who are not quite at that stage, what's the main blocker? Is it that you have a model training and it's just not good enough, or is it that you're still not able to get data?

[Audience response: training, and it's not good enough.] Yeah. [Audience response: data quality wasn't good enough.] Yep, that can be a big problem. Other big challenges that people have run into?

[Audience response, largely inaudible, apparently about deployment and having to trim the model down.]

Okay, so the model is working pretty well, but deployment is also a challenge. Yeah. Oh, interesting. Okay, and this is related to my question. Other big problems that people ran into? Yeah.

[Audience response, largely inaudible, apparently about difficulty getting an existing codebase to work and ending up starting from scratch.]

Cool. Yeah, so working with existing codebases can be really challenging because, as much as we wish this were not true, a lot of times machine learning researchers do not write the best, most extensible code. How many people actually found bugs in existing codebases that you tried to use? Okay. How many people used an existing codebase and did not find a bug? Zero. All right. That's been my experience as well. It's great to be able to start with an existing codebase, but I almost always end up reimplementing it from scratch myself, just because I generally don't trust random machine learning code that I find on the Internet.

Cool. Okay, so I think the right way for us to do this next week is: we'll have a more detailed
suggested template for your presentations. But the attitude I want us to have is: talk about the progress that you made and how well your thing is working, but be really upfront about what the big challenges were. Try to spend a little bit of time thinking about what you would have done differently if you were starting this project over from scratch six or seven weeks ago, now that you know what challenges you faced: choosing a different project, a different dataset, a different model. I think that will be really helpful for people to learn from your experience of what ended up being difficult in your projects.

Great. And Stephy, did you want to say anything else about the project presentations next week?

[Music] I just have one more thing to add to that. If you plan on projecting your presentation or using the projector, it's easiest if you're presenting from a Mac, just because we use AirPlay to pipe into these monitors. If that is an issue, just let one of us know and we can try to plan some other way of projecting those nice slides.

[Partly inaudible exchange about some of the groups having a live demo and about setup for the speakers.]
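
The rollout advice from the discussion (run the new model alongside the one you trust, and serve its predictions to only 1% or 10% of customers at first) can be sketched in a few lines. This is a minimal illustration and not something shown in the lecture: the function names `in_canary` and `serve`, the hashing scheme, and the in-memory `shadow_log` list are all assumptions made for the example, and the models are stand-in callables.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user is in the canary group.

    Hashing the user id keeps the assignment stable across requests, so a
    given user keeps seeing the same model while `percent` is unchanged.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < percent


def serve(user_id, features, current_model, candidate_model, percent, shadow_log):
    """Answer one request while collecting evidence about the candidate model."""
    current_pred = current_model(features)
    candidate_pred = candidate_model(features)
    # Shadow mode: always record both predictions for later comparison.
    shadow_log.append((user_id, current_pred, candidate_pred))
    # Canary: only the chosen share of users receives the candidate's answer.
    return candidate_pred if in_canary(user_id, percent) else current_pred
```

Starting with `percent=0` gives pure shadow mode (every user still gets the existing model's answer), and raising it gradually corresponds to the "only over time, as you build more confidence" step.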

Original Description

In this lecture, Josh Tobin of OpenAI recaps Sergey Karayev's lecture on Testing and Deployment in Machine Learning, which can be found here: https://www.youtube.com/watch?v=JTSwQu0OyGs

This lecture was part of the Applied Deep Learning Fellowship held at the Weights & Biases headquarters in the spring of 2019.

For more tutorials: https://www.wandb.com/classes
To learn more about Weights & Biases: https://www.wandb.com/
http://josh-tobin.com/


This video teaches the importance of testing and deployment in machine learning, covering topics such as continuous integration, containerization, and model monitoring, with practical steps for deploying models using Docker and TensorFlow.

Key Takeaways

Run tests on code changes before deployment
Use continuous integration and testing to ensure code quality
Deploy models using Docker
Monitor model performance and update models regularly
Use input and output statistics to detect changes in data distribution
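
The last takeaway can be made concrete with a small sketch. The transcript describes tracking, over the last n requests, the average input values and the distribution of predicted classes, and comparing them with what the model saw in training. The code below is one possible way to do that using only the standard library; the class name `DriftMonitor`, the thresholds `mean_tol` and `dist_tol`, and the use of total variation distance are assumptions for illustration, not tools named in the lecture.

```python
from collections import Counter, deque


def class_distribution(labels):
    """Return the fraction of predictions that fell into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


class DriftMonitor:
    """Compare statistics of the last `window` requests with training-time values."""

    def __init__(self, train_mean, train_class_dist, window=1000,
                 mean_tol=0.1, dist_tol=0.2):
        self.train_mean = train_mean
        self.train_class_dist = train_class_dist
        self.mean_tol = mean_tol    # hypothetical threshold, tune per project
        self.dist_tol = dist_tol    # hypothetical threshold, tune per project
        self.input_means = deque(maxlen=window)
        self.predictions = deque(maxlen=window)

    def record(self, pixels, predicted_class):
        """Store the mean input value and the predicted class for one request."""
        self.input_means.append(sum(pixels) / len(pixels))
        self.predictions.append(predicted_class)

    def alerts(self):
        """Return the names of the statistics that moved past their tolerance."""
        found = []
        if not self.input_means:
            return found
        recent_mean = sum(self.input_means) / len(self.input_means)
        if abs(recent_mean - self.train_mean) > self.mean_tol:
            found.append("input_mean_shift")
        recent_dist = class_distribution(self.predictions)
        if total_variation(recent_dist, self.train_class_dist) > self.dist_tol:
            found.append("class_distribution_shift")
        return found
```

For a character-level model, the same `class_distribution` and `total_variation` helpers could be applied to input characters instead of predicted classes.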

💡 A drop in model confidence can be an early warning that the data distribution is shifting, so monitoring models in production is crucial for catching failures early.
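
As a rough sketch of that early-warning idea, the snippet below keeps a rolling window of each prediction's top probability and flags when the average falls well below a baseline measured on held-out data. The class name `ConfidenceMonitor`, the `baseline` value, and the `max_drop` threshold are hypothetical choices for illustration; the lecture does not prescribe a specific method.

```python
from collections import deque


class ConfidenceMonitor:
    """Flag when average top-class confidence drops below a baseline."""

    def __init__(self, baseline, window=500, max_drop=0.1):
        self.baseline = baseline    # mean confidence measured on validation data
        self.max_drop = max_drop    # hypothetical tolerance before alerting
        self.recent = deque(maxlen=window)

    def record(self, probabilities):
        """Store the highest class probability of one prediction."""
        self.recent.append(max(probabilities))

    def degraded(self):
        """Return True once a full window averages well below the baseline."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to judge
        mean_confidence = sum(self.recent) / len(self.recent)
        return self.baseline - mean_confidence > self.max_drop
```

A flag from this check is a prompt to inspect recent inputs, not proof of a shift: confidence can also fall for benign reasons, and a model can stay confident while being wrong.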
