I deployed a recommendation model. Testing Models In Production using Interleaving Experiments.

Underfitted · Beginner · 📐 ML Fundamentals · 2y ago

Key Takeaways

The video discusses deploying machine learning models, specifically a recommendation model, and testing new versions using interleaving experiments to determine if the new model is better than the previous one.

Full Transcript

Deploying machine learning models is an extremely important topic, obviously. Over the last few weeks I've been working with a company, helping them deploy some machine learning models. Specifically, they have a recommendation model that provides certain recommendations as its output. The problem they're trying to solve is not how to deploy that model; they know how to do that. It's how they can test that a new version of the model is actually better than the previous version. This is where I come in, and this is what I'm helping them do.

So today I want to show you the technique that I implemented, and hopefully that will give you some ideas for your project. It's very sophisticated, I think, or at least it's very cool.

Remember, the goal here is: how do we know that a new version of a model is better than the existing version? When you talk to people, they're going to tell you, "Well, you just evaluate the model and compare the performance." Yes, but the problem with a recommendation system, with a model like this, is that you cannot just explore how good the model is in a vacuum. You need user feedback, and that is the main component that is going to tell you whether your model works or doesn't work. If you provide what look like good recommendations but people don't click on them, then your model doesn't work.

So imagine that what you're trying to do is recommend additional products based on the purchase history of a user. The true test of whether those recommendations are good is whether or not users buy those products. They may look amazing on paper, but if users don't care, then your recommendations are bad. That is the main challenge here.

So let me show you a diagram of the technique that I've been working with this company on implementing. It's called an interleaving experiment. Here's the idea, just so you can follow the diagram: we have a client, a web client, that is going to send a request, and the request is going to be "give me recommendations for user ABC." Then we have a prediction service. Think of it as the API endpoint that the client application connects to. When the prediction service has only a single model, it uses that model to generate five recommendations and sends them back to the user.

What we're adding now is two models. Instead of sending the request to one model, the prediction service is going to send the same request to two different models. I'm identifying those models as the legacy model, which is the current version that's deployed (some people like to call this the champion model; it's the model that's currently running), and the candidate model, which is this second model here in blue with the dotted line. The candidate model is the model that we want to test: is it better than the legacy model? That is what we want to test right here.

So the prediction service is going to send 100% of requests to both models and ask both of them to generate recommendations. Before we had a candidate model, all of the recommendations were coming from the legacy model. The legacy model generated these three pink recommendations, and we were just sending those recommendations back to the client. Now we're going to be generating recommendations using both models and interleaving those recommendations in the response. The client will not see recommendations from only the legacy model or only the candidate model; it will see recommendations from both models at the same time. We interleave maybe one recommendation from the legacy model, one from the candidate model, one from the legacy model, one from the candidate model, and then one from the legacy model to complete the five recommendations.

This gives us a couple of good things. Number one, we are hedging here: if the candidate model is horrible, we are not destroying our application. Imagine the candidate model generates good-looking recommendations that people don't care about. If we just swap the legacy model for the candidate model, well, after a month all of our purchases are going to go down the drain because the recommendations are really, really bad. We don't want to do that. Instead, we're going to be hedging and monitoring over time how good the candidate recommendations are, and we will only switch to the candidate model when we are certain that those recommendations are really good.

In this particular case, let's say the user sent a request and we sent back five different recommendations. We can track what users do with those recommendations: are they buying the products that were recommended by the candidate model, or have they not bought anything from those recommendations? Obviously it depends on how much traffic your site gets. Assuming you get decent traffic, you may need to run this for a couple of weeks. That's our case: we run it for two weeks. After two weeks we have collected enough information to aggregate all of the purchases and determine whether the candidate model is on par with or better than the legacy model. If that is the case, then we switch: 100% of the traffic goes to the candidate model, which becomes the champion at that point, and we will have to build a new contender later on to run the same thing when we have a new version. If the candidate model is not working well, if people are not clicking on those recommendations, then we can just discard that model, improve it, and come back with a new version later on.

Something else that's also really important: whenever you're presenting a list of recommendations, people will tend to favor recommendations at the top. If I give you my top five, I don't know, air conditioner units, people are going to check number one first. They're going to tend to favor number one, even if you say "in no particular order." It doesn't matter; number one is going to get favored. You need to keep that in mind when you are trying to compare the legacy model with the candidate model.

There are multiple ways to go about this. One technique is to randomize those recommendations. Maybe you trust the candidate model a little bit more and you are comfortable randomizing the order of recommendations, so that would be one way of doing it. The second way would be weighting recommendations based on their position in the list. You know that anything at the top will get more clicks, so you will not just favor the model that you always use to display the top recommendation, because that wouldn't make sense.

So hopefully that makes sense. Again, this is not necessarily very sophisticated, but it's a very cool way to test anything that requires user feedback. It's a good way to test a model with real production data without having to put your entire system at risk by deploying a model that is not good enough. You can test this little by little: you can increase the number of recommendations that come from the candidate model until you build the confidence to deploy that candidate model and make it your champion model. So hopefully this helps, and I'll see you later with more tips.
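
The interleaving step described in the transcript can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the video: the function name `interleave`, the product IDs, and the duplicate-skipping rule are all hypothetical. The coin flip that decides which model fills the first slot is one possible way to address the position bias mentioned above; the video's diagram starts with the legacy model.

```python
import random
from typing import Dict, List, Optional, Tuple


def interleave(
    legacy: List[str],
    candidate: List[str],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Alternate picks from two ranked lists until k items are chosen.

    Returns the merged list and a map from each item to the model credited
    with it, so later purchases can be attributed to a model.
    """
    rng = rng or random.Random()
    # Assumption: flip a coin for who fills slot 1, so neither model always
    # owns the position users favor most.
    order = ["legacy", "candidate"]
    if rng.random() < 0.5:
        order.reverse()
    queues = {"legacy": list(legacy), "candidate": list(candidate)}

    merged: List[str] = []
    source: Dict[str, str] = {}
    turn = 0
    while len(merged) < k and (queues["legacy"] or queues["candidate"]):
        model = order[turn % 2]
        turn += 1
        queue = queues[model]
        # Skip items the other model has already placed in the response.
        while queue and queue[0] in source:
            queue.pop(0)
        if queue:
            item = queue.pop(0)
            merged.append(item)
            source[item] = model
    return merged, source


# Hypothetical ranked outputs of the two models for one user.
legacy_recs = ["p1", "p2", "p3", "p4", "p5"]
candidate_recs = ["p9", "p2", "p7", "p8", "p6"]
merged, source = interleave(legacy_recs, candidate_recs, k=5, rng=random.Random(0))
```

The prediction service would return `merged` to the client and store `source` alongside the request, so that a later purchase of a recommended product can be credited to the model that supplied it.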

Original Description

I teach a live, interactive program that'll help you build production-ready Machine Learning systems from the ground up. Check it out here: https://www.ml.school

To keep up with my content:
• Twitter/X: https://www.twitter.com/svpino
• LinkedIn: https://www.linkedin.com/in/svpino

🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1

The video teaches how to deploy and test machine learning models, specifically recommendation models, using interleaving experiments to determine if a new model is better than the previous one. This technique allows for testing models with production data without putting the entire system at risk.

Key Takeaways
  1. Deploy a recommendation model
  2. Create a candidate model to test against the champion model
  3. Implement interleaving experiments to compare model performance
  4. Track user feedback and purchases
  5. Evaluate model performance and switch to the candidate model if it is better
💡 Interleaving experiments allow for testing models with production data without putting the entire system at risk by deploying a model that is not good enough.
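
Steps 4 and 5 above can be sketched as a small aggregation over logged impressions. This is a minimal sketch under assumptions, not the method from the video: the `Impression` record, the `EXAMINE_PROB` values, and the "on par or better" comparison are hypothetical. The position weighting follows the idea from the transcript that top slots get more attention, so a purchase from a lower slot is given more credit.

```python
from typing import Dict, Iterable, NamedTuple


class Impression(NamedTuple):
    model: str       # "legacy" or "candidate": which model supplied the item
    position: int    # 1-based slot in the list shown to the user
    purchased: bool  # whether the user bought the recommended product


# Hypothetical chance that a user looks at each slot. Real values would have
# to be estimated from your own traffic.
EXAMINE_PROB = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.5, 5: 0.4}


def weighted_purchase_rate(impressions: Iterable[Impression]) -> Dict[str, float]:
    """Position-weighted purchases per impression, for each model."""
    credit: Dict[str, float] = {}
    shown: Dict[str, int] = {}
    for imp in impressions:
        shown[imp.model] = shown.get(imp.model, 0) + 1
        if imp.purchased:
            # A purchase from slot 5 counts more than one from slot 1,
            # because fewer users look that far down the list.
            weight = 1.0 / EXAMINE_PROB[imp.position]
            credit[imp.model] = credit.get(imp.model, 0.0) + weight
    return {m: credit.get(m, 0.0) / n for m, n in shown.items()}


# Made-up log for a single response of five interleaved recommendations.
log = [
    Impression("legacy", 1, True),
    Impression("candidate", 2, False),
    Impression("legacy", 3, False),
    Impression("candidate", 4, True),
    Impression("legacy", 5, False),
]
rates = weighted_purchase_rate(log)
# Assumed decision rule: promote when the candidate is on par or better.
promote = rates["candidate"] >= rates["legacy"]
```

A single response is far too little data to decide anything. In practice the same aggregation would run over the whole experiment window (two weeks in the video), and a plain comparison like this would likely be paired with a check that the difference is not just noise.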
