I deployed a recommendation model. Testing Models In Production using Interleaving Experiments.
Key Takeaways
The video discusses deploying machine learning models, specifically a recommendation model, and testing new versions using interleaving experiments to determine if the new model is better than the previous one.
Full Transcript
Deploying machine learning models is an extremely important topic, obviously. Over the last few weeks I've been working with a company, helping them deploy some machine learning models. Specifically, they have a recommendation model that provides certain recommendations as its output. The problem they're trying to solve is not how to deploy that model; they know how to do that. It's how they can test that a new version of the model is actually better than the previous version. This is where I come in, and this is what I'm helping them do. So today I want to show you the technique that I implemented, and hopefully that will give you some ideas for your project. It's very sophisticated, I think, or at least it's very cool.
Remember, the goal here is: how do we know that a new version of a model is better than the existing version? When you talk to people, they're going to tell you, "Well, you just evaluate the model and compare the performance." Yes, but the problem with a recommendation system, with a model like this, is that you cannot just explore how good the model is in a vacuum. You need user feedback, and that is the main component that is going to tell you whether your model works or doesn't work. If you provide what look like good recommendations but people don't click on them, then your model doesn't work. So imagine that what you're trying to do is recommend additional products based on the purchase history of a user. The true test of whether those recommendations are good is whether or not users buy the recommended products. They may look amazing on paper, but if users don't care, then your recommendations are bad. That is the main challenge here.
So let me show you a diagram of the technique that I've been working with this company on implementing. It's called an interleaving experiment. Here the idea, just so you follow the diagram, is that we have a client that is going to send a request. This is a web client, and the request is going to be "give me recommendations for user ABC." Then we have a prediction service. Think of this prediction service as the API endpoint that the client application connects to. When the prediction service has only a single model, it's going to use that model to generate five recommendations and send them back to the user.
What we're adding now is two models. So the prediction service, instead of sending the request to one model, is now going to send the same request to two different models. I'm identifying those models as the legacy model, which is the current version that's deployed (some people like to call this the champion model; it's the model that's currently running), and the candidate model, which is this second model here in blue with the dotted line. The candidate model is the model that we want to test: is it better than the legacy model? This is what we want to test right here. So the prediction service is going to send 100% of requests to both models, and it's going to ask both models to generate recommendations.
Before we had a candidate model, all of the recommendations were coming from the legacy model. So the legacy model generated these three pink recommendations, and we were just sending those recommendations back to the client. Now we're going to be generating recommendations using both models and interleaving those recommendations in a response. So the client will not see recommendations from only the legacy model or only the candidate model; it will see recommendations from both models at the same time. So we interleave maybe one recommendation from the legacy model, one from the candidate model, one from the legacy model, one from the candidate model, then one from the legacy model to complete the five recommendations.
This gives us a couple of good things. Number one, we are hedging here: if the candidate model is horrible, we are not destroying our application. Imagine the candidate model generates good-looking recommendations that people don't care about. If we just swap the legacy model for the candidate model, well, after a month all of our purchases are going to go down the drain because the recommendations are really, really bad. We don't want to do that. Instead, we're going to be hedging and monitoring over time how good the candidate recommendations are, and we will only switch to the candidate model when we are certain that those recommendations are really good.
In this particular case, let's say the user sent a request and we sent back five different recommendations. We can track what users do with those recommendations: are they buying the products that have been recommended by the candidate model, or have they not bought anything from those recommendations? Obviously, how long this takes depends on how much traffic your site gets. Assuming you get decent traffic, you may need to run this for a couple of weeks. That's our case: we run it for two weeks. After two weeks, we have collected enough information to aggregate all of the purchases and determine whether the candidate model is on par with or better than the legacy model. If that is the case, then we switch: 100% of the traffic goes to the candidate model, which becomes the champion at that point, and we will have to build a new contender later on to run the same thing when we have a new version. If the candidate model is not working well, if people are not clicking on those recommendations, then we can just discard that model, improve it, and come back with a new version later on.
Something else that's also really important: whenever you're presenting a list of recommendations, people will tend to favor recommendations at the top. If I give you my top five, I don't know, air conditioner units, people are going to check number one first. They're going to tend to favor number one, even if you say "in no particular order." It doesn't matter; number one is going to get favored. You need to keep that in mind when you are trying to compare the legacy model with the candidate model. There are multiple ways to go about this. One technique is to randomize those recommendations: maybe you trust the candidate model a little bit more and you are comfortable randomizing the order of recommendations. That would be one way of doing it. The second way would be weighting recommendations based on their position in the list. You know anything at the top will get more clicks, so you will not just favor the model that you're always using to display the top recommendation, because that wouldn't make sense.
So hopefully that makes sense. Again, this is not necessarily sophisticated, but it's a very cool way to test anything that requires user feedback. It's a good way to test a model with real production data without having to put your entire system at risk by deploying a model that is not good enough. You can test this little by little: you can increase the number of recommendations that come from the candidate model until you build the confidence to deploy that candidate model and make it your champion model. So hopefully this helps, and I'll see you later with more tips.
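The interleaving and scoring steps described in the transcript could be sketched roughly as follows. This is not the implementation from the video; it is a minimal illustration under stated assumptions. The function names `interleave` and `score`, the `(source, position, purchased)` log format, and the `position_weights` values are all hypothetical, and dividing each purchase by an estimated attention weight for its slot is just one possible way to do the position weighting the speaker mentions.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave(legacy_recs, candidate_recs, k=5, legacy_first=True):
    """Alternate items from two ranked lists, up to k items.

    Returns (item, source) pairs so each slot can be attributed later.
    Assumption: an item proposed by both models is credited to whichever
    model reaches it first in the alternation.
    """
    first = [(item, "legacy") for item in legacy_recs]
    second = [(item, "candidate") for item in candidate_recs]
    if not legacy_first:
        first, second = second, first

    merged, seen = [], set()
    for pair in zip_longest(first, second):
        for entry in pair:
            # Skip padding from the shorter list and already-shown items.
            if entry is None or entry[0] in seen:
                continue
            merged.append(entry)
            seen.add(entry[0])
            if len(merged) == k:
                return merged
    return merged


def score(impressions, position_weights=None):
    """Aggregate purchases per model from logged impressions.

    impressions: iterable of (source, position, purchased) tuples,
    where position is the 0-based slot in the displayed list.
    position_weights: hypothetical estimates of relative attention per
    slot. A purchase in slot p counts 1 / position_weights[p], so slots
    that naturally attract more attention are not over-credited.
    """
    credit = defaultdict(float)
    shown = defaultdict(int)
    for source, position, purchased in impressions:
        shown[source] += 1
        if purchased:
            weight = 1.0
            if position_weights is not None:
                weight = 1.0 / position_weights[position]
            credit[source] += weight
    return {source: credit[source] / shown[source] for source in shown}


if __name__ == "__main__":
    response = interleave(["a", "b", "c", "d", "e"], ["v", "w", "x", "y", "z"])
    print(response)

    # Pretend the user bought the items in slots 0 and 1 (made-up data).
    log = [(source, pos, pos in (0, 1)) for pos, (_, source) in enumerate(response)]
    print(score(log))
    print(score(log, position_weights=[1.0, 0.5, 0.5, 0.25, 0.25]))
```

Flipping `legacy_first` at random per request would be one simple way to apply the randomization idea, so that neither model always owns the top slot. Whatever scoring is used, the comparison only becomes meaningful after aggregating enough traffic, which in the case described above took about two weeks.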
Original Description
I teach a live, interactive program that'll help you build production-ready Machine Learning systems from the ground up. Check it out here:
https://www.ml.school
To keep up with my content:
• Twitter/X: https://www.twitter.com/svpino
• LinkedIn: https://www.linkedin.com/in/svpino
🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1