I deployed a recommendation model. Testing Models In Production using Interleaving Experiments.

Underfitted · Beginner · 📐 ML Fundamentals · 2y ago

Skills: ML Pipelines 90%, Supervised Learning 60%

Key Takeaways

The video discusses deploying machine learning models, specifically a recommendation model, and testing new versions using interleaving experiments to determine if the new model is better than the previous one.

Full Transcript

Deploying machine learning models: an extremely important topic, obviously. Over the last few weeks I've been working with a company, helping them deploy some machine learning models. Specifically, they have a recommendation model that provides certain recommendations as its output. The problem they're trying to solve is not how to deploy that model; they know how to do that. The problem is how they can test that a new version of the model is actually better than the previous version. This is where I come in, and this is what I'm helping them do.

Today I want to show you the technique that I implemented, and hopefully that will give you some ideas for your project. It's very sophisticated, I think, or at least it's very cool. Remember, the goal here is: how do we know that a new version of a model is better than the existing version?

When you talk to people, they're going to tell you, "Well, you just evaluate the model and compare the performance." Yes, but the problem with a recommendation system is that you cannot judge how good the model is in a vacuum. You need user feedback, and that is the main component that is going to tell you whether your model works or doesn't work. If you provide what look like good recommendations but people don't click on them, then your model doesn't work. Imagine that what you're trying to do is recommend additional products based on the purchase history of a user. The true test of whether those recommendations are good is whether or not users buy the recommended products. They may look amazing on paper, but if users don't care, then your recommendations are bad. That is the main challenge here.

Let me show you a diagram of the technique that I've been working with this company on implementing. It's called an interleaving experiment. Just so you can follow the diagram: we have a client, a web client, that is going to send a request. The request is going to be "give me recommendations for user ABC." Then we have a prediction service. Think of it as the API endpoint that the client application connects to. When the prediction service has only a single model, it uses that model to generate five recommendations and sends them back to the user.

What we are adding now is a second model. Instead of sending the request to one model, the prediction service is going to send the same request to two different models. I'm identifying those models as the legacy model, which is the current version that's deployed (some people like to call this the champion model; it's the model that's currently running), and the candidate model, which is this second model here in blue with the dotted line. The candidate model is the model that we want to test: is it better than the legacy model?

The prediction service is going to send 100% of the requests to both models and ask both of them to generate recommendations. Before we had a candidate model, all of the recommendations were coming from the legacy model. The legacy model generated these three pink recommendations, and we were just sending those back to the client. Now we're going to generate recommendations using both models and interleave them in a single response. The client will not see recommendations from only the legacy model or only the candidate model; it will see recommendations from both models at the same time. We interleave maybe one recommendation from the legacy model, one from the candidate model, one from the legacy model, one from the candidate model, and then one from the legacy model to complete the five recommendations.

This gives us a couple of good things. Number one, we are hedging: if the candidate model is horrible, we are not destroying our application. Imagine the candidate model generates good-looking recommendations that people don't care about. If we just swap the legacy model for the candidate model, after a month all of our purchases are going to go down the drain, because the recommendations are really, really bad. We don't want to do that. Instead, we're going to be hedging and monitoring over time how good the candidate recommendations are, and we will only switch to the candidate model when we are certain that those recommendations are really good.

In this particular case, let's say the user sent a request and we sent back five different recommendations. We can track what users do with those recommendations: are they buying the products recommended by the candidate model, or have they not bought anything from those recommendations? How long this takes obviously depends on how much traffic your site gets. Assuming you get decent traffic, you may need to run this for a couple of weeks. That's our case: we run it for two weeks. After two weeks, we have collected enough information to aggregate all of the purchases and determine whether the candidate model is on par with or better than the legacy model. If that is the case, we switch: 100% of the traffic goes to the candidate model, which becomes the champion at that point, and we will have to build a new contender later on to run the same thing when we have a new version. If the candidate model is not working well, if people are not clicking on those recommendations, then we can discard that model, improve it, and come back with a new version later on.

Something else that's also really important: whenever you present a list of recommendations, people will tend to favor the recommendations at the top. If I give you my top five air conditioner units, people are going to check number one first. Even if you say "in no particular order," it doesn't matter: number one is going to get favored. You need to keep that in mind when you are trying to compare the legacy model with the candidate model.

There are multiple ways to go about this. One technique is to randomize the order of the recommendations, if you trust the candidate model a little bit more and are comfortable doing that. The second way is weighting recommendations based on their position in the list. You know that anything at the top will get more clicks, so you don't want to simply favor whichever model you always use to display the top recommendation, because that wouldn't make sense.

Hopefully that makes sense. Again, this is not necessarily very sophisticated, but it's a very cool way to test anything that requires user feedback. It's a good way to test a model with real production data without putting your entire system at risk by deploying a model that is not good enough. You can do this little by little: you can increase the number of recommendations that come from the candidate model until you build the confidence to deploy it and make it your champion model. Hopefully this helps, and I'll see you later with more tips.
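The interleaving step described in the transcript could be sketched roughly as follows. This is a minimal illustration, not the implementation used in the video: the `Slot` type, the `interleave` function, the `first` parameter, and the choice to skip items that both models recommend are all assumptions made for the example.

```python
from typing import List, NamedTuple, Sequence


class Slot(NamedTuple):
    item: str      # recommended product identifier
    source: str    # "legacy" or "candidate" (hypothetical labels)
    position: int  # 1-based position in the response shown to the user


def interleave(
    legacy: Sequence[str],
    candidate: Sequence[str],
    k: int = 5,
    first: str = "legacy",
) -> List[Slot]:
    """Alternate between the two ranked lists until k slots are filled.

    `first` names the model that fills the top slot. Passing a randomly
    chosen value per request is one way to spread the top-position
    advantage across both models.
    """
    order = ["legacy", "candidate"] if first == "legacy" else ["candidate", "legacy"]
    queues = {"legacy": list(legacy), "candidate": list(candidate)}
    seen = set()
    slots: List[Slot] = []
    turn = 0
    while len(slots) < k and (queues["legacy"] or queues["candidate"]):
        source = order[turn % 2]
        turn += 1
        queue = queues[source]
        # Assumption: an item already placed by the other model is skipped,
        # so the user never sees the same product twice.
        while queue and queue[0] in seen:
            queue.pop(0)
        if not queue:
            continue  # this model has nothing left; the other one takes over
        item = queue.pop(0)
        seen.add(item)
        slots.append(Slot(item, source, len(slots) + 1))
    return slots


if __name__ == "__main__":
    for slot in interleave(["a", "b", "c", "d", "e"], ["b", "x", "y", "z", "w"]):
        print(slot)
```

Recording the `source` and `position` of every slot alongside the response is what later makes it possible to attribute a purchase to the model that produced the recommendation.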

Original Description

I teach a live, interactive program that'll help you build production-ready Machine Learning systems from the ground up. Check it out here: https://www.ml.school

To keep up with my content:
• Twitter/X: https://www.twitter.com/svpino
• LinkedIn: https://www.linkedin.com/in/svpino

🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1

The video teaches how to deploy and test machine learning models, specifically recommendation models, using interleaving experiments to determine if a new model is better than the previous one. This technique allows for testing models with production data without putting the entire system at risk.

Key Takeaways

Deploy a recommendation model
Create a candidate model to test against the champion model
Implement interleaving experiments to compare model performance
Track user feedback and purchases
Evaluate model performance and switch to the candidate model if it is better
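The tracking and evaluation steps above could look something like the sketch below. It assumes each displayed recommendation was logged as a `(request_id, item, source, position)` tuple and each purchase as a `(request_id, item)` pair; those record shapes, the function names, and the example position weights are hypothetical and not taken from the video. The optional weights are one possible reading of the transcript's suggestion to weight recommendations by their position in the list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Impression = Tuple[str, str, str, int]  # (request_id, item, source, position)
Purchase = Tuple[str, str]              # (request_id, item)


def purchase_rate_by_model(
    impressions: Iterable[Impression],
    purchases: Set[Purchase],
    position_weight: Optional[Dict[int, float]] = None,
) -> Dict[str, float]:
    """Return (weighted) purchases per displayed recommendation for each model.

    position_weight maps a list position to the credit a purchase at that
    position earns. Giving lower positions more credit is one way to offset
    the extra attention the top slot receives. Missing positions default to 1.0.
    """
    shown = defaultdict(int)
    credit = defaultdict(float)
    for request_id, item, source, position in impressions:
        shown[source] += 1
        if (request_id, item) in purchases:
            weight = position_weight.get(position, 1.0) if position_weight else 1.0
            credit[source] += weight
    return {source: credit[source] / shown[source] for source in shown}


def candidate_is_on_par_or_better(rates: Dict[str, float]) -> bool:
    # Simplistic rule mirroring the "on par or better" wording in the transcript.
    # A real decision would likely also check that the gap is not just noise.
    return rates.get("candidate", 0.0) >= rates.get("legacy", 0.0)


if __name__ == "__main__":
    log = [
        ("r1", "a", "legacy", 1), ("r1", "b", "candidate", 2),
        ("r2", "c", "legacy", 1), ("r2", "d", "candidate", 2),
    ]
    bought = {("r1", "b"), ("r2", "c")}
    rates = purchase_rate_by_model(log, bought, position_weight={1: 1.0, 2: 2.0})
    print(rates, candidate_is_on_par_or_better(rates))
```

The weights in the example are made-up numbers; in practice they would have to be estimated from how often each position is clicked or purchased on the site in question.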

💡 Interleaving experiments allow for testing models with production data without putting the entire system at risk by deploying a model that is not good enough.
