Introduction to Adversarial Validation In Machine Learning.
Key Takeaways
Introduction to adversarial validation in machine learning for handling distribution shifts
Full Transcript
So a company has been collecting data for some time now, and they decide to hire you. They want you to build a machine learning model to make some predictions. On the first day, you take a look at the data, grab the last six months of it, and decide to train your model using the first five months and leave the last month for testing. Everything is great. You spend some time building this model, and at the end you see that it performs very well on the training and validation sets. However, as soon as you test it using the test data (remember, the last month that you set aside), your model sucks. What's going on here?
This is a very common situation: you spend a long time training your model just to realize that it doesn't generalize well to the test set. Today we're going to talk about a very clever technique that will help you, number one, diagnose the problem and, number two, set you on the right track to fix it. Let's start.
We all know about overfitting, and there are a million ways to deal with it. But there is something that many people miss in one of these situations where their model is not doing well on the test set: maybe, just maybe, that test data is not coming from the same distribution as the training set.
So what does that mean? I have my iPad here, so let's take a look at what I'm talking about. Let's say we get our data set and we split it in two: we have a training set, and then we have some of the data that we're leaving aside for our test set. I'm going to make the assumption that we can visualize this data in two dimensions. I'm going to plot my training set here; let's assume that this is what that training set looks like if we plot it in two dimensions. Now let's plot the test set, and let's
assume that when we plot the test set, this is what we get. So this is the question that I have: can we separate the test set from the training set if they look like this? The answer is obviously no. There's nothing here that tells us that these two sets are easily separable.
But let me go back and redraw that test set. Imagine that when we draw the test set, this is what we get. In this particular case, you can see that we can definitely create a model to split the training data from the test data. What this means is that there are features in the training data that are pulling our training data to the left side of this chart, and there are other features that are pulling our test data to the right side. These features are making our two sets different. Therefore, if we train a model on this training set, that model will learn to fit specific information that is not present in the test set. The same thing applies when we test our model on this test set: the model will expect features that are not there, it will expect information that is not there, so obviously it is going to start making some wacky predictions. This is a very common problem.
Now, how could this happen? When I started this video, I gave you one example where you take six months' worth of data and split it into five months for training and one month for testing. Something that happens very commonly is that the world changes and we don't realize it. When COVID happened, for example, a bunch of processes and a bunch of user behavior completely changed, causing many machine learning models to stop making good predictions. It doesn't have to be as radical as COVID, but when something changes, the data is going to reflect that change. So maybe within that six-month period something changed in our data that's causing the last month to have characteristics that are different from the first five
months. This is just one example of how this could happen, but the bottom line is: if your training data is not coming from the same distribution as your testing data, you will see a model that underperforms as soon as you test it.
So let me talk to you about adversarial validation. Fancy name; it sounds really complex, but it actually isn't. Adversarial validation will help us determine whether your training data and your test data are coming from the same distribution.
This is how it works. Let's assume I have a data set that looks like this: four different columns plus the target variable. The way adversarial validation works is that we mix together all of the data that we have; we put the training data together with the test data. Then we get rid of the target variable. Whatever we are trying to predict, we don't care about it right now. Remember, we are focused on determining whether the training data comes from the same distribution as the test data. So we delete this target column and add a new target column. Call it "adversarial validation target"; you can call it "potato", it really doesn't matter. The goal of this column is to set a value of one for every sample coming from the training set and a value of zero for every sample coming from the test set.
Using this new data set, you're going to build a model, and this model has one goal: to separate, or at least try to separate, the training samples from the test samples. If this model is successful, there is a problem with the distribution of your data. If this model is capable of splitting training samples from test samples, your distributions are different.
So how do we evaluate this model? It is a very simple binary classification model. You throw all of this data at it
and you're going to evaluate it using the ROC curve, the receiver operating characteristic curve (that's a mouthful). Let's say we have this curve, and this is the way it's going to look. Here you have the true positive rate of your model, and here the false positive rate. This line here is random chance, meaning a model that randomly guesses where each sample comes from will perform at about this line. Then we are going to use the AUC, the area under the curve, to measure the performance of this second model.
The idea here is very simple. If our true positive rate versus false positive rate curve looks like this, the area under that curve will be close to 1.0. If we get an AUC close to 1.0, our model is very good at telling whether a sample comes from the training set or from the test set. So what does that mean? If we get a high AUC, we have a problem with the distribution of our two sets: there is something that's giving away where a sample is coming from, so we should take a look at those sets.
The opposite case is when our ROC curve is very close to this line here. In that case we will get an AUC close to 0.5, meaning our model is very close to making a random decision, which means it cannot tell training samples from test samples. That's a good thing. If our model cannot tell the samples apart, that means both sets come from the same distribution, so that's great. If your model is not doing well on the test set, it is not because of differences in the distributions they come from; you have a different problem.
But let's assume for one second that you do this: you create this new data set, you build a model, you run a prediction, and you get a high AUC, so the area under the
curve is at or close to 1.0. So what do you do next? Now you know that the two data sets are not coming from the same distribution. That's great, but what do you do with that information?
Well, you have that model. The next step is to find out which features from the data are contributing to those predictions. You can list every feature in order of its importance to those predictions, and what you are going to find is that the most important features are the ones that are leaking information to this model, leaking the origin of a sample. The most important features are the ones you want to take a look at.
To summarize: adversarial validation is a very clever technique that will help you determine whether your training data is coming from the same distribution as your test data. If they are not coming from the same distribution, you should expect a gap between the performance of your model on the training set and its performance on the test set. So keep in mind that whenever you see that big gap, you may want to use adversarial validation to determine whether you have a problem with the distributions, and then start looking into which features you might need to work with in order to remove those differences.
I hope this was helpful. Let me know in the comments if you have any questions. And if you're looking for a different way, I like to call it a more fun way, to learn and keep up with machine learning, take a look at binomial.com (the URL is going to be down here), where we write one machine learning question every single day for you to practice what you know. If you make a mistake, we always have a detailed explanation letting you know what the right answer is and why we picked it. And with that, I'll see you on the next one. Ciao.
So let's say we have a data set, oops, I turned it off. It's not that easy to work with this iPad here. Okay.
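The procedure walked through in the transcript (pool the two sets, drop the original target, label each row by origin, train a binary classifier, read the AUC, then rank the features) can be sketched in a few lines. This is a minimal, dependency-free illustration under several assumptions that are not in the video: the data is synthetic, every function and variable name is hypothetical, a hand-written logistic regression stands in for the "very simple binary classification model", the AUC is computed on a 30% hold-out of the pooled rows, and the absolute weight of each feature serves as a rough importance score. In practice a library classifier and AUC routine would normally replace the hand-written ones.

```python
import math
import random


def roc_auc(scores, labels):
    """ROC AUC computed from ranks (Mann-Whitney U); tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def linear_score(weights, bias, row):
    return bias + sum(w * x for w, x in zip(weights, row))


def fit_logistic(rows, labels, epochs=100, learning_rate=0.5):
    """Plain batch gradient descent on the logistic loss (illustrative, not tuned)."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(rows, labels):
            # Clamp the logit so math.exp cannot overflow.
            z = max(-30.0, min(30.0, linear_score(weights, bias, row)))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def adversarial_validation(train_rows, test_rows, seed=0):
    """Return (AUC, weights) of a classifier that tries to tell train rows from test rows.

    The rows are assumed to contain features only: the original target is already dropped.
    """
    rows = train_rows + test_rows
    origin = [1] * len(train_rows) + [0] * len(test_rows)  # 1 = train, 0 = test
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(0.7 * len(indices))  # assumption: score on a 30% hold-out
    fit_idx, eval_idx = indices[:cut], indices[cut:]
    weights, bias = fit_logistic([rows[i] for i in fit_idx], [origin[i] for i in fit_idx])
    scores = [linear_score(weights, bias, rows[i]) for i in eval_idx]
    return roc_auc(scores, [origin[i] for i in eval_idx]), weights


def make_rows(n, rng, shift=0.0):
    """Four Gaussian features; `shift` moves the mean of feature index 2 only."""
    return [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(shift, 1), rng.gauss(0, 1)]
            for _ in range(n)]


rng = random.Random(42)
train = make_rows(1000, rng)
test_same = make_rows(400, rng)                # same distribution as train
test_shifted = make_rows(400, rng, shift=2.0)  # feature 2 has drifted

same_auc, _ = adversarial_validation(train, test_same)
shifted_auc, shifted_weights = adversarial_validation(train, test_shifted)

# Rank features by |weight|: a rough importance proxy that is only reasonable
# here because all four synthetic features share the same scale.
ranked_features = sorted(range(4), key=lambda j: abs(shifted_weights[j]), reverse=True)

print(f"AUC, same distribution: {same_auc:.3f}")
print(f"AUC, shifted test set:  {shifted_auc:.3f}")
print("Features ranked by importance:", ranked_features)
```

With this synthetic setup, the first AUC should land near 0.5 and the second well above it, and the drifted feature should come first in the ranking. On real data, the top-ranked features are the ones to inspect, as the transcript suggests.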
Original Description
There are many sources of overfitting, but an important one is when your training and test data do not come from the same distribution.
Unfortunately, this is not an uncommon problem. For example, training a model with data collected during a period different from the test or production data could lead to poor performance. Even slight differences could considerably affect your results, yet many people struggle to identify this issue and to decide how best to move forward.
That's where Adversarial Validation comes in.
📚 My 3 favorite Machine Learning books:
• Deep Learning With Python, Second Edition — https://amzn.to/3xA3bVI
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — https://amzn.to/3BOX3LP
• Machine Learning with PyTorch and Scikit-Learn — https://amzn.to/3f7dAC8
Twitter: https://twitter.com/svpino
Disclaimer: Some of the links included in this description are affiliate links where I'll earn a small commission if you purchase something. There's no cost to you.