[MINI] Leakage

Data Skeptic · Beginner ·📐 ML Fundamentals ·9y ago

Key Takeaways

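The episode introduces leakage in machine learning: unintended information, often about the future or about the target itself, ends up in the training data, so the model learns something that isn't useful in practice. Examples include a visit to a cancel-account page when predicting cancellations, a calendar-year count of credit checks when predicting mortgage approval, and a hospital code that reveals which facility treated a cancer patient. It also notes that models are trained on historical data but used to predict the future, so each feature has to be available at the time of prediction, and it touches on the challenges of knowledge representation in machine learning.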

Full Transcript

[Music] Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is [Music] leakage.

So we bought a house. We closed escrow and we moved in last Friday. Before we jump in, any thoughts on the house?

I tried to take a shower. After the water had been running for 10 minutes, maybe even more, and was not getting any warmer, I came to the conclusion that the water heater was broken. Then I went to tell Kyle here, and I don't think he believed me at first. He was like, "Well, it takes a while for the water to get warm." And I was like, "I know, I took a shower before and I know how long it takes." And I went to the kitchen and ran the water too, and it didn't work.

Yeah. And then even after that, I don't even think Kyle really believed me.

No, I tried it and I believed you. But then: trust but verify. So then we called the water heater person.

Yeah, and $65 later we learned that there's a very simple solution: all gas meters in California have a little special trigger that goes off if it thinks there's an earthquake, to stop the gas.

So you gave me some tasks to accomplish when we moved into the house, right? Did I fulfill your expectations in the things I accomplished?

I don't know. You said you fixed the leak under the sink, but we're still validating that one.

Yeah, how are we going to test it?

I'll just look randomly.

Yeah, so you could report back, maybe. But that's a nice little segue for us. Today's topic is leakage. Linda, what does leakage mean to you? What's your working definition?

Something that cannot hold water and therefore lets the water come out.

Cannot hold water, lets the water come out.

Or any liquid.

But that's like a funnel, and we wouldn't say a funnel leaks.

Well, if the intention is for it not to leak. If the intention is for it to be waterproof, or not penetrable by water.

One way we could say it is that a leak is
something where we intend for things to stay in an area, or to stay restricted, and that doesn't happen. We intend for the water to stay in the pipe, but it gets out, and that's a leak.

So it's supposed to be a barrier?

Yeah, a barrier, or just a segregation sort of thing. So leakage is an important topic in machine learning, actually.

So what does it mean?

The technical definition is something like: some unintended additional information goes into your training data that your model can learn from in order to make predictions in a manner that isn't useful in practice. Is that a helpful definition for you?

Now what is it in layperson's terms?

In machine learning, you want to train a system to make predictions or do classifications on a given data set. So you provide it some data and you say, "Try and learn the pattern here." But if the pattern contains the answer, then it's sort of trivial. It's like a short circuit: the system can just learn the pattern directly because it's already there in the data. It's like a cheat.

Let's jump over to an e-commerce example. Let's say you were looking at the way people navigated a website, and you kept track of whether they visited certain pages: did they visit the about-us page, did they visit the blog page, did they look at the products page? And you wanted to predict whether these are customers who cancel their subscription. If they happen to visit the page that's called cancel-account.html, that would almost certainly correlate very highly with predicting that that customer churned, right? Because if you go to that page, you seem to have the intent of cancelling your account. But it's also sort of useless, because no one casually goes to the cancel account page, like, "Let's just find out what it takes, let's explore the website and its great design on its cancel page." If you go there, the immediately next thing you do is cancel your account, pretty much.

Yeah, and maybe you have a win-back strategy. But more or less, if you visit that page, while yes, it might predict that one second later you're going to cancel your account, that's not really useful. It's not a predictive feature. It doesn't help you; you can't take advantage of it. This is an example of leakage.

Meaning what is leaking?

The thing you're trying to predict, whether someone is going to cancel their account, is embedded in the training data you provide.

And which training data would you be providing?

You'd be providing that they visited the cancel account page.

Okay.

Now, on the other hand, it's not all bad. You could ask, "How many times has this person done a search on our website?" and that could be useful, right? Because if someone does a lot of searches, they're probably engaged, they're buying stuff, they're a good customer, whatever. And someone who does infrequent searches might be likely to say, "Oh, I don't really use this, I'm just going to cancel it." So that is sort of okay; there's some predictive power, there's good information in that. But if you include the page where they go to cancel, then we would label that leakage, because that event is not really useful to you and it's not generally available in your training set, but it's highly correlated with a sequential step: the next thing that happens as you cancel.

So sequential is a key term?

Yeah, sequential matters a lot. Also, predicting the
future is a problem. Generally you train an algorithm on historical data, but you usually want to use it to make a prediction or a guess about the future. Since you have all the historical data, if you include something that, from your reference point, is sort of in the future, then it qualifies as leakage. I guess the key way to say this is: is the data available operationally? Meaning, at the time you want to use the data, would you have it available?

So, including information about the future: let's say you want to predict whether someone would be granted a mortgage or not, and you were going to look at how many credit checks they had in a particular calendar year.

Okay.

So if you have all this historical data, you can ask, "How many credit checks did a certain person get in the year 2014?" That's in your past data; you can look at it. But if they were trying to get the mortgage in July and you look at the full year of credit checks, it's not really representative. If you were actually trying to use this in practice, and they were trying to get the mortgage in July, you can't look into the future, because that hasn't happened yet.

Okay. Why might credit checks be predictive of whether you get a mortgage or not?

Well, you have to get a loan, and if you don't get the loan, maybe you just go and get another apartment, in which case one or more landlords are going to run a credit check on you. So having more credit checks could indicate someone who failed to get a mortgage, and that would be a feature that a machine learning algorithm could pick up and take advantage of. But that information technically is sort of in the future if you aggregate by calendar year. So that's an example of leakage: when something about the future slips into your data set. I guess the key idea here is whether the data is operationally unavailable.

There's actually a famous example of leakage too. Somebody was working on, I think,
cancer rates, and they wanted to predict whether or not someone would survive. Every patient was sort of anonymized with a hospital code, but the code contained information about the facility they were at. Some facilities specialized in really, really bad types of cancer that have very high mortality rates, and other facilities just took lots of general patients. So the facility that took the people who had the really bad cancers, with a very low chance of survival, by definition specializes in people who have really bad odds against them, right? The fact that you went there doesn't mean that hospital's bad; it just means that they're kind of a last-ditch, worst-case-effort kind of place. And that data set contained some information that implied the facility they went to. So by including that, whether the person creating the algorithm intended to or not, they gave the machine learning algorithm the opportunity to latch onto something that's not actually useful. If you already know someone is going to the very specialized clinic that's a last-resort kind of place, you already know that they're in a bad situation. So that's leakage.

So leakage is bad?

Yeah, leakage is very bad.

So it's wrong because the person who architected the solution didn't know that or didn't think it through?

We could classify leakage in two ways. In a lot of competitions, where people are competing for who gets the highest score, leakage is generally used as a cheat: someone finds a way to get a high score in a machine learning competition. But leakage happens a lot accidentally. A very well-intentioned person might be trying to create a useful algorithm, and if the training enables the algorithm to get its result in a way that's not actually useful or not generalizable, then that's a really bad form of leakage, because then their model is not helpful.

So if you have a model and it has leakage,
then it means your model is not a good model. Got to get rid of it, got to fix it.

Doesn't generalize.

Okay, so then I have a question. If you have all these problems trying to build an algorithm in machine learning, how do we know that we can trust you people? How do you know that all your models don't have leakage in them, or that all your models aren't flawed?

Maybe that could be. Well, just keep that on the DL; we don't want that fact getting out.

But no, how can we avoid it?

So someone doing this type of work should be very worried about leakage, and they should try to remove any field that might be related to leakage. For example, something I've done a fair amount in my life is look at the way people navigate websites and say, "This person goes to these pages with these frequencies, and they have these nav paths; can I use that to be predictive?" But in the example where I want to predict that they're going to cancel their account, and they go to the cancel-account.html page, that's definitely leakage and I've just got to take that out of there. I might even have to take out when they visit the My Account page, because even though maybe they just want to change their password, My Account contains Cancel Account. Then again, going to My Account could be a good thing: maybe you go to My Account regularly because you change your password a lot and you want to be safety-first or whatever. There's not necessarily an always-works solution. These are just some of the problems that make ML a little difficult. It's not really about the algorithms; it's about the knowledge representation most of the time. You can also add noise. That's another technique I've found useful once or twice, where you add a little bit of noise to make sure that whatever might be leakage is kind of smoothed out. But these are the challenges of a working data scientist. And for anyone else who's given that model: come up with a
way to empirically test it.

Leakage is bad.

Yes indeed, bad for data scientists and bad for [unclear].

Well said, and thank you again for joining me. Until next time, I want to remind everyone to keep thinking skeptically of and with data.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.
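
One way to make the cancel-page example concrete is a quick screen for features that predict the label almost perfectly on their own. This is only a minimal sketch, not something described in the episode: the feature names, the toy sessions, and the 0.95 threshold are all hypothetical, and a flagged feature is a prompt for human review rather than proof of leakage.

```python
from typing import Dict, List


def single_feature_accuracy(rows: List[Dict[str, int]], feature: str, label: str) -> float:
    """Accuracy of predicting a binary label from one binary feature alone.

    Both "feature equals label" and "feature is the opposite of label" are
    tried, and the better of the two is returned.
    """
    agree = sum(1 for r in rows if r[feature] == r[label])
    return max(agree, len(rows) - agree) / len(rows)


def flag_suspicious_features(
    rows: List[Dict[str, int]],
    features: List[str],
    label: str,
    threshold: float = 0.95,  # hypothetical cut-off, chosen for illustration
) -> List[str]:
    """Return the features that almost perfectly predict the label by themselves."""
    return [f for f in features if single_feature_accuracy(rows, f, label) >= threshold]


# Hypothetical toy sessions: 1 = visited the page / cancelled, 0 = did not.
sessions = [
    {"visited_blog": 1, "visited_products": 1, "visited_cancel_page": 0, "cancelled": 0},
    {"visited_blog": 0, "visited_products": 1, "visited_cancel_page": 0, "cancelled": 0},
    {"visited_blog": 1, "visited_products": 0, "visited_cancel_page": 1, "cancelled": 1},
    {"visited_blog": 0, "visited_products": 0, "visited_cancel_page": 1, "cancelled": 1},
    {"visited_blog": 1, "visited_products": 1, "visited_cancel_page": 0, "cancelled": 0},
    {"visited_blog": 0, "visited_products": 1, "visited_cancel_page": 1, "cancelled": 1},
]
features = ["visited_blog", "visited_products", "visited_cancel_page"]

suspects = flag_suspicious_features(sessions, features, "cancelled")
print(suspects)  # only the cancel-page visit is flagged on this toy data
```

A screen like this would not catch subtler cases such as the My Account page, which is why the episode stresses thinking about how each field relates to the outcome.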

Original Description

If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the future in our training data when building machine learning models. Similarly, any other feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction is a feature that can introduce leakage to your model.
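
The mortgage example from the episode can be sketched in code. The snippet below is an illustration under stated assumptions: the dates, the applicant, and the function names are hypothetical. It contrasts a calendar-year count of credit checks, which includes checks made after the application, with a point-in-time count that uses only what was known on the application date.

```python
from datetime import date
from typing import List


def checks_in_calendar_year(check_dates: List[date], year: int) -> int:
    """Leaky feature: counts every check in the year, even those after the application."""
    return sum(1 for d in check_dates if d.year == year)


def checks_before_application(check_dates: List[date], application_date: date) -> int:
    """Point-in-time feature: same-year checks dated strictly before the application."""
    return sum(
        1
        for d in check_dates
        if d.year == application_date.year and d < application_date
    )


# Hypothetical applicant who applied for a mortgage in July 2014.
application_date = date(2014, 7, 15)
credit_checks = [
    date(2014, 3, 2),    # before the application
    date(2014, 6, 20),   # before the application
    date(2014, 9, 1),    # after it, e.g. a landlord's check following a denial
    date(2014, 10, 12),  # after it
]

leaky = checks_in_calendar_year(credit_checks, 2014)
safe = checks_before_application(credit_checks, application_date)
print(leaky, safe)  # 4 2
```

The two checks made after July are exactly the ones that could encode the outcome, so only the point-in-time count is something a deployed model could actually be given.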


This episode explains leakage in machine learning and why a model that relies on leaked information fails to generalize. It gives examples of how leakage occurs and stresses that, although models are trained on historical data, each feature must be one that would be available at the time of prediction. Understanding leakage helps listeners spot features that should be removed and avoid a common pitfall.

Key Takeaways
  1. Define leakage in machine learning
  2. Identify potential sources of leakage in training data, and remove fields that might be related to it
  3. Train on historical data using only features that would be available at prediction time
  4. Avoid including information about the future in training data
  5. Consider adding a little noise to smooth out a suspected leak (a technique the host has found useful once or twice)
  6. Test machine learning models empirically
💡 Leakage can make a model look accurate in training while failing to generalize in practice, so every feature should be checked against what would actually be known when the prediction is made.
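
One common way to test a model empirically for future-information leakage is a chronological holdout: fit on earlier records and evaluate only on later ones. This check is an assumption added here for illustration, not a method described in the episode, and the field names, rows, and cutoff date are hypothetical.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def chronological_split(rows: List[Row], cutoff: date) -> Tuple[List[Row], List[Row]]:
    """Train on rows dated before `cutoff`; evaluate on rows dated on or after it."""
    train = [r for r in rows if r["event_date"] < cutoff]
    test = [r for r in rows if r["event_date"] >= cutoff]
    return train, test


# Hypothetical customer records with a search-count feature and a churn label.
rows = [
    {"event_date": date(2014, 1, 10), "searches": 12, "cancelled": 0},
    {"event_date": date(2014, 3, 5), "searches": 1, "cancelled": 1},
    {"event_date": date(2014, 5, 22), "searches": 7, "cancelled": 0},
    {"event_date": date(2014, 8, 14), "searches": 0, "cancelled": 1},
    {"event_date": date(2014, 11, 30), "searches": 9, "cancelled": 0},
]

train, test = chronological_split(rows, cutoff=date(2014, 7, 1))
print(len(train), len(test))  # 3 2
```

If a model scores far better under a random split than under a split like this one, that gap is a hint that some feature may be carrying information from the future.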
