[MINI] Leakage

Data Skeptic · Beginner · 📐 ML Fundamentals · 9y ago

Key Takeaways

The video discusses leakage in machine learning: unintended information, often about the future or about the target itself, ends up in the training data, so the model learns patterns that would not be available when it is actually used. Examples include a visit to the cancel-account page when predicting cancellations, a full calendar year of credit checks when predicting a mortgage decision made in July, and a hospital code that reveals which facility treated a patient. It also touches on training on historical data to predict the future and on the challenges of knowledge representation in machine learning.

Full Transcript

[Music] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is [Music] leakage.

So, we bought a house. We closed escrow and we moved in last Friday. Before we jump in, any thoughts on the house?

I tried to take a shower. I came to the conclusion, after the water had been running for 10 minutes, maybe even more, and was not getting any warmer, that the water heater was broken. Then I went to tell Kyle here, and I don't think he believed me at first. He was like, "Well, it takes a while for the water to get warm." And I was like, "I know, I took a shower before and I know how long it takes." I went to the kitchen and ran the water too, and it didn't work.

Yeah.

And then even after that, I don't think Kyle really believed me.

No, I tried it and I believed you, but trust but verify. So then we called the water heater person.

Yeah, and $65 later we learned that there's a very simple solution: all gas meters in California have a little special trigger thing that goes off if it thinks there's an earthquake, to stop the gas.

So you gave me some tasks to accomplish when we moved into the house, right? Did I fulfill your expectations in the things I accomplished?

I don't know. You said you fixed the leak under the sink, but we're still validating that one.

Yeah. How are we going to test it?

I'll just look randomly.

Yeah, so you could report back, maybe. But that's a nice little segue for us. Today's topic is leakage. Linda, what does leakage mean to you? What's your working definition?

Something that cannot hold water and therefore lets the water come out.

Cannot hold water, lets the water come out.

Or any liquid.

But that's like a funnel, and a funnel we wouldn't say leaks.

Well, if the intention is for it not to leak, yeah. If the intention is for it to be waterproof or non-penetrable by water.

One way we could say it is that a leak is
something that we intend to stay in an area, or to stay restricted, and that doesn't happen. Right? We intend for the water to stay in the pipe, but it gets out, and that's a leak.

So it's supposed to be a barrier?

Yeah, a barrier, or just a segregation sort of thing. So leakage is an important topic in machine learning, actually.

So what does it mean?

The technical definition is something like: some unintended additional information goes into your training data that your model can learn from in order to make predictions in a manner that isn't useful in practice. Is that a helpful definition for you?

Now what is it in layperson's terms?

In machine learning, you want to train a system to make predictions or do classifications on a given data set. So you provide it some data and you say, "Try and learn the pattern here." Right? But if the pattern contains the answer, then it's sort of trivial. It's like a short circuit: the system can just learn the pattern directly because it's already there in the data. But it's like a cheat.

Mhm.

Let's jump over to an e-commerce example. Let's say you were looking at the way people navigated a website, and you kept track of whether they visited certain pages: did they visit the About Us page, did they visit the blog page, did they look at the products page? And you wanted to predict whether these are customers that cancel their subscription. If they happen to visit the page that's called cancel_account.html, that would almost certainly correlate very highly with predicting that that customer churned out, right?

Mhm.

Because if you go to that page, you seem to have the intent of cancelling your account. But it's also sort of useless, because no one casually goes to the cancel account page, like, "Let's just find out what it takes, let's explore the website and its great design on its cancel page." If you go there, the immediately next thing you do is cancel your account, pretty much. Right?

Yeah.

And maybe you have a win-back strategy, but more or less, if you visit that page, while yes, it might predict that one second later you're going to cancel your account, that's not really useful. It's not a predictive feature. It doesn't help you; you can't take advantage of it. This is an example of leakage.

Meaning what is leaking?

The thing you're trying to predict, whether someone is going to cancel their account, is embedded in the training data you provide.

And which training data would you be providing?

You'd be providing that they visited the cancel account page.

Okay.

Now, on the other hand, that's not totally bad. You could say, "Well, how many times has this person done a search on our website?" And that could be useful, right? Because if someone does a lot of searches, they're probably engaged, they're buying stuff, they're a good customer, whatever. And someone that does infrequent searches might be likely to say, "Oh, I don't really use this, I'm just going to cancel it." So that is sort of okay; there's some predictive power, there's good information in that. But this one little place, if you include the page where they go to cancel, then we would label that leakage, because that event is not really useful to you and it's not generally available in your training set, but it's highly correlated with a sequential step, the next thing that happens as you cancel.

So sequential is a key term?

Yeah, sequential matters a lot. Also, predicting the
future is a problem. Generally, you train an algorithm on historical data, but you usually want to use it to make a prediction or a guess about the future. Since you have all the historical data, if you include something that, from your reference point, is sort of in the future, then it qualifies as leakage. I guess the key way to say this is: is the data available operationally? Meaning, at the time you want to use the data, would you have it available?

So, including information about the future: let's say you want to predict whether someone would be granted a mortgage or not, and you were going to look at how many credit checks they had in a particular calendar year.

Okay.

So if you have all this historical data, you can ask, "How many credit checks did a certain person get in the year 2014?" That's in your past data; you can look at it. But if they were trying to get the mortgage in July and you look at the full year of credit checks, it's not really representative, because if you were actually trying to use this in practice, when they were trying to get the mortgage in July, you can't look into the future, because that hasn't happened yet.

Okay. Why might credit checks be predictive of whether you get a mortgage or not?

Well, you have to get a loan, and if you don't get the loan, maybe you just go and get another apartment, in which case one or more landlords are going to run a credit check on you. So having more credit checks could indicate someone who failed to get a mortgage, and that would be a feature that a machine learning algorithm could pick up and take advantage of. But that information technically is sort of in the future if you aggregate by calendar year. So that's an example of leakage, when something about the future slips into your data set. I guess the key idea here is whether the data is operationally unavailable.

There's actually a famous example of leakage too. Somebody was working on, I think,
cancer rates, and they wanted to predict whether or not someone would survive. Every patient was sort of anonymized with a hospital code, but the code contained information about the facility they were at. Some facilities specialized in really, really bad types of cancer that have very high mortality rates, and other facilities just took lots of general patients. The facility that took the people with the really bad cancers, with a very low chance of survival, by definition specializes in people who have really bad odds against them. Right? So the fact that you went there doesn't mean that hospital's bad; it just means that they're kind of a last-ditch, worst-case-effort kind of place. And that data set contained some information that implied the facility they went to. So by including that, whether the person creating the algorithm intended to or not, they gave the machine learning algorithm the opportunity to latch onto something that's not actually useful. Because when predicting the mortality rate of someone who goes to the very specialized clinic that's a last-resort kind of place, if you already know they're going there, you already know that they're in a bad situation. So that's leakage.

So leakage is bad?

Yeah, leakage is very bad.

So it's wrong because the person who architected the solution didn't know that or didn't think it through?

We could classify leakage in two ways. In a lot of competitions, where people are competing for who gets the highest score, leakage is generally used as a cheat: someone finds a way to get a high score in a machine learning competition. But leakage happens a lot accidentally. A very well-intentioned person might be trying to create a useful algorithm, and if they train it and the training enables the algorithm to get its result in a way that's not actually useful or not generalizable, then that's a really bad form of leakage, because then their model is not helpful.

So leakage, if you have a model and it has leakage,
then it means your model is not a good model. Got to get rid of it, got to fix it.

It doesn't generalize.

Okay, so then I have a question. If you have all these problems trying to build an algorithm in machine learning, how do we know that we can trust you people? How do you know that all your models don't have leakage in them, or that all your models aren't flawed?

Maybe, that could be. Well, just keep that on the DL; we don't want that fact getting out.

But no, how can we avoid it?

Someone doing this type of work should be very worried about leakage, and they should try to remove any field that might be related to leakage. For example, something I've done a fair amount in my life is look at the way people navigate websites and say, "This person goes to these pages with this frequency and has these nav paths; can I use that to be predictive?" But in the example where I want to predict that they're going to cancel their account and they go to the cancel_account.html page, that's definitely leakage, and I've just got to take that out of there. I might even have to take out when they visit the My Account page, because even though maybe they just want to change their password, they might go to My Account, which contains cancel account. Also, going to My Account could be a good thing: maybe you go to My Account regularly because you change your password a lot and you want to be safety first, or whatever. There's not necessarily a solution that always works. These are just some of the problems that make ML a little difficult. It's not really about the algorithms; it's about the knowledge representation, most of the time. You can also add noise. That's another technique I've found useful once or twice, where you add a little bit of noise to make sure that whatever might be leakage is kind of smoothed out. But these are the challenges of a working data scientist, and for the rest, anyone who's given that model, come up with a
way to empirically test it.

Leakage is bad.

Yes indeed. Bad for data scientists and bad for kining.

Well said. And thank you again for joining me. Until next time, I want to remind everyone to keep thinking skeptically of and with data.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.
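The advice in the transcript to take cancellation-related pages out of the navigation features could be sketched roughly as follows. This is only an illustration: the page names, the `LEAKY_PAGES` set, and the `navigation_features` helper are all hypothetical, and deciding which pages count as leaky remains a judgment call, as the discussion of the My Account page suggests.

```python
from collections import Counter

# Hypothetical page names. Which pages leak the label is a judgment call.
LEAKY_PAGES = frozenset({"cancel_account.html"})


def navigation_features(visits, excluded=LEAKY_PAGES):
    """Count visits per page, skipping pages judged to leak the label."""
    return dict(Counter(page for page in visits if page not in excluded))


# One made-up browsing session for a single customer.
session = [
    "products.html",
    "blog.html",
    "my_account.html",
    "cancel_account.html",
    "products.html",
]

# Default: drop only the cancel page.
features = navigation_features(session)

# Stricter variant: also drop My Account, since it links to cancellation.
strict_features = navigation_features(
    session, excluded=LEAKY_PAGES | {"my_account.html"}
)

print(features)
print(strict_features)
```

The stricter variant trades away a possibly useful signal (customers who visit My Account for benign reasons) in exchange for less risk of leakage, which is the trade-off described in the episode.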

Original Description

If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the future in our training data when building machine learning models. Similarly, any other feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction is a feature that can introduce leakage into your model.
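The mortgage example from the episode gives one way to picture "information about the future". The sketch below is a minimal illustration with invented dates and hypothetical function names: it contrasts a credit-check count aggregated over the full calendar year with a count restricted to what would have been known on the application date.

```python
from datetime import date


def checks_full_year(check_dates, year):
    """Leaky version: counts every credit check in the calendar year,
    including checks that happened after the mortgage application."""
    return sum(1 for d in check_dates if d.year == year)


def checks_as_of(check_dates, as_of):
    """Point-in-time version: counts only checks in the same calendar
    year that happened strictly before the `as_of` date."""
    return sum(1 for d in check_dates if d.year == as_of.year and d < as_of)


# Hypothetical applicant: applies for a mortgage on 1 July 2014, is declined,
# then applies for apartments, which triggers more credit checks.
application_date = date(2014, 7, 1)
check_dates = [
    date(2014, 3, 10),
    date(2014, 6, 20),
    date(2014, 8, 5),   # after the application
    date(2014, 9, 12),  # after the application
    date(2014, 10, 1),  # after the application
]

leaky_count = checks_full_year(check_dates, 2014)
safe_count = checks_as_of(check_dates, application_date)

print(leaky_count, safe_count)
```

The full-year count includes the landlord checks that followed the rejection, so it partly encodes the outcome being predicted. The point-in-time count uses only what would have been available operationally in July.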

This video teaches the concept of leakage in machine learning and its implications on model performance. It provides examples of how leakage can occur and discusses the importance of using historical data for training algorithms. By understanding leakage, viewers can improve their machine learning models and avoid common pitfalls.

Key Takeaways

Define leakage in machine learning
Identify potential sources of leakage in training data
Use historical data to train algorithms
Avoid including information about the future in training data
Add noise to smooth out leakage
Test machine learning models empirically
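One possible way to act on "identify potential sources of leakage" and "test empirically" is to check how well each feature predicts the label on its own. The sketch below is an assumption-laden illustration, not a method described in the episode: the feature names, the data, the helper functions, and the 0.95 threshold are all made up, and a high score is only a prompt to investigate, not proof of leakage.

```python
def single_feature_accuracy(feature_values, labels):
    """Accuracy of the best rule that predicts a binary label from one
    binary feature alone (either 'label = feature' or 'label = not feature')."""
    n = len(labels)
    agree = sum(1 for f, y in zip(feature_values, labels) if f == y)
    return max(agree, n - agree) / n


def flag_suspects(rows, labels, threshold=0.95):
    """Return features whose solo accuracy meets the (arbitrary) threshold."""
    scores = {
        name: single_feature_accuracy([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return {name: score for name, score in scores.items() if score >= threshold}


# Hypothetical data for eight customers; 1 means the customer cancelled.
churned = [1, 0, 0, 1, 0, 1, 0, 0]
visited_blog = [1, 1, 0, 0, 1, 0, 0, 1]
heavy_searcher = [0, 1, 1, 0, 1, 1, 1, 0]
visited_cancel_page = [1, 0, 0, 1, 0, 1, 0, 0]  # mirrors the label exactly

rows = [
    {"visited_blog": b, "heavy_searcher": s, "visited_cancel_page": c}
    for b, s, c in zip(visited_blog, heavy_searcher, visited_cancel_page)
]

suspects = flag_suspects(rows, churned)
print(suspects)
```

A feature that predicts the label almost perfectly by itself, like the cancel-page visit here, deserves a closer look at whether it would really be available at prediction time.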

💡 Leakage can make a model look accurate in training while failing to generalize in practice; to avoid it, train only on features that would actually be available at the time the prediction is made.
