Predictive Models on Random Data

Data Skeptic · Beginner ·📐 ML Fundamentals ·9y ago


Key Takeaways

The video discusses predictive models on random data, highlighting issues such as leakage, overfitting, and cross-validation bias, and provides insights on how to detect and avoid these problems in machine learning models, including techniques like stratification and regularization.
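
As a rough illustration of the cross-validation bias mentioned above, here is a minimal sketch, not code from the episode or the paper. It assumes a hypothetical data set of 100 examples with 10 positives and a "model" with no signal at all: it only predicts the base rate of its training folds, which is the heavily regularized case described in the transcript. The function names, the seed, and the sizes are all illustrative assumptions.

```python
import random

def auc(labels, scores):
    """Chance that a random positive outranks a random negative; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_folds(labels, k, rng, stratified):
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    if stratified:
        # Stable sort: positives first, shuffled order kept within each class,
        # so dealing round-robin spreads the positives evenly over the folds.
        idx.sort(key=lambda i: -labels[i])
    return [idx[f::k] for f in range(k)]

def pooled_cv_auc(labels, k, rng, stratified):
    """Pooled out-of-fold AUC of a no-signal model that predicts the training base rate."""
    preds = [0.0] * len(labels)
    for fold in make_folds(labels, k, rng, stratified):
        held_out = set(fold)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        base_rate = sum(train) / len(train)
        for i in fold:
            preds[i] = base_rate  # more positives held out means a lower prediction
    return auc(labels, preds)

rng = random.Random(0)        # arbitrary seed, an assumption of this sketch
labels = [1] * 10 + [0] * 90  # no features at all, so there is nothing to learn
plain = [pooled_cv_auc(labels, 10, rng, stratified=False) for _ in range(200)]
strat = [pooled_cv_auc(labels, 10, rng, stratified=True) for _ in range(200)]
print("plain folds, mean pooled AUC:     ", round(sum(plain) / len(plain), 3))
print("stratified folds, mean pooled AUC:", round(sum(strat) / len(strat), 3))
```

With unstratified folds the pooled AUC can only land at or below 0.5, because a fold that holds out more positives also gets a lower predicted base rate. With one positive per fold, every prediction is identical and the artifact disappears.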

Full Transcript

[Music] Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific [Music] skepticism.

Claudia Perlich has a PhD in Information Systems from NYU and an MA in computer science from the University of Colorado. She is currently the chief scientist at Dstillery, working on optimizing machine learning systems to improve display ad targeting. Claudia, welcome to Data Skeptic.

Thank you so much for having me. It's a pleasure.

Oh, I'm really glad to have you here. I read a number of your papers recently, and there were two in particular that stood out to me that I thought would make a really great conversation: the first being "Leakage in Data Mining: Formulation, Detection, and Avoidance," and the second being "On Cross-Validation and Stacking: Building Seemingly Predictive Models on Random Data." I like these sorts of things; I think they fit well with my audience. I was trying to characterize some of this to a friend who's not really a machine learning person, and I was saying, well, it's not really about diagnostics, it's more like maybe auditing. The best analogy I came up with was when you buy a thermometer: it'll tell you its precision and things like that, but then it also usually has this range, like it doesn't work when it's extremely hot or extremely cold. That was the closest real-world analogy I got. Do you think that's fitting, or how do you describe this collection of work?

While my job, and probably my core competence, is building models that are really, really good at predicting things, I get kind of intrigued when things go off the cliff for all kinds of reasons. Sometimes it's the usual suspects that you have with bad data and so on, but every time there is a discrepancy between what I expect my methods to do and what I see happening, I feel like, wow, I need to understand that. And so the two examples you picked are two of these moments of, hm, that doesn't feel right, it just doesn't sit well
with me, and I really need to get to the bottom of it to understand more about how these learning systems work. So for me it's this tension between your intuition and what you see happening that kicks off almost a detective sense of needing to get to the bottom of stuff. That may not be a good analogy for your friend, but that's what it looks like and feels like from my perspective.

Yeah, no, absolutely, I think it's very fitting. Longtime listeners of the show will know about cross-validation. I talk about it a lot, but not necessarily with the practitioner precision that is covered in your paper. So maybe it would be good to open up a discussion of some of the ways one might cross-validate.

The whole point of cross-validation goes back to the fundamentals of machine learning: you can never, ever evaluate what the machine has learned on the data set that you gave it to learn from, because in the worst case it just memorizes everything and tells you exactly what it saw in the data. Those types of models tend to be completely useless in practice, because the moment you give them something new they are clueless. This is referred to technically as overfitting, and standard practice has been that you really should evaluate whatever you did on a part of the data set that had never been, in any which way, involved in any of the decisions you made in building the model. So usually people talk about holdout and test sets and so on. There are occasions where you don't have an awful lot of data, so you feel like you don't really want to keep any data to yourself for that test, because every data point you take out from the training set means your model estimation gets less information and is less good as a result. Cross-validation on some level tries to be the fix to this problem, to say: okay, fine, how about we just do it multiple times, where you break up your entire data set into chunks, and then you learn on N minus one
chunks, and you can always evaluate on that other piece. You do it multiple times, as many chunks as you have, and so you end up with what people consider holdout predictions on the entire data set for your evaluation. So that's broadly cross-validation. Typically people feel like, okay, 10 folds is probably good enough, and there are technical papers that compare basically threefold to 10-fold to a millionfold, and there seems to be some consensus that 10 is not a bad place to start.

Now let's talk about stratification. One thing that happens when you need cross-validation the most, because you have little data, is that you have very few positives, meaning your minority class has very, very few examples. In a classification setting there's typically one class that's very common, and then there is this exceptional thing that you want to predict: could be fraud, could be people buying things. So there's often a rare class, and if you now just blindly do cross-validation you will get very skewed results. From a theoretical standpoint everybody always talks about IID, right? You have heard that your test set should be independently sampled. But when you do cross-validation and you remove a chunk from your data set for testing, you have artificially created a negative dependence, because the information in your test set is no longer in your training set. What does it mean when you have very few positives? Let's be extreme here: the data set has three positives. Now you make a cross-validation attempt where you pick a subset and there's one positive in the test set. That means there are now only two in the training set. Let's say you had 100 to start with, but your test set has one out of 10, meaning the positive rate in the test set is very different from the original three out of 100. And the worst part is it's exactly the opposite from the training set: in the training set the base rate goes down, in the test set the base rate goes up.

Now what happens
if you build a model on this? Models are typically well calibrated, so on average the model's predicted probabilities will be equal to the base rate. If not, you have a really terrible model, so we're not even talking about that. So on this particular subset your model will predict a lower average probability when the test set in fact has a higher average probability. It looks like your model is off in terms of calibration, not because it's bad, but because you made that happen just by separating your training set in that way. So what people do is say, okay, at the very least I need to maintain the base rate. If I only have three positives, I will do three bags, one positive in each, and that way the base rate is always the same both in training and in testing, and now your evaluation no longer comes up with a spurious lack of calibration. This is an example where, because you have very few positive examples, when you cross-validate you do stratification to ensure that there's always kind of a comparison, so you almost get back to the IID ideal that you're supposed to have. So that's what you... okay, off topic. You asked about the specifics of calibration.

So I recall you describing this in a really convenient way, saying that cross-validation is a zero-sum setting. I've never heard it said quite that way. Could you share a little bit of the insight in how you came to that realization?

We can talk theoretically about stuff like IID sampling and what it means, but I think the easiest way to understand this artifact that requires a calibration is what I refer to as cross-validation being a zero-sum game. It means that whatever you choose not to have in the training set will be in the test set, and vice versa. In total you're still stuck with the fact that you have a small training set; that is your total sum. Your choice affects one or the other.

Yeah, it's interesting. Cross-validation, I imagine we would agree, is a procedure that a responsible machine learning
researcher should be using. It's very helpful for us; we can prevent overfitting and things like that. But I'd never stopped to think that I would be biasing my own data set by using it. Does this problem go away as I have a less imbalanced sample or a larger data set?

Yes. In fact, all of the artifacts that I'm describing in this paper, and that initially spurred my research on the topic, really are not important if you have thousands of examples, because at that point almost any random sample, thanks to the law of large numbers, will have on average the same percentage of positives, and you're in a good place. So all of that kind of goes away when you have enough data that you can just use a test set and don't really need cross-validation. That's the irony: when you need cross-validation the most, that's when it can also lead you astray. In some sense it goes back to there being no free lunch. If you don't have enough data, you don't have enough data, and there is no silver bullet to deal with it.

It's an interesting dilemma, because we have these tools that we're very happy with in general, and by tools I mean algorithms and methodologies and things. Just because they work in one situation or environment doesn't mean they work globally. Some of the figures you have that I thought were really helpful in making this concept clear to me were the scatter plots where you show the trained area under the curve versus the stratified cross-validation area under the curve. I know it's very difficult to talk data viz on an audio podcast, but could you share maybe a rough description of those and what story they're telling us?

In order to understand what happens when you plot AUC, area under the curve: there are different ways to use the information from your different cross-validation samples as a total assessment of the performance. One way is to say, okay, you just calculate your performance on that little
test set, across all the little test sets, and you take the average. But you may also consider: well, my test sets aren't very big either, and let's face it, any statistic you calculate on a small data set is probably not very reliable. In the same way that bagging helps you by averaging, an alternative way of reducing the variance of your estimate on the test set is to merge all the predictions on all these different test sets, to have the full out-of-sample prediction on everything, and then evaluate on that. Now you have more data, and you feel you have a more reliable sample for out-of-sample performance. But what happens when you combine the results from different test sets that were built on training sets with different baselines? The calibration will start to affect the overall ranking in a very artificial way. You will basically have examples from certain test sets all at the beginning and examples from other test sets at the end. So the total ranking is no longer random across the test sets; you will have this baseline that came from the training set carried over into these chunks, and that typically leads to an underestimate of your true performance, because it creates this inverse bias. When the training set had a high base rate, it will predict a high probability, but by definition there are very few positives in that chunk of the test set. Whenever the test set has a low prediction you have a lot of positives, and when the test set has a high average prediction you have few positives. That obviously makes your ranking look really terrible, and you get a low AUC compared to doing it correctly on a larger sample, when you had the luxury of everything. So you're underestimating your true performance.

I think this is a useful diagnostic I'm going to adopt now when I work on smaller data or rare-event problems, to look at the AUC plotted in this fashion. Do you think this is an ideal diagnostic for
catching this, or is there more we need to look out for?

In some sense my feeling is we're just maybe pushing the system too hard. If the only thing you have is 10 positives, maybe you should reconsider whether this is really what you want to do. Maybe instead you should think of completely different techniques, like transfer learning, of somehow getting a better proxy that has a higher base rate to learn from. Ultimately, all of the things we talk about are not the result of non-optimal methodology; you're really running up against what is possible at all if you're limited in your training set, and so you're pushing the boundaries maybe just an inch too far. The good news is that when you get, doing all the right things, a model that predicts worse than random, that tells you that you have probably pushed your attempt to even predict something too far. It's not that you did something wrong; it's just that this problem in its current form may not be solvable.

I very much agree with that assessment, and I think transfer learning is an interesting approach. Someone else might come along and say, oh, I'm going to fix this with regularization, or with some sort of constraint on my optimization. What are your thoughts on trying to squeeze too hard to get a little bit more juice out of the orange, so to speak?

The point of the article is that that in fact doesn't help at all. It will continue to present that problem; it just becomes, in fact, more obvious. For instance, I could say let's really regularize: I don't want any non-zero coefficient, or in the world of a tree, I don't let it split at all. It's just a stump, there's nothing there. What does the model do in that case? It can only predict a constant, which is the base rate of the training set. But the problem is exactly the same: on the test set that you have for that base rate, you have again this inverse correlation with the number of positives. So even if you regularize heavily, or if you make sure that
there was no signal to start with, the problem that you see in cross-validation, getting this inverse signal, actually persists.

I also want to ask a little bit about stacking and bagging, which are of course important themes in the paper. I don't know that I've covered stacking on the show before. Would you mind giving a basic definition?

What we talked about right now, cross-validation, as I said, is actually a good-case scenario, because all of your evaluation tells you that your model is terrible. In fact, it may tell you that your model is worse than random. That's actually good news, in the sense that you get a very clear slap on the hand to walk away. But let's talk about a second technique where, in combination, it creates a problem. You mentioned stacking, and I've used stacking a couple of times. I'm sure you have heard of ensembles before: the idea that rather than just building one model, you build multiple models. The simplest form of doing that is in a sense bagging: you build many, many different models and then you simply average the predictions. If you don't have any reason to believe that one of your models is better than the other, averaging is perfect. But if you try, say, different models, decision trees, logistic regression, all of them, there's no reason to believe that they're all equally good or bad. Maybe you want to have a second layer of model sitting on top that reweighs the evidence, the predictions that it gets from these many different submodels, and picks which one has more signal than the other. That's what people looked at in ensemble learning, and that's arguably one of the techniques that has been very prevalent in winning competitions and really pushing predictive performance to the extreme. Think of the Netflix competition: that was basically an ensemble of sorts. And the idea is actually older than that. People have thought of this in terms of what they called back then gated experts, and the idea was maybe one model
learns a specific scenario and the second model learns a different scenario, and then there has to be that expert sitting on top that acts like a gate and figures out, okay, what kind of scenario am I looking at right now, so which of those models should I listen to? This term, gated experts, was used in artificial neural network research, in the literature from about '95 through '98 that I'm aware of. So the whole point here is that you have two layers of models. The first one is the regular methodology that you know, and the second layer is another model that learns to reweigh the evidence, meaning the predictions of the first layer, to produce a final score. That's the idea of stacking.

What happens now if you combine cross-validation with stacking? Here's the problem with stacking: on what data set are you going to learn that stacking model on top, the gated expert? You obviously cannot use any data that was given to one of the original models, because now you're clearly overfitting. You are believing the models below simply because you're looking at the exact same evidence again. So what you need for the second-layer model is a different data set, a holdout set. Now we're back in the world of not having enough data; you cannot simply use the same data. One idea people may have, and I want to caution very strongly against it: well, would that problem go away if I use cross-validation on my basic models underneath? Because now all of their predictions are out of sample, right? All should be good. So now I can learn that stacked model on top of these quote-unquote out-of-sample predictions as features, on the exact same data set. And this is where the original problem of that inverse correlation introduced by cross-validation becomes a huge problem. Because while we observed that the model was worse than random as it came out of the first layer, all the second-layer model has to do is flip the sign, and it can do that. So you can in fact build a super predictive
model with that quote-unquote methodology on entirely randomly generated data, and we demonstrate this with some simulation experiments. You simulate random data, you use cross-validation to build a set of models, and you then run them through a second-layer model, which essentially just flips the sign and comes up with excellent out-of-sample performance. Except when you really get a new data set that was never looked at, it's back to performing at random.

So it seems we're very much at the edge of what's theoretically possible. There doesn't seem to be any new super algorithm that will come down the pipe one day and solve these problems for us; certain problems need a certain degree of information before we can even think about modeling them. Is that a fair characterization?

I think it relates to that point. To me, I have filed it mentally as a form of meta-overfitting, or overfitting again, because the moment you get real new data your performance is terrible. In that sense it fits that picture: you're learning something on a data set that either isn't there or isn't there in a meaningful way. From your perspective, the way you phrased the question, it falls into the category of just fooling yourself when you're trying too hard and trying to squeeze water out of a stone. And it can actually look good despite the fact that it really was a stone and there was no water.

Yeah, definitely. You'd mentioned how effective bagging has become in a lot of the data mining competitions, which is a nice transition into the leakage paper that you've also put out. You've covered a lot of different types of leakage and almost established a taxonomy of different forms of leakage, which I thought was really helpful. Could you talk a little bit about those different forms of leakage that one might want to know about?

So let me first attempt to define leakage, and that's actually not easy. In the paper we did struggle a lot with a somewhat
formal definition. The informal definition is that you are learning correctly predictive information on your data set, but the reason it is predictive is not truly reflective of the underlying data-generating process you really want to learn about; it's an artifact of a number of steps taken in the data assembly and preprocessing. In some cases it really is that you're learning the future, because there is information about what will happen in the future that snuck in because somebody prepared the data knowing about the future. Sometimes it's a form of sampling bias. The notion of leakage covers a multitude of different problems that typically relate to the fact that something about your data processing artificially created a signal that you learned. And the worst part is that the test set, which you also drew from that data set, has the exact same problem. Therefore it's not overfitting in the technical sense, because you can't detect it with out-of-sample evaluation. Just keeping a part of the data set aside will not tell you that you had a problem. I think it might be easier for listeners to relate to this through a number of examples that I have presented in the paper, so I want to share a few of them.

The first time I really saw it for what it was was on a data set in the medical domain, from a data mining competition that was run by Siemens Medical. The task was to identify breast cancer from fMRI images, and they were heavily processed, so you didn't really know what anything meant. All you had was 127 numeric features, and you had to build a model on top of them. This was a relatively straightforward task. Again we had a low base rate, the usual problems, applying some feature selection. But eventually, when we were trying to understand the data better, what we observed quite accidentally was that when you added the patient ID, meaning the 10-digit random number that was assigned to the patient, your model performance went up by easily 30%.
Now, usually I'm not in the habit of adding Social Security numbers or names to my predictive models, because they really shouldn't predict anything. They could be proxies for gender and age and race, maybe. But in the case of hospital records, where everything is anonymized and all you're getting is a quote-unquote random patient ID, maybe it correlates with time, but other than that there's no reason to believe that breast cancer is that strongly correlated with time. So what's going on here? As we spent more time, what became apparent was that there were really four quite distinct groups of patients. The patient ID wasn't really a uniform distribution; there were what I would call four buckets. The lowest bucket had about 30% cancer, the highest bucket had no cancer, and in the midsection you had something closer to the 6% base rate that the overall data set was showing. Why would having a really high patient ID be an indicator of you being safe and for sure not having breast cancer? The reason is that in order to provide enough data for this challenge, people pulled data from different sources. The data set we had was a mixture of data collected from four different locations, which doesn't sound like a bad idea in the first place, because having more data is certainly better. But what happened was that these locations correlated with the prevalence of cancer, for, say, reasons like one being a screening facility and another being an actual treatment facility. The screening facility had very, very low cancer rates; the treatment facility had very high cancer rates. Now clearly, if I wanted to predict whether a person has cancer, it does matter whether they are already being treated for cancer or not, right? So it was in some sense an artificial problem, and I should have been told that this is where the data came from. What really came to show in this exercise is that the model turned out to be, even without the patient ID,
quite predictive, but there is a lurking fact here: the location is implicitly encoded in these images, because every one of these fMRI machines has its own calibration, so they have different levels of grayscale. So all the model had learned, through the average grayscale, is where the patient was. In fact the model had no clue what to do with the images; it just backed out the location. Even without including the patient ID, I'm sure the model picked up a lot of that signal, meaning it looked a lot better than it would be if you were to use it in, say, yet a different location with a different calibration. In fact, all bets are off as to what the model is going to do. So this is an example where we really just learned something about the data collection process. And it really depends on how you want to use your model whether this thing that you learned vastly corrupts your model, or whether in fact you should have added information like location to your model even more explicitly, and then deliberately used the model with a coding of the location in one of the four specified locations. So that's one of the examples that we had on leakage.

There were other great published examples that we had observed, from research done at Amazon, where they tried to predict, for cross-selling opportunities, who's likely to buy jewelry, because hey, maybe you want to target them with jewelry. How do you find yourself a training set for cross-selling jewelry? You look in the category of jewelry: whoever bought jewelry becomes a positive, whoever didn't buy jewelry becomes a negative, and you use purchases in all the other categories to predict whether or not you bought jewelry. Sounds good?

Definitely.

Okay, here's one of the things the model found: if the sum of revenue generated across all other categories is zero, you're extremely likely to be interested in jewelry. [Laughter]

Very true, can't argue with that.

Meaning people who buy absolutely nothing are really the best
candidates for cross-selling. Now why does that happen? Well, the only way you make it into the Amazon database is if you buy something, right? They don't have records for people who never bought anything. If the only thing you bought was jewelry, then there was a clear causal relationship, not because it's causal for buying jewelry, but causal for being in the data set. If you bought nothing else, then you must have bought jewelry; that's the only way you could have made it into the data set.

Yeah, it's like a topological leakage.

Exactly. And again, this model is completely useless. All you learned is an artifact of the data collection process. It all goes away if you do data collection correctly, with timestamps, because you don't really want to predict who bought jewelry in the past. You really want to predict, given everything you bought up to now, how likely you are to buy jewelry in the next week, quarter, month, whatever. The problem goes away once you correctly frame the predictive task as moving forward in time and you correctly create your feature set and your training set according to the timestamp. But often, when you work with secondhand data that somebody hands to you, you don't have the luxury of completely recreating everything from scratch with timestamps and so on; you just take what you're given. So once data gets normalized into these CRM systems, where you have, by person, just the total sum of revenue by category, with no timestamp inside any longer, you lose that ability. The same is true in medical applications where, instead of having transaction-level data of what diagnosis was made when, you only get summary records: here's the patient, and here are all the different things that are wrong with him. You don't know what was wrong with him in what order, and then being able to diagnose or predict the likely co-occurrence of other sicknesses can easily suffer from the same problem. So we have another couple of examples
in the paper where we show that predicting pneumonia has exactly the same artifacts: if you had nothing else, for sure you had to have pneumonia. So there were a couple of artifacts that were really just the model finding everything wrong with your data set, and chances are you didn't even know about it.

Yeah, I think those are two fantastic examples of leakage that teach important lessons about being skeptical of your models and being really introspective about what they're doing. There are some other types of leakage as well. You covered a couple of cases of what I might call cheating in these data mining competitions, where competitors leverage leakage. Could you share some of the stories you have there?

Here's the thing about data mining competitions: usually the rules are clearly stated, and everything that is in the data set is fair game, as long as you can find it. Probably many of the other competitors felt that us including the patient ID in the model to predict breast cancer was a form of cheating, and I agree that if I really had to build a model for Siemens to be used, I would have designed a different solution than simply adding the patient ID. Are there particular other examples that you felt strongly about? Because I think those were really the two prime examples that we had on medical data sets, where we just recreated the artifacts of the data collection and then coded them specifically into our models to take advantage of them.

Yeah, there was one that you'd referenced that stuck very heavily in my mind. It was the case in the social network challenge where there was this anonymized social graph, and the team that won, or at least ranked very highly, was able to do so by identifying the source of that social network and retrofitting it to publicly available data, essentially de-anonymizing it.

So I think that goes beyond what I was interested in, in terms of what can actually practically happen. That, for me, goes closer to cheating
in terms of retrofitting. Social networks, or anything related to network data, really create a lot of problems when you start sampling. There are entire subfields in social network modeling that talk about what the right way of sampling is, because when you remove entities you also remove links with them. So all of a sudden you end up with a lot of orphans. But similar to the Amazon example, you know that in the beginning there were really very, very few orphans, so all the orphans are probably an artifact of sampling, where you removed the entity that they were connected to. That I would consider again a problem of data processing creating artificial information. If you really just go out and de-anonymize things, then you actually know exactly what you're doing. What I'm more fascinated with in leakage is that you don't know that you built a completely over-predictive model that will fail the test of time the moment you try to put it in production. I do assume that for practical applications people usually try to do the right thing, and what I find fascinating is that even when you try to do everything right, you walk down the wrong path. Just finding additional data is a separate conversation that probably has less practical relevance.

Yeah, that makes sense. And we could say including an ID is sort of a blunder, one should know not to do that. But to your point, if there's an artifact of the calibration, that would be a very easy mistake for someone to make.

So I look at it differently. I think the lesson here is not that I shouldn't have included the ID. The lesson is that only because I was bold enough to do so could I prove that there was something very substantially wrong with your data and that the model was not to be used in practice.

Ah, that's really interesting.

Yeah, so look at it as a form of data detective work. When you identify reasons why your model is too good to be true, that's actually a skill that's incredibly
important because the moment you wanted to use these models in the real world where now time moves forward the way it usually does they perform terribly and then the practitioners will just walk away and like what the hell I mean we paid them a lot of money to build us a model and guess what it's completely useless so I think there's a lot of skill in honing or there's an importance in honing your intuition to find issues of that sort that may make things look too good to be true oh with that in mind maybe we could wrap up with just sharing some of your insights on how one might detect a situation where they've gotten leakage in their model so one of the hardest parts is that it really requires a very good combined understanding of a the domain that you're currently working in like some medical understanding if you want alongside with kind of statistical intuition this is not about knowing how to build models and knowing all the processes it's about having a gut feeling of how well do you think you should be able to perform there was a competition where people had to predict the stock market mhm and if you build a model that has an a of above 0.55 you did something wrong yeah something is wrong maybe not you but something is wrong we had recently in our work a uh model where somebody had to build a model to predict the probability that a person clicks on an ad and he found a good 5% of people that had an above 90% probability of clicking on an ad H I cannot possibly imagine that you could ever predict that degree of certainty that a person will click on an ad it just it just doesn't sit right with my intuition it's just not possible period so things about human behavior there are certain things that are really hard to predict I have no idea whether I will have pizza tonight I don't know if I don't know then what machine could possibly predict this yes you can do better than random but not much better on the other hand people have shown that for instance on the 
Facebook data set predicting things related to for instance sexual preferences you can get very high accuracies on that one so there is a real skill in understanding your problem well enough to form a good intuition as where the upper bound should be given the data that you have and this is really what guides a good or a great data scientist's work is that intuition that something smells weird because it's just too good to be true you want to go back and triple check to make sure it's exactly what you want it to be absolutely I think that's great advice some more great advice I would give is that everyone go to the show notes and uh follow the links to both these papers they're uh excellent reading for senior and junior junior data scientists alike so really appreciate you taking time to come on the show and share your insights it was a lot of fun I appreciate it excellent well thanks again Claudia take care for more on this episode visit datas skeptic.com if you enjoyed the show please give us a review on itun or [Music] Stitcher

Original Description

This week is an insightful discussion with Claudia Perlich about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our discussion covers some novel observations about ROC and AUC, as well as an informative discussion of leakage. Much of our discussion is inspired by two excellent papers Claudia authored: Leakage in Data Mining: Formulation, Detection, and Avoidance and On Cross Validation and Stacking: Building Seemingly Predictive Models on Random Data. Both are highly recommended reading!



This video explains how to detect and avoid common pitfalls in machine learning, such as leakage and overfitting, and how models can appear predictive even when they are built on random data. The key takeaways include the importance of cross-validation, stratification, and regularization in model evaluation and the need for domain understanding and statistical intuition to detect potential issues.

Key Takeaways

Evaluate machine learning models on unseen data
Use cross-validation when data is too limited to hold out a separate test set
Split the data set into N chunks for cross-validation
Train on N - 1 chunks and evaluate on the held-out chunk
Use stratification to maintain the base rate in each chunk when there are few positive examples
Detect leakage by honing intuition and understanding the domain
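
The chunking and stratification steps in the list above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the method used in the papers: the helper names `stratified_folds` and `train_test_splits`, the 10% base rate, and the choice of 5 folds are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign example indices to folds so each fold keeps roughly the overall base rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists: train on N - 1 chunks, test on the held-out chunk."""
    for held_out in range(len(folds)):
        test = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test


# Hypothetical data: 100 examples with a 10% base rate (few positives).
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, n_folds=5)

for train, test in train_test_splits(folds):
    positives_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), positives_in_test)
```

With these numbers every held-out chunk has 20 examples, 2 of them positive, so the base rate in each test chunk matches the base rate of the full data set. Without stratification, a random split of so few positives could easily leave some chunks with none at all.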

💡 A model that looks too good to be true may be benefiting from leakage; domain understanding and statistical intuition are what help you detect it
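
One way to turn the "too good to be true" intuition into a routine check is to score each feature on its own and flag any whose AUC exceeds what the domain makes plausible. The sketch below is an assumption-laden illustration, not a procedure from the episode: the `suspicious_features` helper, the synthetic `record_id` column (where positives happen to have higher IDs, mimicking a data-collection artifact), and the ceiling of 0.8 are all hypothetical. In a real project the ceiling would come from your own sense of the upper bound for the problem.

```python
import random


def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def suspicious_features(columns, labels, ceiling):
    """Return features whose single-feature AUC (in either direction) exceeds the ceiling."""
    flagged = {}
    for name, values in columns.items():
        a = auc(values, labels)
        strength = max(a, 1.0 - a)  # an AUC far below 0.5 is just as informative
        if strength > ceiling:
            flagged[name] = round(strength, 3)
    return flagged


# Synthetic example: positives were (hypothetically) collected from a source
# that assigns higher IDs, so the ID alone separates the classes.
rng = random.Random(0)
labels = [1] * 50 + [0] * 150
columns = {
    "record_id": [rng.randrange(1000, 2000) if y else rng.randrange(0, 1000) for y in labels],
    "noise": [rng.random() for _ in labels],
}

print(suspicious_features(columns, labels, ceiling=0.8))
```

A flagged feature is not proof of leakage, only a prompt to go back and check how that column was produced. As in the ID example from the conversation, a meaningless identifier that predicts the label is a strong hint that something in the data preparation is wrong.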
