[MINI] The Bootstrap
Key Takeaways
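The episode walks through the bootstrap: resampling a dataset with replacement, repeating an estimate or model fit on each resample, and averaging the results to possibly refine accuracy and produce useful metrics such as confidence intervals. The discussion leans on the central limit theorem and connects the technique to bagging (bootstrap aggregation) algorithms such as Random Forest.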
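Sampling with replacement means a resample of 100 points almost always repeats some points and leaves others out. The small simulation below is an illustrative sketch only (the function name `distinct_fraction` and the trial count are assumptions, not something from the episode). It compares the observed share of distinct points with 1 - (1 - 1/n)^n, the probability that any given point is drawn at least once.

```python
import random

def distinct_fraction(n, rng):
    """Draw n indices with replacement from range(n); return the share that are distinct."""
    draws = rng.choices(range(n), k=n)
    return len(set(draws)) / n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100
fractions = [distinct_fraction(n, rng) for _ in range(1000)]
average_distinct = sum(fractions) / len(fractions)

# Probability that a given point shows up at least once in n draws
expected = 1 - (1 - 1 / n) ** n

print(round(average_distinct, 3), round(expected, 3))
```

For n = 100 the expected value is roughly 0.63, so a bit more than a third of the original points are missing from a typical resample.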
Full Transcript
[Music] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is the bootstrap. [Music]

So, I don't know if you know this, but during the last episode I announced that we won't be talking about the election results for a little while. I want to do stuff on polling, but not just yet.

Well, you're telling me right now, so, good to know.

I'm going to sort of break that rule a little bit, 'cause I want to talk about polling in general. The more people you get, the better your polls are going to be. You know, if you could poll every single person in the United States, essentially get to n equals all, then you'd have a very nice poll. But that's unrealistic for a lot of reasons. It's a similar story in a lot of data science: you can't always get a big enough sample size.

You can't always get what you want.

Yeah, pretty close. I think that's exactly what those guys meant when they wrote that. The bootstrap: let's first talk about the technique. It is a technique where you have a data set, right? Let's say you have like 100 sample data points.

Yes.

Put them into a bag, shuffle it all up, pull out, you know, one random element, take note of what it was, then put it back in.

You put it back in?

Yeah, you put it back in.

Why? What if you draw it twice?

Well, let's get to it. That's a good question, but let's just talk through the method first, 'cause you will draw it twice. Shake it all up, draw again, until you draw 100 out. How many duplicates do you think you're going to get?

Well, there's 100, then you put it back the first time, the second time it's one out of 100, right?

As you draw a big enough set, it's almost certain you're going to have some duplicates, because there's only one way to draw it uniquely, and that's very unlikely with a large enough set. So then you have this, let's call it a secondary sample, and you could train a model or make a decision based on that. Then you could start from scratch, repeat that whole process, do it again and again, and in this way you could make n number of models based on this sample of a sample. Let's not worry just yet about why you would do that. Is it clear, the process I'm describing?

Sure.

Okay. Now, at the end of that, you could average out the results of all those models based on all those weird double-sampled data sets with replacement, and the average of those, often, by the central limit theorem, is going to get close to the correct answer.

No. Why?

Each of them represents a different sort of weighting of your initial data set, so some will, like, lean in one direction and others will lean the other way. Like, let's say we wanted to estimate the average height of women living in Los Angeles, and you went and surveyed 100 women. Now, do you think 100 women adequately represent the entire...

No, there's like billions of people here. Or is it millions?

It's millions. Let's just say you took 100 women, surveyed them on their height, and then took the average. How close is that average going to be to the true average?

The average? You surveyed everyone, took the average, that is the average.

No, no, you surveyed and you got like 100 women to participate.

Oh, to the true average. Yeah, I guess it would be within 25%.

Yeah, probably something like that. This is why statistics are useful: you can have a small sample and help predict the larger population. As that sample size grows, of course, you have a confidence interval that gets closer, but it could be very easily thrown off, right?

How would it get thrown off?

Well, you could be somewhere where there's a lot of tall models hanging out.

Mhm. What about the short models? Where do they hang out?

They don't have any.

There must be some.

No, people don't want... Who's going to be in a movie with Tom Cruise? He's quite short.

Oh, they just put him on a box.

Or similarly, you could end up with, you know, maybe the WNBA is in town and some especially tall woman just happens to be in the survey. She's going to skew it, right?

Yeah.

If you did a survey and 10 of your friends did a survey, and they all got 100 different women, then can you see how averaging all your surveys would produce a better result?

Maybe. Yeah, it should.

But in that case, all of you went out and got 100 different people. What if you all talked to the same 100 people? You would think you'd get the same answer, right, by just averaging all of them?

Yes, you would, because everyone surveyed the same hundred.

Right. Yeah, that's right. So essentially the bootstrap is saying that with this method... now, there's no guarantee it can work; sometimes it doesn't work, and we'll talk a little bit more about when and why it wouldn't work. But it's possible you get a better answer without introducing any new data by using this method of resampling from your sample. Here's my best way of thinking about why it does work. So it's not guaranteed to work all the time, but if your sample is kind of, like, low fidelity, in that it doesn't bring in enough examples to adequately represent the population, then this resampling means that on each trial you're going to have a different propensity for picking out certain cases that represent some subpopulation or characteristics of people. In this height example, if you had a couple of especially tall people in there, just having one or two can highly skew your result a little bit on a small data set. But then they're also only, again, two people in that data set, so on some of your trials you'll sample them and on others you won't. So you're kind of, like, downweighting them implicitly, but the difference is you don't have to decide who to weight; it just comes out naturally in the statistics, and the central limit theorem kind of kicks in. You'll recall we talked a while ago about random forest, which does bagging, which is bootstrap aggregation. That's the bootstrap we're talking about here. That's a case of machine learning where you're trying to train a model, and each model gets its data set from this procedure of sampling with replacement.

Yeah, I mean, I guess you might as well randomize it again.

So when I first heard about this, I thought, like, well, this probably doesn't hurt, but why would it help?

'Cause random.

Right, but it's the same amount of data as before. You didn't get any new information.

But you're putting them back in and randomizing. Yeah, that's different data.

You're right, it gives you a different sample each time, but why should that result in any new information?

'Cause it's random.

I'm trying to decide if you really grasp this deeply or not.

It's like distillation. You distill it once, you distill it twice, you're going to get a different result the second distillation.

Yeah, in a way it is, I guess, like distillation, where you heat it up, turn all the liquids into a vapor, and then you try and capture it at the other end. Different things are going to come through. Yeah, that's sort of a way of doing it.

Oh, and then what do they do, alcohol producers, you know, when they do blending at the end? Why do they do that?

Consistency.

Yeah. So to extend your analogy there, if you're really distilling it, then it's mixed later, so you're getting the mix of all these different things and you get a better product from its components. So yeah, maybe we should have started with distillation. That kind of has a nice analogous feel, although I don't know that it works mechanically.

Well then, I think we should have a drink to cheers it. So, Kyle, why is it called the bootstrap?

The idea of the bootstrap is, like, if your boot has one of those straps on the back and you pull it to quickly get it on, rather than trying to wiggle your heel into a shoe. This is a technique by which you could very quickly get something more out of your data.

Did you just make that up?

No, no, that's basically the real answer. Yeah, I didn't say he was a poet; he was a statistician.

I don't know, I just thought of it like, oh, you're just a working person, so you wear boots, so you bootstrap it or something.

All right, we can go with that one too. So, right, let's see how exactly you got it. Let me think if I can come up with a demo here. Hold on. How about this, Linda: let's talk about an experiment. You want to determine if men buy more expensive cars than women, or what's the difference between cars typically bought by the genders. Do you think we buy... actually, do you think there's a difference?

Yeah, I'm going to say men spend more money on their cars.

Yeah, that's probably right. Almost for sure, right? How much would you guess? What's the dollar amount difference?

It's got to be, like, between 10 and 20 grand on average.

Yeah, I would bet.

Right, 'cause if you just go up to the next tier, it's just more expensive.

All right, so you could survey 100 women and 100 men, ask them, you know, how much they paid for their current vehicle. You could take the average of the two and subtract them, and you get a nice estimate, right? Out of those 100 people, just 100 random men, how many do you think own, like, Bugattis and stuff like that?

I don't know. You don't see that many around, so probably zero out of 100.

Yep, but there's a chance you might randomly sample one of those. Or, I mean, that's kind of an extreme car, but even out of 100 people there's some probability you're going to get one outlier, you know, somebody that's driving some very expensive, fancy sports car that's sort of in its own class. And there's also a chance you're going to get some guys that are driving hoopties, right?

Mhm.

If you were going to use the bootstrap, how do you do it?

You just randomize and keep pulling from your bag to see, put it back, and keep pulling back.

There you go. And when you average all of those, there's no guarantee, 'cause if your model is truly an independent, identically distributed sample of the full population, then the bootstrap would have very little effect. But with some likelihood, the bootstrap might give you a slightly more precise answer, and in fact it can also give you confidence intervals, because with each one of those averages the central limit theorem kicks in, and you can say, like, all right, well, the average of the averages is my prediction, but you also know the standard deviation of all the averages, so now you have a nice little boundary on the estimate you're trying to create. So for me, I think the rule of thumb is: I expect the bootstrap is likely to help me when I don't have a big enough sample data set to sufficiently describe the system. I've just got some set that represents some lower-fidelity example of it. Well, anyway, that's the bootstrap. All right, Linda, well, thanks as always for joining me.

Thank you for teaching me about shoes.

Until next time, I want to remind everyone to keep thinking skeptically of and with data. Good night, Linda.

Or shoes. Good night.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. [Music]
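The procedure described in the episode (draw 100 points with replacement, compute an average, repeat, then look at the average and the standard deviation of those averages) can be sketched in a few lines of Python. This is a minimal sketch, not code from the show: the height data is synthetic, and the names `bootstrap_means` and `heights_cm`, the 2,000 resamples, and the percentile interval are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the mean of each resample drawn with replacement from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # draw, note it, put it back, n times
        means.append(statistics.fmean(resample))
    return means

# Hypothetical survey: 100 synthetic heights in centimetres (not real data)
data_rng = random.Random(42)
heights_cm = [data_rng.gauss(162.0, 7.0) for _ in range(100)]

means = bootstrap_means(heights_cm)

estimate = statistics.fmean(means)   # "the average of the averages"
std_error = statistics.stdev(means)  # spread of the resampled averages

# Rough 95% percentile interval taken from the sorted resampled means
sorted_means = sorted(means)
lower = sorted_means[int(0.025 * len(sorted_means))]
upper = sorted_means[int(0.975 * len(sorted_means)) - 1]

print(f"estimate={estimate:.2f} std_error={std_error:.2f} "
      f"interval=({lower:.2f}, {upper:.2f})")
```

For a plain average, the mean of the resampled means lands very close to the original sample mean, so the practical payoff in this sketch is the spread: the standard deviation of the resampled means and the interval built from them.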
Original Description
The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique related to polling and surveys.
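Bagging can be illustrated with a small sketch: each model is fit on its own resample drawn with replacement, and the models' predictions are averaged. The example below assumes a toy one-split regression "stump" as the base model, which is far simpler than the decision trees Random Forest uses, and it leaves out Random Forest's random feature selection. The synthetic data, the names `fit_stump` and `bagged_predict`, and the choice of 25 models are hypothetical.

```python
import random
import statistics

def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs and return a predict function."""
    points = sorted(points)
    best = None  # (squared error, threshold, left mean, right mean)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean = statistics.fmean(left)
        right_mean = statistics.fmean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x was identical: fall back to the overall mean
        overall = statistics.fmean([y for _, y in points])
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

# Hypothetical noisy data: y is roughly 2 * x
rng = random.Random(1)
data = [(x / 10, 2 * (x / 10) + rng.gauss(0, 1)) for x in range(100)]

# Bootstrap step: every model sees a different resample of the same data
models = []
for _ in range(25):
    resample = rng.choices(data, k=len(data))
    models.append(fit_stump(resample))

def bagged_predict(x):
    """Aggregation step: average the predictions of all the stumps."""
    return statistics.fmean([model(x) for model in models])

print(bagged_predict(1.0), bagged_predict(9.0))
```

Because each stump picks its split from a slightly different resample, the averaged prediction changes more gradually across x than any single stump does.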