[MINI] The Bootstrap

Data Skeptic · Intermediate · 📊 Data Analytics & Business Intelligence · 9y ago

Key Takeaways

The episode walks through the bootstrap: resample a dataset with replacement, compute a statistic or train a model on each resample, and average the results. The central limit theorem is what lets the spread of those averages serve as a confidence interval, and the same resampling idea underlies bagging algorithms such as Random Forest.

Full Transcript

[Music] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is the bootstrap. [Music]

So, I don't know if you know this, but during the last episode I announced that we won't be talking about the election results for a little while. I want to do stuff on polling, but not just yet.

Well, you're telling me right now, so good to know.

I'm going to sort of break that rule a little bit, because I want to talk about polling in general. The more people you get, the better your polls are going to be. If you could poll every single person in the United States, essentially get to n equals all, then you'd have a very nice poll. But that's unrealistic for a lot of reasons. It's a similar story in a lot of data science: you can't always get a big enough sample size.

Can't always get what you want.

Yeah, pretty close. I think that's exactly what those guys meant when they wrote that. The bootstrap: let's first talk about the technique. It is a technique where you have a data set, right? Let's say you have 100 sample data points.

Yes.

Put them into a bag, shuffle it all up, pull out one random element, take note of what it was, then put it back in.

You put it back in?

Yeah, you put it back in.

Why? What if you draw it twice?

Well, let's get to it. That's a good question, but let's just talk through the method first, because you will draw it twice. Shake it all up, draw again, until you've drawn 100 out. How many duplicates do you think you're going to get?

Well, there's 100, then you put it back the first time. The second time it's one out of 100.

Right. As you draw a big enough set, it's almost certain you're going to have some duplicates, because there's only one way to draw every element uniquely, and that's very unlikely with a large enough set. So then you have this, let's call it a secondary sample, and you could train a model or make a decision based on that. Then you could start from scratch, repeat that whole process, do it again and again, and in this way you could make n models based on this sample of a sample. Let's not worry just yet about why you would do that. Is the process I'm describing clear?

Sure.

Okay. Now, at the end of that, you could average out the results of all those models based on all those weird double-sampled data sets drawn with replacement, and the average of those, often, by the central limit theorem, is going to get close to the correct answer. Now, why? Each of them represents a different sort of weighting of your initial data set, so some will lean in one direction and others will lean the other way. Let's say we wanted to estimate the average height of women living in Los Angeles, and you went and surveyed 100 women. Now, do you think 100 women adequately represent the entire...

No, there's like billions of people here. Or is it millions?

It's millions. Let's just say you took 100 women, surveyed them on their height, and then took the average. How close is that average going to be to the true average?

The average? You surveyed everyone, took the average; that is the average.

No, no, you surveyed and you got 100 women to participate.

Oh, to the true average. Yeah, I guess it would be within 25%.

Yeah, probably something like that. This is why statistics are useful: you can have a small sample and use it to help predict the larger population. As that sample size grows, of course, you have a confidence interval that gets tighter. But it could be very easily thrown off, right?

How would it get thrown off?

Well, you could be somewhere where there's a lot of tall models hanging out.

Mhm. What about the short models? Where do they hang out?

They don't have any.

There must be some.

No, people don't want... Who's going to be in a movie with Tom Cruise? He's quite short.

Oh, they just put him on a box.

Or similarly, maybe the WNBA is in town and some especially tall woman just happens to be in the survey. She's going to skew it, right?

Yeah.

If you did a survey, and 10 of your friends did a survey, and they all got 100 different women, can you see how averaging all your surveys would produce a better result?

Maybe. Yeah, it should.

But in that case all of you went out and got 100 different people. What if you all talked to the same 100 people? You would think you'd get the same answer by just averaging all of them, right?

Yes, you would, because everyone surveyed the same hundred.

Right, yeah, that's right. So essentially the bootstrap is saying that with this method... now, there's no guarantee it works. Sometimes it doesn't work, and we'll talk a little bit more about when and why it wouldn't work. But it's possible you get a better answer, without introducing any new data, by using this method of resampling from your sample.

Here's my best way of thinking about why it does work. It's not guaranteed to work all the time, but if your sample is kind of low fidelity, in that it doesn't bring in enough examples to adequately represent the population, then this resampling means that on each trial you're going to have a different propensity for picking out certain cases that represent some subpopulation or characteristic of people. In this height example, if you had a couple of especially tall people in there, just having one or two can skew your result on a small data set. But they're also only two people in that data set, so on some of your trials you'll sample them and on others you won't. So you're kind of down-weighting them implicitly, but the difference is you don't have to decide who to weight. It just comes out naturally in the statistics, and the central limit theorem kind of kicks in.

You'll recall we talked a while ago about Random Forest, which does bagging, which is bootstrap aggregation. That's the bootstrap we're talking about here. That's a case of machine learning where you're trying to train a model, and each model gets its data set from this procedure of sampling with replacement.

Yeah, I mean, I guess you might as well randomize it again.

So when I first heard about this, I thought, well, this probably doesn't hurt, but why would it help?

Because it's random.

Right, but it's the same amount of data as before. You didn't get any new information.

But you're putting them back in and randomizing. That's different data.

You're right, it gives you a different sample each time. But why should that result in any new information?

Because it's random.

I'm trying to decide if you really grasp this deeply or not.

It's like distillation. You distill it once, you distill it twice, you're going to get a different result the second distillation.

Yeah, in a way it is, I guess, like distillation, where you heat it up, turn all the liquids into a vapor, and then you try and capture it at the other end. Different things are going to come through. Yeah, that's sort of a way of doing it.

Oh, and then what do they do, alcohol producers, when they do blending at the end? Why do they do that?

Consistency.

Yeah. So to extend your analogy there, if you're really distilling it, then it's mixed later, so you're getting the mix of all these different things and you get a better product from its components. So yeah, maybe we should have started with distillation. That kind of has a nice analogous feel, although I don't know that it works mechanically.

Well, then I think we should have a drink to toast it. So, Kyle, why is it called the bootstrap?

The idea of the bootstrap is, if your boot has one of those straps on the back, you pull it to quickly get it on rather than trying to wiggle your heel into a shoe. This is a technique by which you could very quickly get something more out of your data.

Did you just make that up?

No, no, that's basically the real answer. I didn't say he was a poet. He was a statistician.

I don't know, I just thought of it like, oh, you're just a working person, so you wear boots, so you bootstrap it or something.

All right, we can go with that one too. So let's see how exactly you got it. Let me think if I can come up with a demo here. Hold on. How about this, Linda: let's talk about an experiment. You want to determine if men buy more expensive cars than women, or what the difference is between cars typically bought by the genders. Do you think there's a difference?

Yeah, I'm going to say men spend more money on their cars.

Yeah, that's probably right, almost for sure. How much would you guess? What's the dollar amount difference?

It's got to be between 10 and 20 grand on average.

Yeah, I would bet.

Because if you just go up to the next tier, it's just more expensive.

All right, so you could survey 100 women and 100 men, ask them how much they paid for their current vehicle, take the average of each group and subtract them, and you get a nice estimate, right? Out of those 100 people, just 100 random men, how many do you think own Bugattis and stuff like that?

I don't know. You don't see that many around, so probably zero out of 100.

Yep, but there's a chance you might randomly sample one of those. I mean, that's kind of an extreme car, but even out of 100 people there's some probability you're going to get one outlier, somebody that's driving some very expensive, fancy sports car that's sort of in its own class. And there's also a chance you're going to get some guys that are driving hoopties, right?

Mhm.

If you were going to use the bootstrap, how do you do it?

You just randomize and keep pulling from your bag, put it back, and keep pulling.

There you go. And when you average all of those, there's no guarantee, because if your sample is truly an independent, identically distributed sample of the full population, the bootstrap would have very little effect. But with some likelihood the bootstrap might give you a slightly more precise answer, and in fact it can also give you confidence intervals. Take each one of those averages, then the central limit theorem kicks in, and you can say: all right, the average of the averages is my prediction, but I also know the standard deviation of all the averages, so now I have a nice little boundary on the estimate I'm trying to create.

So for me, I think the rule of thumb is: I expect the bootstrap is likely to help me when I don't have a big enough sample data set to sufficiently describe the system. I've just got some set that represents some lower-fidelity example of it. Well, anyway, that's the bootstrap. All right, Linda, thanks as always for joining me.

Thank you for teaching me about shoes.

Until next time, I want to remind everyone to keep thinking skeptically of and with data.

Good night... or shoes.

Good night.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. [Music]
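
The procedure described in the transcript (draw with replacement until the resample is as large as the original, average it, repeat, then look at the average and standard deviation of those averages) can be sketched roughly as follows. This is a minimal illustration, not the hosts' code: the heights are made-up numbers, and `bootstrap_means`, `percentile_interval`, and the choice of 2000 resamples are assumptions introduced here.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the mean of each bootstrap resample of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement: the "put it back in the bag" step.
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    return means


def percentile_interval(values, level=0.95):
    """Simple percentile interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical survey: 100 made-up heights in centimetres (not real data).
data_rng = random.Random(42)
heights = [data_rng.gauss(163, 7) for _ in range(100)]

means = bootstrap_means(heights)
estimate = statistics.fmean(means)   # "average of the averages"
spread = statistics.stdev(means)     # standard deviation of the averages
low, high = percentile_interval(means)
print(f"estimate={estimate:.1f} cm, spread={spread:.2f}, 95% interval=({low:.1f}, {high:.1f})")
```

In a sketch like this the average of the bootstrap means lands very close to the plain sample mean; what the resampling mainly adds is the spread and the interval around the estimate.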

Original Description

The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique related to polling and surveys.
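
To make the bagging connection concrete, here is a small sketch of bootstrap aggregation. It is a simplification under stated assumptions: a least-squares line stands in for the decision trees a Random Forest would use (and a real Random Forest also subsamples features), the training data is invented, and `fit_line`, `bagged_lines`, and `bagged_predict` are hypothetical helper names.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x equal): fall back to a flat line.
        return 0.0, y_bar
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def bagged_lines(points, n_models=200, seed=0):
    """Fit one line per bootstrap resample of the training points."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(points, k=len(points))) for _ in range(n_models)]


def bagged_predict(models, x):
    """Aggregate by averaging the individual models' predictions."""
    return statistics.fmean(slope * x + intercept for slope, intercept in models)


# Made-up training data: y = 2x + 1 plus noise (illustrative only).
data_rng = random.Random(1)
train = [(x, 2 * x + 1 + data_rng.gauss(0, 1)) for x in range(30)]

models = bagged_lines(train)
prediction = bagged_predict(models, 10)
print(round(prediction, 2))
```

Each model sees a differently weighted version of the same 30 points, and the final prediction is the average over all of them, which is the "aggregation" half of bootstrap aggregation.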

The Bootstrap method is a statistical technique for resampling datasets to refine accuracy and produce useful metrics, related to Bagging algorithms like Random Forest. It leverages sampling with replacement and the central limit theorem to provide confidence intervals and estimate differences between groups.
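
For the "difference between groups" use, one possible sketch follows the car-price example from the episode: resample each group separately, take the difference of the two averages, and repeat. The prices below are invented (drawn from a lognormal so that a few expensive outliers appear) and say nothing about real buyers; `bootstrap_mean_difference` and the group names are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_difference(group_a, group_b, n_resamples=2000, seed=0):
    """Sorted bootstrap distribution of mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group separately, with replacement, at its original size.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(a) - statistics.fmean(b))
    return sorted(diffs)


# Invented prices in dollars for two surveyed groups of 100 (not real data).
data_rng = random.Random(7)
group_a_prices = [data_rng.lognormvariate(10.3, 0.5) for _ in range(100)]
group_b_prices = [data_rng.lognormvariate(10.1, 0.5) for _ in range(100)]

diffs = bootstrap_mean_difference(group_a_prices, group_b_prices)
observed = statistics.fmean(group_a_prices) - statistics.fmean(group_b_prices)
# Percentile interval: cut 2.5% off each end of the sorted differences.
low = diffs[int(0.025 * len(diffs))]
high = diffs[int(0.975 * len(diffs)) - 1]
print(f"observed difference={observed:,.0f}, 95% interval=({low:,.0f}, {high:,.0f})")
```

If the interval comfortably excludes zero, the sample supports a real difference between the groups; a single outlier (the Bugatti in the episode) shows up in only some resamples and so widens the interval rather than silently shifting the estimate.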

Key Takeaways

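Apply the bootstrap by repeatedly resampling a dataset
Sample with replacement so that each resample weights the original observations differently
Average the results across resamples; by the central limit theorem those averages cluster around the estimate
Use the spread of the resampled averages to produce confidence intervals
Let resampling implicitly down-weight outliers, since they appear in only some resamples
Watch for sampling bias and skew in a small sample; resampling adds no new information
Recognize the bootstrap in bagging (bootstrap aggregation) algorithms such as Random Forest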
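
On the question raised in the episode of how many duplicates a resample contains: a given item is missed on one draw with probability 1 - 1/n, so it appears at least once in n draws with probability 1 - (1 - 1/n)^n, which is about 63% for n = 100. The short simulation below checks that figure; `unique_fraction` and the 5000 trials are illustrative choices, not something taken from the episode.

```python
import random


def unique_fraction(n, rng):
    """Fraction of the n original items that appear in one resample of size n."""
    draws = rng.choices(range(n), k=n)
    return len(set(draws)) / n


rng = random.Random(0)
n = 100
trials = 5000
simulated = sum(unique_fraction(n, rng) for _ in range(trials)) / trials
expected = 1 - (1 - 1 / n) ** n  # chance a given item is drawn at least once
print(f"simulated={simulated:.3f}, expected={expected:.3f}")
```

So a typical resample of 100 points leaves out roughly a third of the originals and repeats others, which is why each resample weights the data differently.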

💡 The bootstrap resamples a dataset at random, with replacement, to get more out of a small sample. It can help estimate the difference between two groups and put a confidence interval around that estimate.
