[MINI] Heteroskedasticity

Data Skeptic · Intermediate ·🔍 RAG & Vector Search ·9y ago

Key Takeaways

The episode discusses heteroskedasticity in the context of linear regression, using a chart of traffic tickets per person against average household income, and highlights the importance of checking for unequal variance in data analysis.

Full Transcript

[Music]

Host: Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Our topic for today is heteroskedasticity. How do you pronounce it?

Linda: Heteroskedasticity.

Host: That's pretty good, actually. Before we get directly into our topic, I just wanted to explore this particular data visualization that I saw online. Listeners can also see it in the show notes for this episode. Maybe you could just describe in your own words what you see there.

Linda: There is a graph.

Host: What does the x-axis represent?

Linda: The average household income.

Host: Presumably for a particular ZIP code or city that this data set is based upon. And then what does the y say?

Linda: I don't understand.

Host: That is the average number of tickets issued per person in that geographic area. Tickets from the police, obviously.

Linda: So from this chart it looks like the more money you make, the less traffic tickets you have.

Host: Well, there's also one thing that's implying that. What is present in this chart that might make...

Linda: Oh, there's a line.

Host: Yeah. So if the line wasn't there, maybe you can block it out in your mind. What do you see then, without the line?

Linda: I don't know, I just see a bunch of dots. It's hard. It seems to suggest, though, as you make more money you just get less tickets or something.

Host: Yes, indeed. So there is absolutely a claim here. I would say that the person who created this graph is making the claim that more traffic tickets are issued per person in ZIP codes with lower average income. Now, I have a lot of problems with this graph, and I thought it would be fun to talk them through and look at some of the assumptions. This is charting the household incomes of people against the number of tickets in those geographic areas, so we could be falling into something called the ecological fallacy, but we'll leave that as a subject for another mini episode. That's why I make the point of saying you could live there or you could be driving through there. This is just measuring the amount of tickets issued and then comparing it to the population there, which are not necessarily the same people we're talking about.

Linda: When I was younger I didn't make much money...

Host: Mhm.

Linda: ...and I'll admit I got more tickets.

Host: Well, there you go.

Linda: I was not as safe a driver, and I made more erratic decisions, and I was not patient.

Host: So you make more money now and you're better at issuing bribes, is that what I'm hearing?

Linda: So nowadays I make more money than eight years ago, and I'm definitely a much more cautious driver. Because I've had so many tickets in the past, I've just decided that tickets aren't worth it, so I just don't like to speed. If something doesn't seem safe, I wait. I'm much more patient. And so yes, I make more money, but I think the solution is that I'm just a more experienced driver.

Host: Maybe we would say that higher income correlates with older age, as does responsibility in driving. Could that be the case?

Linda: Yeah, I mean, we could all admit there's a high percentage of 16-year-olds that get tickets.

Host: Yeah, and car accidents. So looking at it that way, this relationship seems plausible. Like I said, I have some problems with this plot, and I want to first call out something called Hyman's categorical imperative, which says that before you investigate if a phenomenon is true, you should first establish: is the phenomenon even real? Here's one of the problems I have with this plot. This makes a linear assumption. It assumes that the right model to describe this is that as income goes up, the chances you'll get a ticket go down linearly.

Linda: Mhm. I don't think it would be linear, no.

Host: Yeah, so I don't think it'd be linear either, and this is an oversight I see a lot of people make. They'll go into Excel and they'll just add some trend line without establishing if the underlying data appears to be linear or not. Actually, let's just talk more about what the plot shows. So the dots are the actual data. That part is pretty clear, right?

Linda: Yes.

Host: And we could say the line is the person's model. That's what they think is the relationship between average household income and the chances of getting a ticket.

Linda: I mean, if you think about the average income, there's just more people who make the average income than people who are at the higher tiers anyway.

Host: Yeah, so this is definitely an imbalanced sample. It appears the creator of this plot just did an ordinary least squares regression. That's going to bias us towards the most dense areas of data.

Linda: Mhm.

Host: It's going to be less biased by the outliers. So let's talk about some of those outliers. What do you notice about the data points of highest income, above about 85,000?

Linda: They're above that line.

Host: Yeah, all of them, right? Another way to say that is that this model consistently underpredicts for all high incomes. Do you think that's a property of a good model?

Linda: Well, I don't know. You said on previous shows if it's too predictive, it's a bad model.

Host: Yeah, what you're referring to is the risk of overfitting. I definitely would not say this overfits the data. Does it underfit it? Maybe. That would be the claim of assuming it's linear when it's not. But let's also look at the lowest data points. What can we say about all the data points making below 40,000?

Linda: Well, they're all above the line too. Almost all of them.

Host: Yeah, all but two. It seems that this fit doesn't adequately capture incomes below 40,000 or incomes above about 85,000, leading us to believe that maybe a linear fit wasn't the best choice for this.

Linda: Mhm.

Host: Now, there's another thing we do more generally: we look at what are called the residuals. We've talked about that before, but I'm not sure if we talk about it enough. Do you know what residuals are?

Linda: No, but I would assume it means left over.

Host: Left over after you subtract: the actual observed data minus the model's prediction. So imagine each of those dots. The model is trying to predict those. Measure the difference between the dot and the line and you get the residuals. That's the amount of error, what the model is not accounting for. So whenever there's a pattern in the residuals, that tells you that your model failed to account for something, because if your model accounted for everything that's predictable, your residuals would look like white noise, just random data. That leads us to this other point. What I wanted to talk about here is heteroskedasticity. So heteroskedasticity is the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. Is that a clear definition for you?

Linda: No, I don't know what that means.

Host: Let's say you asked a bunch of people to measure a pencil, and they all had different rulers, right? Everyone has a different tape measure at home. How much do you think their measurements would vary?

Linda: Everyone's pencil is a different length if you think of the old-fashioned pencil, but if it's a mechanical pencil, they're probably all the same length.

Host: Yeah, so let's say they all have the same mechanical pencil. Although, good point: there's variance due to manufacturing and there's also variance due to the measurement.

Linda: Oh, okay. They'd probably be give or take, you know, like 5 mm. It'd be very small variance.

Host: That's the variance. Now what if, on the other hand, you ask them to identify a neighborhood cat and measure its length from nose to tail?

Linda: So it's any neighbor's cat?

Host: Yeah, any neighbor's cat.

Linda: Well, that will vary greatly.

Host: Now what do you think if we separated the cats by how old they were? Do you think there's as much variance in kittens as there is in adults?

Linda: I don't know, I don't look at kittens' tails.

Host: So I'm going to go out on a limb and guess that for kittens there's less variance, because there's a certain maximum size of a kitten, right? I don't know what the biggest kitten would be, but even with humans, no human is born 3 feet tall. So since they're smaller, there's less of a chance for variance there, whereas as they get to adulthood you start to see different genetic traits and breed traits, where you can have really big cats like a Maine Coon and really small ones and stuff like that. So the variance in size increases as the age of the cat increases, up to adulthood, I presume. That is heteroskedasticity: it's when the variance is conditioned on some other variable, in this case age. Heteroskedasticity is another thing that, surprisingly, you can actually ignore a lot and get pretty far, because it generally won't bias a fit like the ordinary least squares regression we see here. But it does definitely screw up analysis of variance, which we talked about kind of recently. Heteroskedasticity is one of the things you have to check for in your data and sometimes try and smooth out, maybe with a logarithmic transformation or something like that. So, truth be told, it's one of those things that's not the greatest crime to ignore, but it's something a good data scientist should be aware of in their data.

Linda: Is that bad?

Host: Well, it's something you should account for. Data is neither bad nor good. It is what it is. As long as it's correct, it is the data. It's our analysis of the data that counts, and if you fail to account for heteroskedasticity, that is bad, because you may arrive at a false conclusion: you're using a method that assumes homogeneity of variance across a data set, and if that assumption fails, then your analysis is on shaky ground. Thanks as always for joining me, Linda.

Linda: Thank you.

Host: For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.
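
The residual check described in the transcript can be sketched in a few lines of Python. This is only an illustration: the ticket data from the chart is not available here, so the numbers below are synthetic, and the curved relationship, the noise level, and the variable names are all assumptions chosen to mimic the pattern discussed (points above the line at both the low and high ends).

```python
import numpy as np

# Synthetic stand-in for the chart; the real ticket data is not available here.
rng = np.random.default_rng(0)
income = np.linspace(20_000, 100_000, 81)        # hypothetical average incomes
x = (income - 60_000) / 40_000                   # rescaled to [-1, 1]
# Assumed curved relationship plus a little noise (made up for illustration).
tickets = 0.3 - 0.1 * x + 0.15 * x**2 + rng.normal(0, 0.01, x.size)

# Ordinary least squares straight line, like a spreadsheet trend line.
slope, intercept = np.polyfit(x, tickets, deg=1)
predicted = slope * x + intercept

# Residual = observed value minus the model's prediction.
residuals = tickets - predicted

# Average residual in the three income ranges mentioned in the episode.
low = residuals[income < 40_000].mean()
mid = residuals[(income >= 40_000) & (income <= 85_000)].mean()
high = residuals[income > 85_000].mean()
print(f"mean residual  low: {low:+.3f}  middle: {mid:+.3f}  high: {high:+.3f}")
```

If the straight line were an adequate model, the three averages would all sit near zero. Here the low and high ranges come out positive and the middle negative, which is the kind of pattern in the residuals that signals the model failed to account for something.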

Original Description

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range. For example, the variance in the length of a cat's tail almost certainly changes (grows) with age. On the other hand, the average amount of chewing gum a person consumes probably has a consistent variance over a wide range of human heights. We also discuss some issues with the visualization shown in the tweet embedded below.
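
As a rough illustration of the cat example, the sketch below simulates measurements whose scatter grows with age and then compares the variance within age groups. Everything numeric here is an assumption made up for the demonstration (the growth curve, the noise model, the age bins); it is not data from the episode.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
age_months = rng.uniform(1, 24, n)               # hypothetical cat ages
mean_length = 20 + 1.5 * age_months              # made-up growth curve (cm)
spread = 0.5 + 0.25 * age_months                 # noise std grows with age
length_cm = mean_length + rng.normal(0, spread)

# Remove the trend first, so only the scatter around it is compared.
slope, intercept = np.polyfit(age_months, length_cm, deg=1)
residuals = length_cm - (slope * age_months + intercept)

# Group the cats into four age bins and compute the variance in each.
bins = [1, 6, 12, 18, 24]
group = np.digitize(age_months, bins[1:-1])      # group index 0..3
variances = [residuals[group == g].var(ddof=1) for g in range(4)]
for lo, hi, v in zip(bins[:-1], bins[1:], variances):
    print(f"{lo:>2}-{hi:<2} months: residual variance {v:7.2f}")
```

With equal variance the four numbers would be roughly the same. Under the assumed noise model they climb steadily from the youngest bin to the oldest, which is what "unequal variance over the range" looks like in practice.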

The episode explains how to recognize and account for heteroskedasticity in linear regression analysis, using traffic ticket data as an example, and highlights the importance of checking for unequal variance in data analysis. It explains how heteroskedasticity can affect model evaluation and analysis of variance, and it walks through practical checks for evaluating model fit.

Key Takeaways

Measure the number of tickets issued in each geographic area
Compare the number of tickets issued to the population of the area
Before adding a trend line, check whether the relationship between income and tickets per person actually looks linear
Check for heteroskedasticity in the data
Evaluate the model fit by looking for patterns in the residuals
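
One way to turn the "check for heteroskedasticity" step into a number is a Breusch-Pagan style test. The episode does not name any specific test, so this is a swapped-in technique: the sketch below implements the studentized (Koenker) variant by hand with NumPy, where the statistic is n times the R² from regressing the squared residuals on the predictor. The function name, the synthetic data, and the 5% threshold are assumptions for illustration.

```python
import numpy as np

def studentized_bp_statistic(x, y):
    """n * R^2 from regressing squared OLS residuals on x (Koenker's variant)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])

    # Step 1: ordinary least squares fit of y on x, then squared residuals.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    squared_resid = (y - design @ beta) ** 2

    # Step 2: auxiliary regression of the squared residuals on x.
    gamma, *_ = np.linalg.lstsq(design, squared_resid, rcond=None)
    fitted = design @ gamma
    total = np.sum((squared_resid - squared_resid.mean()) ** 2)
    explained = np.sum((fitted - squared_resid.mean()) ** 2)
    return x.size * explained / total

CHI2_1DF_5PCT = 3.841  # 95th percentile of chi-square with 1 degree of freedom

# Synthetic data: same trend, one series with constant noise, one without.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 2_000)
y_equal = 1 + 2 * x + rng.normal(0, 1.0, x.size)
y_unequal = 1 + 2 * x + rng.normal(0, 0.2 + 0.3 * x)

for name, y in [("equal variance", y_equal), ("unequal variance", y_unequal)]:
    stat = studentized_bp_statistic(x, y)
    print(f"{name}: LM = {stat:.1f}, exceeds {CHI2_1DF_5PCT}? {stat > CHI2_1DF_5PCT}")
```

Under equal variance the statistic is approximately chi-square distributed with one degree of freedom (one predictor), so values far above 3.841 suggest the residual variance depends on x. A plot of residuals against x remains a useful companion, since the test only looks for a linear relationship between the predictor and the squared residuals.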

💡 Heteroskedasticity generally will not bias an ordinary least squares fit, but it undermines methods that assume equal variance, such as analysis of variance, so it is essential to check for it in data analysis.
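
The transcript mentions a logarithmic transformation as one possible way to smooth out unequal variance. The sketch below shows the idea on synthetic data where the noise multiplies the signal, which is the situation a log transform suits best. The data-generating process and the helper name `spread_ratio` are assumptions made for this illustration.

```python
import numpy as np

def spread_ratio(x, y):
    """Residual std in the upper half of x divided by that in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    upper = x > np.median(x)
    return residuals[upper].std(ddof=1) / residuals[~upper].std(ddof=1)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 2_000)
# Assumed data-generating process: noise multiplies the signal,
# so the scatter in y grows as y itself grows.
y = np.exp(0.3 * x + rng.normal(0, 0.3, x.size))

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log(y))
print(f"spread ratio, raw y:  {ratio_raw:.2f}")
print(f"spread ratio, log(y): {ratio_log:.2f}")
```

A ratio near 1 means the scatter is about the same in both halves of the range. On the raw scale the ratio is well above 1 (partly because a straight line also misfits the curved trend), while after taking logs the relationship is linear with roughly constant scatter. A log transform is not a universal fix: it requires positive values and only helps when the spread scales with the level of the variable.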
