[MINI] Heteroskedasticity

Data Skeptic · Intermediate ·🔍 RAG & Vector Search ·9y ago

Key Takeaways

The episode discusses heteroskedasticity in the context of linear regression, using a chart of traffic tickets issued per person against average household income, and highlights the importance of checking for unequal variance in data analysis.

Full Transcript

[Music]

Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Our topic for today is heteroskedasticity.

How do you pronounce it? Heteroskedasticity?

That's pretty good, actually. Before we get directly into our topic, I just wanted to explore this particular data visualization that I saw online. Listeners can also see it in the show notes for this episode. Maybe you could just describe in your own words what you see.

There is a graph.

What does the x-axis represent?

The average household income, presumably for a particular ZIP code or city that this data set is based upon.

And then what does the y say?

I don't understand that.

That is the average number of tickets issued per person in that geographic area.

Tickets from the police, obviously. So from this chart it looks like the more money you make, the less traffic tickets you have.

Well, there's also one thing that's implying that. What is present in this chart that might make...

Oh, there's a line.

Yeah. So if the line wasn't there, maybe you can block it out in your mind. What do you see then, without the line?

I don't know, I just see a bunch of dots. It's hard. It seems to suggest, though, as you make more money you just get less tickets or something.

Yes, indeed. So there is absolutely a claim here. I would say that the person who created this graph is making the claim that more traffic tickets are issued per person in ZIP codes with lower average income. Now, I have a lot of problems with this graph, and I thought it would be fun to talk them through and look at some of the assumptions. This is charting the household incomes of people against the number of tickets in those geographic areas, so we could be falling into something called the ecological fallacy, but we'll leave that as a subject for another mini episode. That's why I make the point of saying you could live there or you could be driving through there. This is just measuring the amount of tickets issued and then comparing it to the population there, which are not necessarily the same people we're talking about.

When I was younger I didn't make much money.

Mhm.

And I'll admit I got more tickets.

Well, there you go.

I was not as safe a driver, and I made more erratic decisions, and I was not patient.

So you make more money now and you're better at issuing bribes, is that what I'm hearing?

So nowadays I make more money than eight years ago, and I'm definitely a much more cautious driver. Because I've had so many tickets in the past, I've just decided that tickets aren't worth it, so I just don't like to speed. If something doesn't seem safe, I wait. I'm much more patient. And so yes, I make more money, but I think the solution is that I'm just a more experienced driver.

Maybe we would say that higher income correlates with older age, as does responsibility in driving. Could that be the case?

Yeah, I mean, we could all admit there's a high percentage of 16-year-olds that get tickets.

Yeah, and car accidents. So looking at it that way, this relationship seems plausible. Like I said, I have some problems with this plot, and I want to first call out something called Hyman's categorical imperative, which says that before you investigate if a phenomenon is true, you should first establish: is the phenomenon even real? Here's one of the problems I have with this plot. This makes a linear assumption. It assumes that the right model to describe this is that as income goes up, the chances you'll get a ticket go down linearly.

Mhm. I don't think it would be linear, no.

Yeah, so I don't think it'd be linear either, and this is an oversight I see a lot of people make. They'll go into Excel and they'll just add some trend line without establishing if the underlying data appears to be linear or not. Actually, let's just talk more about what the plot shows. So the dots are the actual data. That part is pretty clear, right?

Yes.

And we could say the line is the person's model. That's what they think is the relationship between average household income and the chances of getting a ticket.

I mean, if you think about the average income, there's just more people who make the average income than people who are at the higher tiers anyway.

Yeah, so this is definitely an imbalanced sample. It appears the creator of this plot just did an ordinary least squares regression. That's going to bias towards the most dense areas of data.

Mhm.

It's going to be less biased by the outliers. So let's talk about some of those outliers. What do you notice about the data points of highest income, above about 85,000?

They're above that line.

Yeah, all of them, right? Another way to say that is that this model consistently underpredicts for all high incomes. Do you think that's a property of a good model?

Well, I don't know. You said on previous shows if it's too predictive, it's a bad model.

Yeah, what you're referring to is the risk of overfitting. I definitely would not say this overfits the data. Does it underfit it? Maybe. That would be the claim of assuming it's linear when it's not. But let's also look at the lowest data points. What can we say about all the data points making below 40,000?

Well, they're all above the line too. Almost all of them.

Yeah, all but two. It seems that this fit doesn't adequately capture incomes below 40,000 or incomes above about 85, leading us to believe that maybe a linear fit wasn't the best choice for this.

Mhm.

Now, another thing we do more generally is we look at what are called the residuals. We've talked about that before, but I'm not sure if we talk about it enough. Do you know what residuals are?

No, but I would assume it means left over.

Left over after you subtract: the actual observed data minus the model's prediction. So imagine each of those dots. The model is trying to predict those. Measure the difference between the dot and the line and you get the residuals. That's the amount of error, what the model is not accounting for. So whenever there's a pattern in the residuals, that tells you that your model failed to account for something, because if your model accounted for everything that's predictable, your residuals would look like white noise, just random data. That leads us to this other point. What I wanted to talk about here is heteroskedasticity. So heteroskedasticity is the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. Is that a clear definition for you?

No, I don't know what that means.

Let's say you asked a bunch of people to measure a pencil, and they all had different rulers, right? Everyone has a different tape measure at home. How much do you think their measurements would vary?

Everyone's pencil is a different length if you think of the old-fashioned pencil, but if it's a mechanical pencil they're probably all the same length.

Yeah, so let's say they all have the same mechanical pencil. Although, good point: there's variance due to manufacturing and there's also variance due to the measurement.

Oh, okay. They're probably give or take, you know, like 5 mm. It'd be very small variance.

That's the variance. Now what if, on the other hand, you ask them to identify a neighborhood cat and measure its length from nose to tail?

So it's any neighbor's cat?

Yeah, any neighbor's cat.

Well, that will vary greatly.

Now what do you think if we separated the cats by how old they were? Do you think there's as much variance in kittens as there is in adults?

I don't know, I don't look at kittens' tails.

So I'm going to go out on a limb and guess that with kittens there's less variance, because there's a certain, like, maximum size of a kitten, right? I don't know what the biggest kitten would be, but even humans, like, no human is born 3 feet tall. So since they're smaller, there's less of a chance for variance there, whereas as they get to adulthood you start to see different genetic traits and breed traits, where you can have really big cats like a Maine [Coon] and really small ones and stuff like that. So the variance in size increases as the age of the cat increases, up to adulthood, I presume. That is heteroskedasticity. It's when the variance is conditioned on some other variable, in this case age. Heteroskedasticity is another thing that, surprisingly, you can actually ignore a lot and get pretty far, because it generally won't bias a fit like the ordinary least squares regression we see here. But it does definitely screw up analysis of variance, which we talked about kind of recently. Heteroskedasticity is one of the things you have to check for in your data and sometimes try to smooth out, maybe with a logarithmic transformation or something like that. So truth be told, it's one of those things that's not the greatest crime to ignore, but it's something a good data scientist should be aware of in their data.

Is that bad?

Well, it's something you should account for. Data is neither bad nor good. It is what it is. As long as it's correct, it is the data. It's our analysis of the data that counts, and if you fail to account for heteroskedasticity, that is bad, because you may arrive at a false conclusion: you're using a method that assumes homogeneity of variance across a data set, and if that assumption fails, then your analysis is on shaky ground.

Thanks as always for joining me, Linda.

Thank you.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.
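The cat example from the conversation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cats are simulated, the numbers (ages in months, lengths in centimetres, the noise formula) are made up for the sketch, and `fit_line` is a hypothetical helper written here, not something used in the episode. It fits an ordinary least squares line, computes the residuals as observed minus predicted, and compares how spread out the residuals are for young and adult cats.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: synthetic cats, ages 1 to 24 months. The noise standard
# deviation grows with age, which is what makes the data heteroskedastic.
ages = [rng.uniform(1, 24) for _ in range(2000)]
lengths = [20 + 1.5 * age + rng.gauss(0, 0.5 + 0.4 * age) for age in ages]

slope, intercept = fit_line(ages, lengths)

# Residual = observed value minus the model's prediction.
residuals = [y - (slope * x + intercept) for x, y in zip(ages, lengths)]

# Compare the residual spread for younger and older cats.
young = [r for age, r in zip(ages, residuals) if age < 12]
adult = [r for age, r in zip(ages, residuals) if age >= 12]
spread_young = statistics.pstdev(young)
spread_adult = statistics.pstdev(adult)

print(f"fitted slope: {slope:.2f} (value used to simulate: 1.5)")
print(f"residual spread, under 12 months: {spread_young:.2f}")
print(f"residual spread, 12 months and up: {spread_adult:.2f}")
```

In this simulation the fitted slope stays close to the value used to generate the data while the residual spread is clearly larger for the older cats, which matches the point made in the conversation: unequal variance generally does not bias the fit itself, but it breaks the equal-variance assumption that other analyses rely on.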

Original Description

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range. For example, the variance in the length of a cat's tail almost certainly changes (grows) with age. On the other hand, the average amount of chewing gum a person consumes probably has a consistent variance over a wide range of human heights. We also discuss some issues with the visualization shown in the tweet embedded below.
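One way to write this distinction down, assuming a simple linear model with a single predictor; the symbols below are notation introduced here for illustration and do not appear in the episode:

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i,
  & \mathbb{E}[\varepsilon_i \mid x_i] &= 0 \\
\text{equal variance (homoskedastic):}\quad
  \operatorname{Var}(\varepsilon_i \mid x_i) &= \sigma^2
  & &\text{for every } x_i \\
\text{unequal variance (heteroskedastic):}\quad
  \operatorname{Var}(\varepsilon_i \mid x_i) &= \sigma^2(x_i)
  & &\text{a function of } x_i
\end{align*}
```

In the cat example, x would be age and the variance function would grow with age; in the chewing gum example, x would be height and the variance would stay roughly constant.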


The video teaches how to identify and address heteroskedasticity in linear regression analysis, using traffic ticket data as an example, and highlights the importance of checking for unequal variance in data analysis. It explains that heteroskedasticity generally does not bias an ordinary least squares fit, but that it can undermine analysis of variance and other methods that assume equal variance. The discussion also covers evaluating model fit through residuals.
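The transcript mentions a logarithmic transformation as one possible way to smooth out unequal variance. A minimal sketch of that idea follows, assuming synthetic data with multiplicative noise (the case where a log transform is expected to help); the data-generating formula and the `fit_line` and `spread_ratio` helpers are hypothetical and written only for this illustration.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def spread_ratio(xs, ys):
    """Residual std dev in the upper half of x divided by the lower half."""
    slope, intercept = fit_line(xs, ys)
    res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    cut = statistics.median(xs)
    low = [r for x, r in zip(xs, res) if x < cut]
    high = [r for x, r in zip(xs, res) if x >= cut]
    return statistics.pstdev(high) / statistics.pstdev(low)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumption: synthetic data with multiplicative noise, so the scatter of y
# grows with its level. The straight-line fit on the raw scale is also
# misspecified here, because the raw relationship is exponential.
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [math.exp(1.0 + 0.3 * x + rng.gauss(0, 0.3)) for x in xs]

ratio_raw = spread_ratio(xs, ys)
ratio_log = spread_ratio(xs, [math.log(y) for y in ys])

print(f"spread ratio on the raw scale: {ratio_raw:.2f}")
print(f"spread ratio after taking logs: {ratio_log:.2f}")
```

A ratio near 1 means the residual spread is about the same in both halves of the range. The transform works in this sketch because the noise was built to be multiplicative; with other noise structures a log transform may not help, so the check is worth repeating after any transformation.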

Key Takeaways
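  1. Count the tickets issued in each geographic area
  2. Compare the number of tickets issued to the population of the area, keeping in mind that the people ticketed are not necessarily the people who live there
  3. Add a trend line to model the relationship between income and tickets only after establishing that the underlying data appears linear
  4. Check for heteroskedasticity in the data
  5. Evaluate the model fit using residuals; a pattern in the residuals means the model failed to account for something
💡 Heteroskedasticity generally won't bias an ordinary least squares fit, but it can invalidate methods that assume equal variance, such as analysis of variance, so it's essential to check for it in data analysis.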
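The residual check in the last takeaway can be sketched as follows. The data here are simulated under an assumed curved relationship and are not the figures from the chart discussed in the episode; the income bands (below 40, 40 to 85, above 85, in thousands of dollars) simply mirror the ranges mentioned in the conversation, and `fit_line` and `mean_residual` are hypothetical helpers.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(2)  # fixed seed so the run is repeatable

# Assumption: synthetic areas with average income in thousands of dollars and
# a curved (not linear) relationship between income and tickets per person.
incomes = [rng.uniform(25, 110) for _ in range(1500)]
tickets = [0.05 + 6.0 / inc + rng.gauss(0, 0.01) for inc in incomes]

slope, intercept = fit_line(incomes, tickets)
residuals = [t - (slope * inc + intercept) for inc, t in zip(incomes, tickets)]


def mean_residual(lo, hi):
    """Average residual for areas whose income falls in [lo, hi)."""
    band = [r for inc, r in zip(incomes, residuals) if lo <= inc < hi]
    return statistics.mean(band)


low_end = mean_residual(25, 40)
middle = mean_residual(40, 85)
high_end = mean_residual(85, 111)

print(f"mean residual, income below 40: {low_end:+.4f}")
print(f"mean residual, income 40 to 85: {middle:+.4f}")
print(f"mean residual, income above 85: {high_end:+.4f}")
```

A positive mean residual means the points sit above the fitted line on average, so the line underpredicts in that band. When both ends of the range are underpredicted and the middle is overpredicted, the residuals are not white noise, which suggests a straight line was not the right shape for the data.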
