Linear Regression Plots in R
Key Takeaways
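The video explains the four diagnostic plots that R draws when you plot a linear model: residuals vs fitted values, the normal Q-Q plot, the scale-location plot, and the residuals vs leverage plot. The model is built with the lm function, with readr used to load the data and dplyr used to work with data frames.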
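As a rough sketch of how the four plots named above can be drawn one at a time, the base R `plot()` method for `lm` objects takes a `which` argument. The data below is simulated and is only a stand-in for the video's fish.csv. The column names `Species`, `Width`, and `Height`, the object names `fish_df` and `fish_model`, and every number in the simulation are assumptions made for illustration.

```r
# Simulated stand-in for fish.csv (assumption: NOT the real data set).
set.seed(42)
n <- 150
fish_df <- data.frame(
  Species = factor(sample(c("Bream", "Perch", "Pike"), n, replace = TRUE)),
  Width   = runif(n, min = 1, max = 8)
)
# Height grows with width, and the noise grows with width as well,
# so the simulated data is heteroscedastic on purpose.
fish_df$Height <- 2 + 1.8 * fish_df$Width +
  ifelse(fish_df$Species == "Bream", 3, 0) +
  rnorm(n, sd = 0.3 * fish_df$Width)

# Predict height from width and species.
fish_model <- lm(Height ~ Width + Species, data = fish_df)

# plot() on an lm object draws the diagnostics; `which` picks individual ones.
par(mfrow = c(2, 2))
plot(fish_model, which = 1)  # Residuals vs Fitted
plot(fish_model, which = 2)  # Normal Q-Q
plot(fish_model, which = 3)  # Scale-Location
plot(fish_model, which = 5)  # Residuals vs Leverage
par(mfrow = c(1, 1))
```

Calling `plot(fish_model)` with no `which` argument steps through the same four plots in order.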
Full Transcript
If you've created a linear model in R, you may have seen these four graphs that appear when you plot your model. In this video, we're going to cover exactly what these graphs mean and how you can use them to interpret the validity of your model. I also have a video covering what the summary output of the regression means, and you can watch that by clicking the card in the corner. Anyway, let's get started.

Our data set is the same one from last time. It contains the species, weight, height, and width of different fish, and I'll put a link to it down in the description. We're going to start by loading in our two libraries, readr and dplyr: readr for loading in our data, and dplyr for manipulating data and working with data frames. We're also going to load our fish.csv into a data frame called fish df. We can run our code and take a look at the fish df data frame, and we see that we have the species, weight, height, and width of about 159 different fish.

Now let's go back to our code and build out a linear model. We want our linear model to predict the height of a given fish based on its width and species. We'll name our model fish model and call the lm function. Height is going to be our y variable, we're going to predict it with the width and the species of the fish, and the data we're using is our fish df data frame. Now we can go ahead and create our linear model, and if we want to take a look, we just have to look at the summary to get our output. If any of this is confusing to you, I have a video explaining exactly what this output means in the context of our regression, so I recommend watching that first.

Before we start plotting our model, let's refresh on the assumptions we need to meet in order to use a linear model. The first assumption is linearity: the relationship between our x and y variables should be roughly linear. The second is homoscedasticity: the variance of the residuals should be the same for any value of x. The third is independence: the observations should be independent of each other. The fourth is normality: the residuals should be normally distributed. We'll come back to all of these in a bit, but for now we're going to plot our model and look at the plots, so we can write plot fish model and run that code.

Let's take a look at the first plot, the residuals versus fitted values. A residual is the difference between the observed value and the predicted value. In the context of our data set, it's the difference between the actual height of the fish and our model's prediction of what the fish's height should be. If we were able to perfectly predict the height of every fish, the residual would be zero for every observation. However, most models aren't perfect at predicting, so there's usually going to be a residual that is either positive or negative.

The residuals versus fitted graph can help tell us whether we're using the appropriate type of model for our data set. We're using a linear model, so we want to make sure that there's a linear relationship between our variables. What we're looking for is residuals that are both negative and positive and that are randomly scattered throughout the graph with equal variability. The two things we want to look out for in the graph are heteroscedasticity and a non-linear relationship.

Heteroscedasticity, the opposite of homoscedasticity, essentially means that the errors of our predictions, in this case the predictions of our heights, are not equally variable throughout. Translating that to our graph, we would have heteroscedasticity if the residuals were closer to the zero line for some fitted values but further away for others. In fact, that's exactly what we see here: the residuals are initially close to zero but get further away as our predicted height increases. Conceptually this makes sense, because a tiny fish with a small width will likely have a small height, but as the width gets bigger and bigger, it might be hard to predict whether this is a long thin fish or a short wide fish. Now, just because we have heteroscedasticity doesn't mean that width and species are bad predictors of height. It does mean, though, that we won't be able to predict heights with consistent accuracy, and we should look into adding more variables to our model.

The other thing we want to look for in the plot is a non-linear relationship. In that case, we'd expect to see weird patterns in our plot. For instance, we might find a clump of residuals all below zero for some fitted values and then another clump all above zero for other fitted values. In this plot, even though we have some heteroscedasticity, there's about an equal number of residuals above and below the zero line throughout, so there's likely a linear relationship between the height and the width and species.

But what would this plot look like if we were instead predicting weight based on width, instead of height based on width? My assumption is that weight would increase a lot quicker than width, because the relationship isn't linear, so let's actually create a new model and take a look. We'll create a new model called fish weight, where we're predicting the weight of the fish based on the width of the fish, and again the data we're using is our fish df data frame. We can also call the plot command with our fish weight linear model and run the code.

Now if we look at the residuals versus fitted graph, we definitely see some weirdness in our plot. First of all, the residuals have low variability for the small weights but pretty high variability for the larger weights, so we already know that there's some heteroscedasticity here. Another thing we see is a lot of positive residuals for the lower weights, negative residuals around the middle, and then huge positive outliers towards the large weights. Conceptually, this means that we're predicting very low, actually negative, weights for the smaller fish, predicting weights that are too large for the medium-sized fish, and then we're just really bad at predicting weights for the big fish because there's a lot more variability. So we can definitely assume that the relationship between weight and width here is non-linear, or else the residuals would be more equally above and below the zero line.

Now we're going to move on to our normal Q-Q plot, or quantile-quantile plot, and we're going back to our initial linear model, which was predicting height based on the width and species of our fish. As I mentioned before, one of the assumptions of linear models is that the residuals should be normally distributed. The normal Q-Q plot can show us whether those residuals are normally distributed by comparing them with an actual normal distribution. If we think of the classic bell curve normal distribution where the mean is zero, about 68 percent of the data would be within one standard deviation of zero, and about 95 percent of the data would be within two standard deviations of zero. In fact, if we look at the x-axis, which is labeled theoretical quantiles, we see that there are a lot of data points clustered in the center, with most data being between negative 2 and 2 and just a few extremes on the ends. This is because the data points are actually normally distributed in relation to the x-axis.

Now let's take a look at the y-axis, or the standardized residuals. We're still talking about the same residuals as last time, the observed heights minus the predicted heights, and just like the x-axis, we've standardized all those residuals by centering the mean on zero so that each unit represents one standard deviation. If these residuals were perfectly normally distributed, we'd expect the points to fall in a straight line along that dashed line, since we know that the points are already normally distributed along the x-axis. Instead, what we see are outliers on both sides of our data set. We'd expect 95 percent of the data to lie within two standard deviations of the mean, but we see a lot of data points below the negative 2 line and above the 2 line. This means that our residuals likely have a lot of extreme values: there are a relatively large number of fish that are much shorter or much taller than the model predicts, so it's a little difficult to predict what the heights are going to be just based off of this model.

Our plot also labels some of the data points that are outliers, like number 5 and number 30 up here. What I want to do is create a new table, compare these predicted heights with the actual heights, and see why the numbers are so off. I'll create a new table called fish predictions that uses the fish df data frame, and we're going to add a new column called predictions. We're going to call the predict command and pass in our fish model, and then we're going to tell it that the new data is fish df, which is essentially our original data source. All it's doing is creating a new column with the predictions and appending that to the original fish df data frame. We can go ahead and run this and take a look at fish predictions.

If we look at observation number 5, we're predicting that the height is going to be 14.7, but in reality it's 12.4. It seems like the reason we're predicting such a high height is that the width and the weight are relatively high compared to some of these other fish. Similarly, if we go to observation 30, we're predicting that the height is about 17.1 based on the width and species, but in reality the height is 18.95, so we're massively underpredicting here. That's the reason the residuals are so extreme in this situation.

Now we're going to take a look at our third plot, the scale-location plot, which is also called the spread-location plot. This shows us whether the residuals are spread equally among our predictions, so we can check the assumption of homoscedasticity, or equal variance of the residuals. We've got the fitted values on the x-axis, similar to our first plot, but the square root of the absolute value of the standardized residuals on the y-axis. If we want equal variance in our residuals, we'd want the dots to be pretty evenly scattered throughout the whole graph and show no pattern, and we'd want our red line to be relatively horizontal. We already determined from our first plot that the model violates the assumption of homoscedasticity, since the variance of the residuals is larger for larger fish, and we can pretty clearly see that the y values tend to increase as the fitted values become larger. The red line reflects that with this slight upward trend.

As you've probably figured out by now, there's definitely some overlapping information between the plots, so you may not check all four plots when analyzing your model, but it certainly helps to know what each plot is trying to show.

The last plot we're going to look at is the residuals versus leverage plot. This plot helps us find influential data points, if any. Because our linear regression works by minimizing the total error over all observations, data points that lie way outside the rest of our data could have a pretty big impact on our model. Another way to put it is that the data point may not follow the general trend, and if we include it, our model might drastically change just to minimize that one data point's residual.

Looking at this plot, we have leverage on the x-axis and standardized residuals on the y-axis. Leverage essentially means how much influence the data point has on the model. Instead of trying to find a pattern, we're now just looking for values in the upper right or lower right corners. These would represent points that have a lot of leverage, so they are very influential on our model, but also have large residuals, so they're pretty far off from our estimate. Specifically, we want to see if there are any points that lie outside the dashed red Cook's distance line, meaning they have a high Cook's distance score. If we had data points like this and we removed them from our model, it would likely have a big impact on the coefficients and the intercept of the model.

In this case, we don't really have any influential outliers that we need to remove. But what I'm going to do is manually modify our fish df data frame and show you what it looks like when we have an outlier that we probably should remove from our model. I'm going to set our first observation's height to 100, recreate our fish linear model, and then plot the model. I can run all this code, and now we see this data point, observation 1, that lies outside our Cook's distance line, so we might want to remove it from our model if we were to see this in our data.

So now we have a pretty good understanding of the four plots that are part of the linear regression, along with the summary output. I'm going to add some additional resources in the description if you're interested in learning more. Thanks for watching the video, and I'll catch you in the next one.
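The outlier experiment at the end of the transcript can also be checked numerically rather than only by eye. The following is a minimal sketch on simulated data, not the video's fish.csv. The names `fish_df`, `fish_model`, `fish_model_outlier`, and `influence_df`, the simplified formula `Height ~ Width`, and all simulated values are assumptions for illustration. The functions `hatvalues()`, `rstandard()`, and `cooks.distance()` are base R (stats package) and return the quantities behind the residuals vs leverage plot.

```r
# Simulated stand-in for the fish data (assumption: NOT the real fish.csv).
set.seed(1)
n <- 100
fish_df <- data.frame(Width = runif(n, min = 1, max = 8))
fish_df$Height <- 2 + 1.8 * fish_df$Width + rnorm(n, sd = 1)

# Baseline model and its largest Cook's distance.
fish_model <- lm(Height ~ Width, data = fish_df)
max_before <- max(cooks.distance(fish_model))

# Mimic the video: overwrite the first observation's height with 100, then refit.
fish_df$Height[1] <- 100
fish_model_outlier <- lm(Height ~ Width, data = fish_df)

# The numbers behind the Residuals vs Leverage plot.
influence_df <- data.frame(
  leverage     = hatvalues(fish_model_outlier),
  std_residual = rstandard(fish_model_outlier),
  cooks_d      = cooks.distance(fish_model_outlier)
)

# Row index of the most influential observation.
worst <- which.max(influence_df$cooks_d)
print(influence_df[worst, ])
print(c(before = max_before, after = influence_df$cooks_d[worst]))

# The same information as a plot.
plot(fish_model_outlier, which = 5)
```

Sorting `influence_df` by `cooks_d` gives a quick shortlist of observations to inspect before deciding whether any of them should be removed.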
Original Description
Linear Regression Plots in R Explained
When plotting your linear regression model, you'll see the following 4 graphs:
- Residuals vs Fitted Values
- Normal Q-Q (Quantile-Quantile) Plot
- Scale-Location / Spread-Location Plot
- Residuals vs Leverage Plot
We'll cover what each of these graphs means and how you can use them to interpret the validity of your linear regression model.
Timeline:
0:00 Intro
2:09 Residuals vs Fitted Values
5:45 Normal Q-Q (Quantile-Quantile) Plot
8:58 Scale-Location / Spread-Location Plot
10:03 Residuals vs Leverage Plot
Dataset: https://www.kaggle.com/aungpyaeap/fish-market/version/2
Part 1 (Regression Summary): https://www.youtube.com/watch?v=7WPfuHLCn_k
Additional info: https://data.library.virginia.edu/diagnostic-plots/
Channel: Dataslice