Linear Regression in R - Full Project for Beginners

Alejandro AO · Beginner ·📄 Research Papers Explained ·3y ago

Key Takeaways

This video demonstrates linear regression in R through a full project for beginners, covering data exploration, fitting a predictive model, and model evaluation with R, ggplot2, and the lm function.

Full Transcript

Good morning, everyone. How is it going today? My name is Alejandro, and welcome to my channel, where we talk about programming and coding in general. Today we're going through a data science project in which we will see a linear regression and how to apply it in R. To follow this tutorial, it's recommended that you already know a little bit about what linear regression is and how it works, although I will explain it quickly at the beginning. If you want to deepen your knowledge of linear regression, or just get a quick introduction before watching this video, I recommend that you check out the article I published on Medium, titled "A Simple Explanation of Linear Regression". At the end of the article you will find this video, which shows how these concepts are applied in R. So, without further ado, let's get right into the project and create our linear regression model. [Music]

So let's get started. We're going to be doing this in RStudio to manage our project, and we're going to be using this data set right here. I will paste the link in the description so that you can download it and follow along. It's an e-commerce customer data set. You just click on download, and once it's downloaded you unzip it and add it to your project. How do you do that? Here I have the folder of my project, and I created a data folder. Inside the data folder I'm going to add the unzipped file, and I'm going to rename it "e-commerce users". So now, as you can see, we have e-commerce users inside our data directory, and this is the file we're going to be using. So, first of all, we're gonna have to
create a new R script. I'm going to call it main.R, inside our regression tutorial project, and now we can start analyzing our data. (Sorry about that, I'm just going to set this to silent mode.)

The first thing we have to do is import our data, so I'll add a section called "Import data and setup", with some dividers around it. To import the data we use the read.csv command and tell it where our data is located. Since we're working inside a project, it's important to use a relative path, meaning a path relative to the current position of my file, main.R. So I write data, to enter the data folder, and inside data we open the e-commerce users file. I don't know why this file doesn't have a .csv extension, but it is a CSV file, so it should be all right. I'm going to run that, and let's view the data.

This is how the data looks. We have an email column, an address column, and an avatar column, which is probably the name of the user. We have the average session length, and the time on app, because these are users spending time on an e-commerce application and website. We also have the time they spent on the website, the length of their membership on this e-commerce website, and the yearly amount spent on the platform. The idea in this project is to predict, based on variables like the time they spend on the website, the time they spend on the application, and the average session length, how much they are going to spend on the platform.
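As a sketch of the import step described above (not the exact code from the video): the file name ecommerce-users and the helper load_users are assumptions made for illustration. read.csv parses a comma-separated file whatever its extension, and by default it rewrites column names that contain spaces into names with dots.

```r
# Import data and setup ----
# Hypothetical helper: reads the customer file from a path relative to the
# project folder. The file does not need a .csv extension to be parsed as CSV.
load_users <- function(path = file.path("data", "ecommerce-users")) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (adjust it to your own file name)")
  }
  read.csv(path)
}

# Demo with a tiny stand-in file (made-up rows, only to show the call)
demo_path <- tempfile()  # note: no file extension
writeLines(c("Email,Time on App,Yearly Amount Spent",
             "a@example.com,12.5,580.10",
             "b@example.com,11.2,390.75"), demo_path)
demo <- load_users(demo_path)
str(demo)     # "Time on App" has become Time.on.App
# View(demo)  # opens the spreadsheet viewer when run inside RStudio
```

With the real file in place, calling load_users() with no argument would read it from the data folder, provided the file name matches the assumed one.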
Okay, and to do that we're going to be using a linear regression. Not "of course", that's just what the video is about; we could use another model, but I'm going to show you how to do this with linear regression. So let's continue getting an idea of our data. To do that, we're going to check the structure of the data, and I'm going to move myself over there so that you can see what's going on.

Here we have 500 observations, which means 500 rows; each row is called an observation. The first variable is the email, and it's a character variable, which means each value is a string. Then we have the address, which is also a string, and the avatar, which is also a string. We have the average session length, which is a number, in minutes; the time on app, which is time spent daily, I suppose; the time on the website; and the length of the membership, in months if I'm not mistaken. Let's check that out. Do we have an explanation for that? All right, let's just suppose that it's in hours. "E-commerce linear regression"... this is a very popular data set anyway, so you should be able to find it.

All right, so: this data set has data on customers who buy clothes online. The store offers in-store style and clothing advice sessions. Customers come to the store, have sessions or meetings with a personal stylist, and then go home and order the clothes they want, either on the mobile app or on the website. So the session length actually means the length of the meeting with a personal stylist. It's important that you understand what each variable means so that you can give a valuable insight with this data. And we're going to predict how much
money they are going to spend on that order, and we're going to try to predict it with these variables. I don't have the actual units, but, as you can see from the data, the time on app is probably in minutes. The average session length is also in minutes, because they went physically to this place and spent some time there: 34, 31 minutes. Then we have the time they spend on the website, and the length of the membership, in months. And the yearly amount spent is in dollars.

Now that we have seen the structure of the data, we can check the summary of the data. We have the email, address, and avatar, which are characters. The minimum session length is 21 and the maximum is 36, so sessions are around 30 minutes for each client. For time on the application, the shortest was 8 minutes and the longest 15 minutes. They do tend to spend a bit more time on the website, apparently, between 33 and 40 minutes. For length of membership, some have had the membership for 0.2 months and others for nearly seven months. And then there's the yearly amount spent.

Now that we have more of an idea of how our data looks, we can start plotting the data to look for correlations and get a feel for how the variables are related. So we're going to add a new section called "Create plots and search for insights". For the plots we're going to be using the base R plotting system, and we can also use ggplot, so I'm going to add library(ggplot2) here. If you don't have ggplot2, you're going to want to install it, and to install it you just run install.packages with ggplot2. Okay.
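The structure and summary checks described in this part can be sketched as follows. The data frame here is a synthetic stand-in: the column names are assumptions based on how read.csv would typically rename the columns mentioned in the video, and the simulated values are invented, not the real data.

```r
# Synthetic stand-in for the customer data (invented values, not the real file)
set.seed(1)
n <- 500
data <- data.frame(
  Email                = paste0("user", seq_len(n), "@example.com"),
  Avg..Session.Length  = rnorm(n, mean = 33, sd = 1),
  Time.on.App          = rnorm(n, mean = 12, sd = 1),
  Time.on.Website      = rnorm(n, mean = 37, sd = 1),
  Length.of.Membership = rnorm(n, mean = 3.5, sd = 1),
  stringsAsFactors     = FALSE
)
data$Yearly.Amount.Spent <- 270 + 64 * data$Length.of.Membership +
  rnorm(n, mean = 0, sd = 45)

str(data)      # one line per column: its type and first few values
summary(data)  # min, quartiles, mean and max for each numeric column

# Create plots and search for insights ----
# Load ggplot2 if it is available; otherwise say how to install it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  message('ggplot2 is not installed; run install.packages("ggplot2")')
}
```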
Right, so first of all, let's ask ourselves: is there a correlation between the time spent on the website and the yearly amount spent by each user? We can check that with ggplot. Let's add a comment: "correlation between time on website and yearly amount spent". To make a plot with ggplot, we pass in the data as the first argument, then we say that the x-axis is going to be the time on website and the y-axis is going to be the yearly amount spent; I'm going to copy and paste the column names so that I don't make mistakes. Let's make it a scatter plot, so we add geom_point, and let's add a bit of color: orange. Then a quick title so that we know what this plot represents: "Time on website against yearly amount spent". Let's add some axis labels as well. To add labels you call xlab with the name of your label, so the x label is going to be "Time on website" and the y label is going to be "Yearly amount spent".

Let's run that. We have an error, apparently: "could not find function ggplot", because I did not run my library call. If I run it, I still have an error, because ylab is not supposed to have an uppercase letter. There you go, here we have the scatter plot. As you can see, there doesn't seem to be much of a correlation between the time they spent on the website and the yearly amount the client spends.
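A minimal sketch of the scatter plot built in this step, assuming ggplot2 is installed. The data frame is a synthetic stand-in with assumed column names and invented values, so the shape of the cloud says nothing about the real data; only the calls mirror the description.

```r
library(ggplot2)

# Synthetic stand-in with assumed column names (invented values)
set.seed(1)
data <- data.frame(
  Time.on.Website     = rnorm(500, mean = 37, sd = 1),
  Yearly.Amount.Spent = rnorm(500, mean = 500, sd = 80)
)

# Correlation between time on website and yearly amount spent
p <- ggplot(data, aes(x = Time.on.Website, y = Yearly.Amount.Spent)) +
  geom_point(color = "orange") +
  ggtitle("Time on website against yearly amount spent") +
  xlab("Time on website") +
  ylab("Yearly amount spent")  # xlab() and ylab() are written in lowercase
print(p)
```

Storing the plot in a variable (here p, a name chosen for this sketch) makes it easy to reuse the same base for other geoms.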
This looks very scattered and doesn't look very relevant to our study, and that's already an insight. Let's check for other correlations. Let's see the correlation between the average session length, for when the customer went to the store to try on the clothes and take measurements, and the yearly amount spent. I'm going to copy the previous ggplot call and paste it down here. It uses the same data, but instead of x being time on website it's going to be the average session length, and y is still the yearly amount spent. The title is going to be "Session length against yearly amount spent", the x label is "Session length", and the y label is still "Yearly amount spent". I'm going to run this and, as you can see, we have a little bit more of a correlation. It's very small, but you can see that with longer sessions they tend to spend a little bit more. So it's definitely different from the plot for time on website; this one does look a bit more correlated, and that's already a pretty good insight.

What else can we do? How about a pair plot of all our continuous variables? This is important, and it's just the basics of data science and data visualization: if you're going to use a scatter plot, both variables have to be continuous. They have to be numeric, not categorical. Since the session length and the yearly amount spent are both numbers, you can plot one against the other. If one of them were categorical, you would just see columns of points, which is sometimes useful, but it's not a real scatter plot. So what we're going to do
right now is pair plot all the continuous variables together to look for more correlations. The function for this is pairs. We pass in the data, but instead of passing all of the data we subset some columns, by saying which columns we want. We pass in the average session length, which of course is continuous; time on app, because it's continuous too; time on the website; length of membership; and, last but not least, the yearly amount spent, because that's the response variable we're trying to predict. Now that we have passed in the data we are going to analyze, let's add a few more parameters. First the color, and I'm going to keep it orange because it looks right. And pch, let's set that to 16.
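The pairs call described so far could look like the sketch below. The data is a synthetic stand-in with assumed column names and invented values. The correlation matrix at the end is an addition that is not in the video; it is one way to put numbers on what the panels show.

```r
# Synthetic stand-in with assumed column names (invented values)
set.seed(1)
n <- 500
data <- data.frame(
  Email                = paste0("user", seq_len(n), "@example.com"),
  Avg..Session.Length  = rnorm(n, mean = 33, sd = 1),
  Time.on.App          = rnorm(n, mean = 12, sd = 1),
  Time.on.Website      = rnorm(n, mean = 37, sd = 1),
  Length.of.Membership = rnorm(n, mean = 3.5, sd = 1)
)
data$Yearly.Amount.Spent <- 270 + 64 * data$Length.of.Membership +
  rnorm(n, mean = 0, sd = 45)

# Keep only the continuous columns, with the response variable last
continuous <- data[, c("Avg..Session.Length", "Time.on.App", "Time.on.Website",
                       "Length.of.Membership", "Yearly.Amount.Spent")]

# Scatter plot matrix: orange filled circles (pch = 16)
pairs(continuous, col = "orange", pch = 16)

# Not in the video: pairwise correlations as a numeric complement
round(cor(continuous), 2)
```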
And the labels... you know what, we can do without the labels for now. But if you wanted to add labels, you can add a vector here with the same columns you were adding up here, so that instead of showing the variable name, for example for average session length, you can give each one a more understandable name without all the dots in it. That's just to show you how it works; I'm not going to do it here. I am going to add a title, though: main is going to be "Pair plot of all continuous variables". Now if we run this, we have a pair plot, and I'm going to zoom in a little to show you what is going on.

Here we have pretty much the same as we were doing before, but for all of the continuous variables, so you can see more clearly how this data set behaves. First of all, we put our response variable at the end, which is pretty convenient because we can see more clearly which variables are correlated with it. The way this pair plot works is that, for each panel, the variable in the same column is the x-axis and the variable in the same row is the y-axis. So in this panel, x is length of membership and y is yearly amount spent. Length of membership is actually the most correlated one: as the length of membership increases, the yearly amount spent increases as well. That's kind of logical, I guess, because the more time you have spent being a member of this
store, of this community, the more time you have had to buy things. Then we have the time spent on the website, which doesn't really seem very correlated. Here we have the time on the app, and this one does look a little bit correlated with the yearly amount spent. And, last but not least, we have the average session length, which looks a little bit correlated too. So now we have more of an idea of what our data set looks like and of the relationships between the different variables, which is pretty important.

Now let me add a new section right here: we're going to explore the selected variables, to see what they look like by themselves, without looking at the correlations. This can be useful if you're going to be using different models; some models do require normality and some don't. The condition for fitting a linear regression is that the data has a linear relationship; that's an assumption. But right now let's just explore the selected variables. As an exercise, let's see if our variables are normally distributed: "Is the variable normally distributed?" To do this, what I would recommend is to use the basic histogram from R: hist, then data and length of membership. This shows you a histogram of your length of membership data. You can use a histogram because your variable is continuous; if it
wasn't continuous you would use a bar plot instead. So this is the histogram. It does look fairly normally distributed, so that's looking pretty good. Let me show you quickly how to do that in ggplot. Same as before (I guess I could have stored this in a variable, but I'm just going to copy it), you call ggplot with the data, but now we're not using the time on the website, we're using the length of membership. We're not going to have a y variable, because it's a histogram of one single variable, and we add geom_histogram. The idea of using ggplot is that you can customize it a bit more, so I'm going to add a color, let's say white, and a fill of orange to keep it consistent. You can also set a bin width (you can do this with the base R function as well). In the base histogram our bin width is 0.5; if we don't set it in ggplot, I'm not sure what it defaults to. It wasn't 0.5, it was 0.25, apparently... no, I don't know what the bin size is here, but you can customize it: it's binwidth, and let's set it to 0.5. If we run that, we have a histogram similar to the one from the base R function. The base function is a bit quicker, but it's less customizable. So it does seem to be normally distributed.

Now, quickly, we're going to plot this same variable, because it's the one we're choosing for our linear regression model, with a box plot this time. So instead of using the histogram function
we're going to use the boxplot function, the base function from R, and pass in our variable, length of membership. Let's run it, and we have our box plot. We can see it does look normally distributed, and we do have some outliers, but they're not very far away. Let's do the same thing with ggplot. It's pretty much the same as we did up here, but instead of adding a histogram we add geom_boxplot, with a fill color of orange to keep things consistent. As you can see, we have our box plot now. It's horizontal because I passed the variable in on the x-axis; if I pass it on the y-axis, it takes the form of the previous plot I showed you. I'm going to leave it as a horizontal one.

So now that we have chosen the variable we're going to use for our linear regression model, let's actually fit the model to our predictor and our response variable. Let me add a new section right here, called "Fitting a linear model". First of all, I'm going to attach the data. This means that whenever I run a function I just have to type the names of the variables, without mentioning that they come from this data; it just saves me time. I'm going to create this linear model fit and store it in a variable called lm.fit1. The function to create a linear model is lm, for "linear model". The first thing you write in it is the response variable that you expect to get, so for
example we expect to get the yearly amount spent. After that you write a tilde, and then the variable or variables that are the predictors. We have chosen length of membership as our predictor, so I pass it in like that. Now let's run that: I run attach(data), then I run lm.fit1. You can see lm.fit1 up here in the environment; it's a list of 12 elements. Now we can start looking at our linear model and try to get some insights from it. Let's check the summary: we just wrap the variable we stored the fit in with summary and run it. I'm going to move myself over here.

Here we can see the function that we called: this one is the response and this one is the predictor. We have the residuals as well; we're going to analyze them in a moment, but let's focus on something else first. Remember what we're trying to find here: this is a linear regression with just one predictor. The one predictor is the x, and we have the coefficient of x, which is the weight of our variable, and also the intercept with y. The intercept is just the value of y when x equals zero. Here our intercept is 272 dollars, because of course the response, the yearly amount spent, is in dollars. And we also have the length of membership: the coefficient of this one is 64.
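The fitting step above can be sketched as follows, on a synthetic stand-in: the column names are assumptions, and the data is simulated with an intercept near 270 and a slope near 64 purely so that the output resembles the numbers quoted in the video. The sketch passes data = data to lm instead of calling attach, which is a substitution for the video's approach, not what the speaker typed.

```r
# Synthetic stand-in (invented values; 270 and 64 are illustrative only)
set.seed(1)
n <- 500
data <- data.frame(Length.of.Membership = rnorm(n, mean = 3.5, sd = 1))
data$Yearly.Amount.Spent <- 270 + 64 * data$Length.of.Membership +
  rnorm(n, mean = 0, sd = 45)

# Fitting a linear model ----
# response ~ predictor; data = tells lm where to look up the names
lm.fit1 <- lm(Yearly.Amount.Spent ~ Length.of.Membership, data = data)
summary(lm.fit1)  # coefficients, standard errors, t values, p-values, R-squared

# Intercept and slope of the fitted line
b <- coef(lm.fit1)
print(b)

# The intercept is the predicted response when the predictor equals zero
at_zero <- predict(lm.fit1, newdata = data.frame(Length.of.Membership = 0))
print(at_zero)
```

Passing the data frame explicitly keeps it unambiguous which data set the variable names refer to, at the cost of a little more typing than attach.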
That means... wait one second... yeah, it's 64. It's positive, so there's a positive relationship with the length of membership. We also have some more values here. We have the standard error, which is a pretty important measurement, the t value for the Student's t-test, and the significance, with the p-value. As you might remember from your statistics classes, if the p-value is lower than 0.05 we reject the null hypothesis, which means this variable is significant. So the length of membership is significant in our linear model, and this seems to be working correctly. The three stars mean that it's very significant. Here we have the significance codes: three stars means very significant, and it goes all the way down to zero stars, which is not significant. We also have the residual standard error, the multiple R-squared, and the F-statistic; this is how you measure how well your model fits.

So that's how we fit our model. But how about we plot it? Let me show you how it looks. I'm going to plot it quickly: the yearly amount spent, as a function of the length of membership. (It's not plt, it's plot.) This is pretty much the same graph we had a moment ago, nothing new, but I'm going to add the regression line that we have just created. To do that I use the function abline and pass in my linear model, lm.fit1, the one I stored up here, and I'm going to color it red. So here you have our regression line. As you can see, we have our intercept at
(where was it?) 272, so here is our intercept, and the coefficient of our length of membership is 64, and that basically describes this line. So that is how you do a linear regression for one single variable.

Something you're going to want to do now is analyze the residuals, because linear regression assumes that the residuals are normally distributed. If they are not, there's probably a problem with your model. What are the residuals? You can see them as the distances between the points and the regression line. There are many different ways to test whether the distribution of the residuals is normal, and I'm going to show you two: one is a Q-Q plot and the other is the Shapiro-Wilk test. Let's add a new section called "Residuals analysis". Remember, this is pretty important for your linear regression analysis; you're always going to want to check the residuals.

The first thing I'm going to do is create a Q-Q plot. If you're not familiar with what a Q-Q plot is, I encourage you to check that out, but I'll explain it quickly anyway. You pass in the residuals here, using the residuals function, and to that you pass the model that you just trained, which is lm.fit1. That creates a normal Q-Q plot for us. What does that mean? Basically, a Q-Q plot takes the distribution of your data and divides it into quantiles. And then you also
have a theoretical distribution, which is the normal distribution, and you divide it into the same number of quantiles. Then you put the quantiles of the normal distribution on one axis and the quantiles of your sample distribution on the other, and you plot each normal quantile against the corresponding sample quantile. If all of them line up one to one, that means your distribution is normal. Another way of looking at it is to plot a histogram: hist of the residuals of lm.fit1. You can see that it looks pretty normally distributed, but this is just a graphical way of seeing it, which is why we also do the Q-Q plot, to be more sure of it. On top of that we can add a Q-Q line with qqline: we pass in the residuals of our lm.fit1, and I'm going to color it red. For a normal distribution, the more your points are aligned with this line, the more normal your distribution is. This one looks pretty normal, and that's the graphical way of testing whether your residuals are normally distributed.

If you want to be even more sure, you can add another test. Normality analysis is a topic in and of itself, and it could take a long time to go through: there are numerous different kinds of tests, and not all of them work with all data sets. But I mean I'm
just showing you a couple here. You can also use the Shapiro-Wilk test, which also tells you whether your distribution is normal. We pass in the residuals of our linear model, lm.fit1, and run it. There's a problem, because I typed shapiri instead of shapiro. We run it again, and there we have it. The Shapiro-Wilk normality test assumes that the distribution is normal; that's H0. If the p-value is lower than 0.05, we reject H0, so we reject normality. Here the p-value is over 0.05, so we cannot reject H0, and we keep treating the residuals as normally distributed. So that was the normality analysis of the residuals. They look pretty normal, which means our model is looking pretty good.

What are we going to do now? We're going to evaluate the quality of the model. So far we have trained it using the entirety of our data, but now we're going to divide our data into a training set and a testing set. Of the 500 observations that we have, we're going to take some random observations and train our linear model with those, and then we're going to test this model by predicting the values for data that the model has not seen before, which is our test data. Then we'll see how well we did. I'm going to add a new section called "Evaluation of the model". Let's set the seed to 1, and let's generate the row numbers: we create a sample from 1 to nrow, the number of rows in data,
and 0.8 times the number of rows in the data. What are we doing here? We're taking a sample from a range of 1 to the number of rows in data, which is 500, so we're sampling from 1 to 500. And we're taking 0.8, that is, 80% of that data. This sample is just the list of rows we're going to use for training, and it's random. Let's run it: it's a vector of 400 elements, and it's random, which is what I was telling you about.

Now we're going to create our train variable with it, by subsetting our data with this sample. (Oh, sorry, I missed a comma right here.) So from the random sample that we created, we subset our data, and now we have 400 random rows from data in our train data set. Then we do pretty much the same for the test data set, which we call test. We subset data again, but here we add a negative sign before the variable, so we take exactly the opposite rows. Let's run that, and you can see that our test set is 20% of the data and our training set is 80%. So we have our training set and our test set.

What we're going to do now is estimate the linear fit with the training set. It's pretty much the same thing we did before, but instead of using the entire set we use only the training set. So we're going to
do lm.fit. Before, we just called it one, but since we are now using only 80% of the data I'm going to call it 0.8. We write it just as we did before: lm, then yearly amount spent, which is our response, and our predictor is length of membership. I don't think I have the data attached... actually, I think it is still attached, so this should work. Oh no, that's actually not going to work, because the one that I have attached is the full data, and I don't want to work with the entire data, I just want to work with the train data set. So I point the model at the train data set and rerun this, and now lm.fit 0.8 is estimating yearly amount spent from length of membership with the training data set, so it is using only 400 observations to train this model.

Let's take a look at the summary: summary of lm.fit 0.8. There you go. We have a pretty similar result to the one that we got using the entire data set, which makes sense because we are using pretty much the same kind of data. What we are going to do now is take the test data set, take every single value of the length of membership, try to predict the yearly amount spent, and then find how far off our prediction was.

So let's predict in the test data set. We take the model that we have created and run the prediction on our test data set. It's actually pretty simple: we create a new variable, let's call it prediction 0.8 because we're predicting with the model trained on 80% of the data, and we use the function called
predict. This function takes two arguments. The first one is your model, and the model is the one that we have just trained, so it's lm.fit 0.8; remember, we created it up here. The second argument is the new data that we are going to use to actually test it, and the new data is the test data, of course: the remaining 20% of the data that our model has not yet seen. I'm going to run it like that. There you have our prediction 0.8: it's just a list of 100 elements.

Let me show you what it actually is. It takes the variable length of membership from our test data set and, without looking at the actual yearly amount spent, tries to predict what the yearly amount spent was for that observation. It does that for every single observation, which is why we have a list of 100 values here: they are the 100 predictions for the 100 observations in the test data set.

What we do now is calculate the difference between what was predicted and what the actual value was. I'm going to store that in a new variable called error 0.8. It's basically just a difference: prediction 0.8 minus the actual value. Since we have a list, it calculates the difference for every element in that list, and the actual value is the test yearly amount spent. Now we have our variable called error 0.8, and it's just a list of 100 elements with the difference between the prediction and the actual value. There
you go. Now we can calculate some error measurements. I'm not going to go over what these actually are, but you can look for an explanation of them; they are used to check how well your model is behaving and how well it is predicting its results. First I'm going to show you how to calculate the root mean square error. It's pretty simple. I'm going to call it rmse and save it. It is defined as the square root of the mean of your error squared. There you go, now we have our root mean square error defined.

Let's also create a mean absolute percentage error, the MAPE. This one is defined as the mean of the absolute value of error 0.8 over the test yearly amount spent. So now we have our MAPE and our RMSE. Let's take a look at those and at what this looks like: RMSE equals rmse, MAPE equals mape. Let's also take R squared, which is going to be equal to summary of lm.fit 0.8, R squared. I'm pretty sure this is supposed to work... R2, mode... what's going on here? Summary of lm.fit 0.8, R squared. There is a problem with this one, apparently. What happened here? Just let me show you what's going on.

So we have our root mean square error, which is 44, and we have our mean absolute percentage error, which, since it's expressed as a fraction rather than a percent, is usually between 0 and 1. So this is pretty much how to evaluate your model. There are many different methods, but those are just a few. R2, I don't know what's going on with R2: length, object,
class, and R2 mode is NULL. Let's just check this out. Is there a problem with this function right here? All right, so there seems to be a problem with lm.fit 0.8 R squared. Multiple R squared... so we actually do have the values; I don't know why it's not working there. All right, well, the problem is here. Sure, because I added this one right here: it was not supposed to be here, it was supposed to be here. There you go, so now it should work. Now you have R2, so we have the three measurements in here. Sorry about that. This one right here is the one that is actually creating the summary, and with this one right here we are just subsetting R squared.

So that was how to evaluate the model. But as you can see, right now the model is all right, but it's not amazing. What I'm going to show you now is how to do pretty much the same thing but with multiple regression. What we did so far is a regular linear regression with just one variable, but in the real world there is not just one variable that influences your result. So in the real world you are going to be using multiple regression even for the simplest cases, because the world is not very simple, right?

Let's attach our data again. Then we create a new model and we call it lm.fit. In this one, again, we start with yearly amount spent; that is the first part again, the response that we expect. Remember that the second part of this formula, after the tilde, is the predictors. Before, we added just one single variable, length of membership. In this one, what we're going to do is we're
going to add all of the other numerical variables. So we add average session length, then plus (I'm just going to add them in the same line) time on app, then time on website as well, and then length of membership. What this one is doing is training a linear model, but it is going to be a multi-dimensional linear model. So our function is going to look more like this one right here: we will not be working with a single-variable linear model but with a several-variable linear model, and we are going to try to find which variables are actually relevant to our response.

With this simple command we have already created our linear model for our multiple regression, so first of all let's take a look at its summary: summary of lm.fit. There you go. Let's see what this looks like. We have the formula that I showed you, with the expected response and the predictors over here, and here we have the residuals as well. You are going to want to do a residual analysis of this one too, just as we did before. But as you can see, instead of the single coefficient that we had in the previous linear model, here we have several predictors and several coefficients. For example, average session length has a coefficient of 25, so it seems to be quite important. We have time on app with 38, and length of membership with 61.
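The multiple regression described above can be sketched in a few lines of R. This is a minimal illustration, not the tutorial's own script: the data frame `customers`, its column names and the simulated effects (25, 38 and 61, borrowed from the coefficients read out above) are all assumptions, chosen so that the block runs without the downloaded CSV.

```r
# Synthetic stand-in for the e-commerce data. Column names and values are
# assumptions made for illustration, not the tutorial's real data.
set.seed(1)
n <- 500
customers <- data.frame(
  session_length    = rnorm(n, mean = 33, sd = 1),
  time_on_app       = rnorm(n, mean = 12, sd = 1),
  time_on_website   = rnorm(n, mean = 37, sd = 1),
  membership_length = rnorm(n, mean = 3.5, sd = 1)
)

# Response built from three predictors plus noise.
# time_on_website deliberately has no effect on it.
customers$yearly_spent <- -1000 +
  25 * customers$session_length +
  38 * customers$time_on_app +
  61 * customers$membership_length +
  rnorm(n, sd = 10)

# Response on the left of the tilde, predictors joined with "+" on the right.
multi_fit <- lm(
  yearly_spent ~ session_length + time_on_app + time_on_website + membership_length,
  data = customers
)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- coef(summary(multi_fit))
print(round(coef_table, 4))

# Predictors (intercept excluded) whose p-value is below the usual 0.05 threshold.
p_values <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

One caveat when reading such a table: the size of a coefficient depends on the units of its predictor, so a larger estimate does not by itself mean a more important variable. The t value and p-value columns are usually a better guide to which predictors matter.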
This one seems to be the most relevant one, apparently. But we also have more information right here: we have the standard error, we have the t value and, importantly, we have the p-value. What does the p-value tell us? Just as we were mentioning before, if it is lower than 0.05 we reject the null hypothesis and we keep the variable as significant. The stars, as I was mentioning before, indicate the significance of the variable: three stars is very significant, zero stars is not significant. As we can see from this multivariable linear model, time on website is not a very significant variable. Even though we included it in our linear model, we can drop it and it is not going to harm our model. So three of the four variables studied seem to have a positive impact on the response variable; positive because all of these coefficients are positive. If one of the coefficients were negative, that would not mean that the variable is not relevant; it would mean that it has an inverse relationship with our response variable. And then of course there is time on website, which we are going to be able to drop.

Let's do pretty much the same thing as we did up here. I'm just going to copy and paste the model evaluation part to go pretty fast, and let's evaluate this multiple regression. As you will see, this one is actually way more accurate. So again we set the seed to one, and we work with the same train and test data sets. We are not going to recreate our split, but what we are going to do is re-create our multivariable linear model for only the training
set. So I'm going to copy this line right here and paste it right here. I'm going to call it 0.8 as well, and what this one is doing is again just creating my linear model, but the data that we use for this one is the train data set. So I'm going to run this one and let's see what that looks like. As you can see, it's pretty much the same as we had before, because we are using the same kind of data. The important part now is the prediction: we are going to use this lm.fit 0.8 to predict the new data of the test data set.

You know what, I'm going to change the name of this variable so that it's easier to see. I'm going to call it multi linear model fit. Just run it. Then we predict: here I use the multi linear model as the predictor, and the new data is test, just as before. Let's see how that worked. Again, just as before, we calculate the error, and the error for this one is prediction 0.8 minus the actual value of the yearly amount spent. Let's run that one, and again let's run the RMSE and the MAPE, and let's print that out. So here you have the results for the multiple linear regression.

I hope I didn't lose you in that one. We did exactly the same thing as we did when evaluating the simple regression model; the only difference is that we added more variables as predictors. As you can see, the results are different: the error measurements are much better. R2 is almost equal to one, and that's
pretty good, because we are working with the test data set. Before, we had used a simple regression... where is it? Here it is: these were our previous values for the root mean square error, for the mean absolute percentage error and for R squared. As you can see, the error went from forty-two dollars to eight dollars, the mean absolute percentage error went from 0.07 to 0.01, and R squared went from 0.65 to 0.98. So we have improved our model by a lot, and that's pretty much how you run multiple regressions in R.

So yeah, the final point is that by using a multiple linear model we have created a much more accurate predictor of the response variable. Even though we are using a pretty simple mathematical tool, we are able to predict much better and to explain the data much better by using a multiple regression model. R2 went from 0.65 to 0.92, RMSE went from 47 to nine dollars, so pretty good.

All right, so there you go: that was how to create a multiple linear regression and how to train a linear regression model in R. I hope you found it useful. Please like and subscribe if you liked it, and I'll see you next time.
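The whole evaluation workflow from the walkthrough (random 80/20 split, fit on the training rows, predict on the test rows, then RMSE and MAPE) can be sketched as follows. This is an illustrative sketch under stated assumptions: the data frame `shop`, its columns, the simulated effects and the helper name `score_model` are hypothetical and stand in for the real e-commerce CSV.

```r
# Synthetic stand-in data (assumed columns and effects, for illustration only).
set.seed(1)
n <- 500
shop <- data.frame(
  session_length    = rnorm(n, mean = 33, sd = 1),
  time_on_app       = rnorm(n, mean = 12, sd = 1),
  membership_length = rnorm(n, mean = 3.5, sd = 1)
)
shop$yearly_spent <- -1000 + 25 * shop$session_length +
  38 * shop$time_on_app + 61 * shop$membership_length + rnorm(n, sd = 10)

# 80/20 split: sample 80% of the row indices, then use negative indexing
# to take every remaining row as the test set.
train_rows <- sample(seq_len(nrow(shop)), size = 0.8 * nrow(shop))
train <- shop[train_rows, ]
test  <- shop[-train_rows, ]

# Hypothetical helper: fit a formula on train, score it on test.
score_model <- function(formula) {
  fit  <- lm(formula, data = train)
  pred <- predict(fit, newdata = test)
  err  <- pred - test$yearly_spent
  c(
    RMSE     = sqrt(mean(err^2)),                   # root mean square error
    MAPE     = mean(abs(err / test$yearly_spent)),  # as a fraction, not a percent
    R2_train = summary(fit)$r.squared               # computed on the training rows
  )
}

results <- rbind(
  simple   = score_model(yearly_spent ~ membership_length),
  multiple = score_model(yearly_spent ~ session_length + time_on_app + membership_length)
)
print(round(results, 3))
```

Passing `data = train` to `lm()` avoids the `attach()` pitfall met earlier in the walkthrough, where the attached full data set was picked up instead of the training subset. Note also that `summary(fit)$r.squared` describes the fit on the training rows, whereas RMSE and MAPE here are measured on the held-out test rows.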

Original Description

Linear regression is a model used in statistics and data science to find patterns in data. This is an example of how to use R to create a linear regression model with a single variable and with multiple variables. This is a complete project with examples and a walkthrough of all the steps of the data analysis process. It's far from being the most accurate method, but it's instrumental if some variables of your data are correlated. Here we see linear regression explained with R programming using RStudio. The analysis goes through a data science project that is very easy and simple for beginners to replicate.

🔥 LINKS
👉🏼 Check the code for this video: https://github.com/alejandro-ao/ecommerce-project-r
📖 My Linear Regression Article on Medium: https://medium.com/@alejandro-ao/a-simple-explanation-of-linear-regression-cb6126afe4c2

⏰ Timestamps: 0:00 Introduction, 1:21 Setup and structure, 9:05 EDA - plots and insights, 20:14 The selected variable for the model, 25:44 Fit linear model, 32:24 Residual analysis, 38:27 Make predictions and evaluate their results, 52:30 Multiple regression, 57:35 Evaluating the multiple regression, 1:02:15 Conclusion

This video teaches linear regression in R through a full project for beginners, covering data analysis, predictive modeling, and model evaluation. It demonstrates how to use R, ggplot2, and the lm function to fit linear regression models and evaluate their performance.

Key Takeaways
  1. Import data using the read.csv command
  2. Analyze data structure and distribution of variables
  3. Plot data to find correlations and get a feel for the data
  4. Create a linear regression model using the lm function
  5. Evaluate model using R-squared and residual analysis
  6. Perform multiple regression with multiple variables
💡 Multiple regression can improve model accuracy by incorporating multiple variables, as demonstrated by the increase in R2 value from 0.65 to 0.98 and the decrease in mean absolute percentage error from 0.07 to 0.01.
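In the walkthrough, the R2 figure is read from the summary of the model fitted on the training rows, so it describes the training fit rather than the held-out data. A minimal sketch of how a test-set R2 could be computed alongside it is shown here; the data frame `d`, its columns `x` and `y` and the simulated numbers are assumptions for illustration, and measuring the total sum of squares around the test-set mean is one common convention among several.

```r
set.seed(1)

# Synthetic data with assumed names and effects; not the tutorial's data set.
n <- 500
d <- data.frame(x = rnorm(n, mean = 3.5, sd = 1))
d$y <- 280 + 61 * d$x + rnorm(n, sd = 45)

# Same 80/20 split pattern as in the walkthrough.
idx   <- sample(seq_len(n), size = 0.8 * n)
train <- d[idx, ]
test  <- d[-idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# R squared reported by summary() describes the fit on the training rows.
r2_train <- summary(fit)$r.squared

# Out-of-sample R squared: 1 - SSE / SST on the held-out rows,
# with SST measured around the test-set mean.
sse <- sum((test$y - pred)^2)
sst <- sum((test$y - mean(test$y))^2)
r2_test <- 1 - sse / sst

print(c(R2_train = r2_train, R2_test = r2_test))
```

The two values are usually close when the model is simple and the split is random, but only the second one says how well the model explains rows it has not seen.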

Chapters (10)

0:00 Introduction
1:21 Setup and structure
9:05 EDA - plots and insights
20:14 The selected variable for the model
25:44 Fit linear model
32:24 Residual analysis
38:27 Make predictions and evaluate their results
52:30 Multiple regression
57:35 Evaluating the multiple regression
1:02:15 Conclusion