Data Science Portfolio Project: Regression #1 | Data Science with Marco

Data Science with Marco · Beginner · 📐 ML Fundamentals · 5y ago

Skills: Supervised Learning 90% · ML Maths Basics 80% · ML Pipelines 70%

Key Takeaways

The video demonstrates a data science portfolio project using regression to predict the age of abalone from physical measurements, utilizing libraries such as pandas, numpy, and scikit-learn, and covering data exploration, visualization, and model training.
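
As a rough sketch of the loading step covered in the video, the snippet below reads a headerless file in the abalone.data layout with pandas and attaches column names. The inline sample rows, the load_abalone helper, the COLUMNS constant, and the use of the names argument are assumptions made for illustration; the video describes setting the column names manually once the file has been read.

```python
import io

import pandas as pd

# Illustrative rows in the abalone.data layout (comma-separated, no header line).
# These values are placeholders for the sketch, not guaranteed real records.
SAMPLE = """\
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
"""

# Column names as listed in the transcript; the file itself carries none.
COLUMNS = [
    "sex", "length", "diameter", "height",
    "whole_weight", "shucked_weight", "viscera_weight", "shell_weight",
    "rings",
]


def load_abalone(source):
    """Read a headerless abalone-style CSV and attach column names (hypothetical helper)."""
    return pd.read_csv(source, header=None, names=COLUMNS)


data = load_abalone(io.StringIO(SAMPLE))

# Per the transcript, adding 1.5 to the ring count gives the age.
data["age"] = data["rings"] + 1.5
print(data.head())
```

With the real file, the same helper would be called with a path such as data/abalone.data in place of the in-memory sample.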

Full Transcript

Hi everyone, and welcome to Data Science with Marco. Today we're doing our first portfolio project on regression. The goal of this video is to walk you through an end-to-end data science project, from collecting the data to running experiments and reporting the best results. This will help you build a portfolio of data science projects so you can showcase your knowledge and methodology, and you will also learn about different aspects of the data science workflow, something that is a bit hard to cover in only one video. This project will be separated into two videos. In part 1, which is this video, we are going to collect the data, explore the data, and build a baseline model. In the following segment, we will run other experiments to try to improve upon the baseline model. So if you guys are ready, let's get started.

In the description you should have a link to the UCI Machine Learning Repository, where we'll take a look at the Abalone data set. This is the data set we are going to use for our project. If you scroll down, you will see the data set information, which describes the problem. In this case we want to predict the age of an abalone from physical measurements, because typically you need to cut through the shell, stain it, and count the number of rings through a microscope, which is, according to this page, a boring and time-consuming task. So instead we want to build a machine learning model to estimate the age of the abalone, or rather the number of rings, from different measurements that are easy to get, such as the length, diameter, height, whole weight, etc. Under attribute information you can see the name of each variable and whether it is continuous, nominal, etc. Go back up and click on Data Folder. We are interested in the abalone.data file, so if you click on it you can save it and place it where you need to. In my case I won't do it because I already have it on my computer. Once that is done, we can go to our
Jupyter notebook. As you can see, I have a data folder where I put my abalone.data file, and we can start a new notebook with Python 3. I will rename this notebook as portfolio project regression, and then I will put the title in the first cell of the notebook as well. It's just a personal preference; you don't have to do it if you don't want to. I did take the time to describe the task, the problem, and the solution, as well as to give some information about the data set. This is all information that we saw on the UCI website. I think it is good practice to do that in your notebook, because if you ever present this project, or if someone who doesn't know about the project comes to see your notebook, they can get some information about it just from looking at the notebook.

Let's start off this project by importing our libraries. We will need pandas as pd, numpy as np, and matplotlib.pyplot as plt, and we'll add the Jupyter magic %matplotlib inline. We will come back to this cell later on, because this is where we will put all of our imports; it keeps the notebook cleaner.

Once these are imported, we can import our data set and start exploring it a little bit. I will define a variable holding the path to my data set, which is data/abalone.data, and then we can read it with pandas: data will be equal to pd.read_csv, where we pass in the path of the file, header equal to None because the file does not have any headers, and index_col equal to False. As I said, the data file does not have a header, so we don't have the names of the columns and we have to set them manually here. That's okay; there are not many columns. The first one will be sex, then we have the length, the diameter, and the height, and after that the whole_weight, the shucked_weight, the viscera_weight,
and finally the shell_weight. The last column will be our target, which is the number of rings. We can display the first five rows of our data set with data.head(), and there you have it. As you can see, we have our column names, and we also have our categorical variable sex, with male, female, and infant, and the last column, rings. If you add 1.5 to this column you get the age of the abalone; all of that is explained on the UCI website.

Now we are ready to do some exploratory data analysis, or EDA. This is a step where we take a look at our data set to see, for example, whether classes are imbalanced, whether we have outliers, etc. Let's go back to our import cell. We will need a new library called pandas_profiling, and from that we will import ProfileReport. This is not a standard library that comes with Anaconda, so you will have to install it. For that, open your Anaconda Prompt, then do a quick Google search for pandas profiling anaconda; it should be the first result. There you can see the command we need to run to install the package. Simply copy and paste the command into the Anaconda Prompt, then press Enter to run it and install the package. I will not do it because I've already installed it on my computer.

Once this step is done, you can use the library. We'll say that profile is equal to ProfileReport of data, and then you can display the report simply by typing profile, and you should get the following output. This does a lot for us automatically. Not only do we have a bunch of information about the number of variables, the number of observations, whether we have any missing data, etc., we also have warnings that tell us that some variables are highly correlated with other variables: diameter is correlated with length, whole weight with diameter, shucked weight with whole weight, and so on. And so those
variables should be rejected for analysis because they are correlated with one another. That is actually great; we will disregard those variables later on. If you go down, you can see a detailed analysis of each variable. Here we have sex; we know it's a categorical variable, and it's fairly balanced. Then we have the length, where you can toggle details and get a bunch of statistics, both quantile statistics and descriptive statistics. You can see a histogram of it, the most common values, and the extreme values. Feel free to pause the video at any time and go deep into this report to get to know more about your variables and the data set you're working with. As you can see, diameter is not even considered, because the report says we should disregard it since it's highly correlated with other variables. And of course we have the number of rings, which is our target. We also have some heat maps of correlations, which further support that there is correlation between variables in our data set, and then the report shows the head of the data frame as well.

Now we will do some more exploration on our end. We will plot a few of those features against the target, just to see what could be a good baseline model. Start by setting the size of our plot, and we'll do a scatter plot of the length: plt.scatter, where the x-axis is data length, the y-axis is data rings, and I want the points to be black. Then I will set the x label as length of abalone in millimeters and the y label as the number of rings. Finally, you can display the plot by typing plt.show(), and you should get the following plot. I was thinking of maybe using a linear regression. From this plot we see that there is some kind of trend, so linear regression may not be the best model, but it may be a good approximation. We'll have to test it.

Now let's take a look at another
scatter plot of another feature. Let's copy and paste the code above and change the length to the height of the abalone: instead of data length it will be data height, and we'll change the x label to height of abalone, still in millimeters. Running this cell, you should get the following plot. As you can see, we have two outliers in this plot, here on the far right. We will see the effect of leaving them in or taking them out later on, when we start modeling. Otherwise, a straight line might be a fairly good fit in this case, at least for a baseline model.

Now let's do something a bit more interesting and try to plot those two features in the same plot, which has to be a 3D plot. We go back up to our cell of imports, and from mpl_toolkits.mplot3d we import Axes3D. Run this cell, and then we can go back down and make our 3D plot. Again, I define a figure and set the size, so figsize is equal to 16 by 8, and ax is equal to plt.axes with projection equal to 3d. Then we set the axis labels: on the x-axis we have the height of the abalone in millimeters, the y label is the length of the abalone, in millimeters again, and the z label is the number of rings. Now we can write ax.scatter3D: on the x-axis it's going to be data height, then data length, and finally on the z-axis data rings, and again I want the color to be black. Now we can show our 3D plot, and you should get the following. Again we see the two outliers in this 3D plot, and again we can see some kind of trend: as both the length and the height increase, the abalone tends to be older, which translates into a higher number of rings. This was just a cool visualization that I thought would be fun to try out with you guys.

So now we are ready to build our baseline
model for this problem. But before that, we will do some feature engineering, because the values of sex cannot be letters; we have to turn them into numbers. For that we'll do one-hot encoding, and you will see the result in a minute. encoded_data will be equal to pd.get_dummies of data, and that's all you have to do. If you take a look at encoded_data now, you should get the following. We don't have the sex column anymore; we have sex_F, sex_I, and sex_M. If it's a male you get 0, 0, 1; if it's a female you get a 1 for sex_F; and if it's an infant you get a 1 for sex_I. So it added some columns to our data set, and the shape now shows eleven columns.

Now let's move on to the exciting topic, at least according to me, which is modeling. The first step is to split our data set into a dev set, which we'll use to train and to test, and a validation set. The training set will be equal to encoded_data.iloc, where we take everything up to index 4099, and the validation set will be the last 78 examples. This is a bit arbitrary. Usually you can do a split of 80%, 10%, 10%; in my case, because we don't have a lot of data, I figured that 78 would be a good validation set. Then you can print the shapes just to make sure: if you add 4,099 and 78 you should get 4,177, which confirms that we did everything correctly.

Now that we have our training set and validation set, we will train our baseline model, which will be a simple multiple linear regression: simple because the model in itself is simple, but a multiple linear regression because we will use more than one feature. Before we move on, we need to go back up and import some libraries for modeling. From sklearn.linear_model we will import linear
regression, that is, LinearRegression. Then from sklearn.metrics we import mean_squared_error. We will actually use the root mean squared error as our metric to evaluate our models, because this is a regression problem. And from sklearn.model_selection we import train_test_split. Make sure to rerun the cell every time you add new imports so that they are available in the notebook.

The first step here is to define our features. X will be equal to the training set, where we only take the length and the height as features, and our target variable will be the number of rings: from the training set you select the column rings, then .values.reshape, passing -1, 1. We will also use our validation set, so X_val will be the validation set, where again you select the same columns as above, length and height. This validation set is useful to simulate how our model would perform on unseen data: you train the model, then you deploy it, and you get some new data that you've never seen before. How would the model perform? And of course y_val is the validation set's rings column, with .values and .reshape as before.

Now we are ready to do our split. X_train, X_test, y_train, y_test is equal to train_test_split: pass in your X, pass in your y, set the test size equal to 10%, and also set a seed for the random state so that you get the same results every time you run the notebook. You set it equal to 42, the answer to everything. Run the cell.

Now let's initialize our model. lin_reg, for linear regression, is equal to LinearRegression(), which initializes the model, and then we fit it: lin_reg.fit, passing in X_train and y_train. After this line the model is fitted, so now we can compute the test root mean squared error. For that we write lin_reg_pred equal to lin_reg.predict, so we make predictions
with our fitted model, and we predict on X_test. Then the test RMSE will be the mean_squared_error, where you pass in what you are comparing against: because we predicted on X_test, we compare with y_test, and then you pass in your predictions, which is lin_reg_pred. Finally we set the parameter squared equal to False, so that we get the root mean squared error, and then we can print it: test RMSE is equal to test_rmse, the variable in this case.

Now we'll take a look at our validation root mean squared error. At this point, feel free to pause the video and try it on your own as a little exercise, because the code will be very similar. lin_reg_pred_val, for validation, will be equal to the predictions, but this time on X_val. Then the validation RMSE is, just like above, mean_squared_error, where we pass in y_val and the predictions on the validation set, and of course squared is equal to False so that we get the RMSE. As above, you can print it. Once you run this cell, you see that our test RMSE is approximately 2.57 and the validation RMSE is 1.57.

So that's it for this video. We have our baseline model and we explored the data. In the next one we will improve on this model and run even more experiments. See you in the next one, guys.
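
The baseline modeling steps described in the transcript can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is synthetic (random features standing in for length and height), the rmse helper is hypothetical, and the square root is taken explicitly with numpy instead of passing squared=False, because availability of that parameter depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for length, height and rings.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(300, 2))
y = 3.0 + 10.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0.0, 1.0, size=300)

# Hold out 10% of the rows as a test set, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Fit a multiple linear regression on the two features.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


test_rmse = rmse(y_test, lin_reg.predict(X_test))
print(f"Test RMSE: {test_rmse:.2f}")
```

The validation RMSE would be computed the same way, by calling rmse on the validation target and the model's predictions for the validation features.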

Original Description

Part 2: https://www.youtube.com/watch?v=bbwIG0kXxhM
Dataset: http://archive.ics.uci.edu/ml/datasets/Abalone
Full project notebook: https://github.com/marcopeix/datasciencewithmarco/blob/master/portfolio_project_regression.ipynb

In this video, we walk through the first part of a project to start off or to add to your data science portfolio. The objective is to build a machine learning algorithm to predict the age of abalone from physical measurements only. In the first part, we collect the data, explore it and build a baseline model.

This video teaches how to build a regression model to predict the age of abalone from physical measurements, covering data exploration, visualization, and model training, and provides hands-on experience with popular libraries like pandas and scikit-learn.

Key Takeaways

Collect data from the UCI machine learning repository
Explore data by reading the CSV file with pandas
Set column names manually in the pandas dataframe
Install and use the pandas profiling library for exploratory data analysis
Create a scatter plot of length vs number of rings
Split data into training, validation, and test sets
Define features and target variable
Fit a simple multiple linear regression model
Compute test and validation root mean squared error
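
The encoding and splitting steps in the list above can be sketched as follows. The frame here is synthetic (random values, not the abalone data), the hold-out size n_val is a hypothetical choice for this toy frame, and passing columns=["sex"] to pd.get_dummies is an assumption for clarity; the video calls pd.get_dummies on the whole data frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the abalone frame (random values, for illustration only).
rng = np.random.default_rng(0)
n_rows = 200
data = pd.DataFrame({
    "sex": rng.choice(["M", "F", "I"], size=n_rows),
    "length": rng.uniform(0.1, 0.8, size=n_rows),
    "height": rng.uniform(0.01, 0.25, size=n_rows),
    "rings": rng.integers(1, 30, size=n_rows),
})

# One-hot encode the categorical column: sex becomes sex_F, sex_I, sex_M.
encoded = pd.get_dummies(data, columns=["sex"])

# Hold out the last rows as a validation set, mirroring the iloc split in the video.
n_val = 20  # hypothetical size for this toy frame
dev = encoded.iloc[:-n_val]
val = encoded.iloc[-n_val:]

# Split the dev set into train and test portions (10% test, fixed seed).
X = dev[["length", "height"]]
y = dev["rings"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(encoded.columns.tolist())
print(len(X_train), len(X_test), len(val))
```

Keeping the validation rows out of train_test_split entirely means they play no part in fitting, so they can stand in for unseen data when the model is evaluated.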

💡 The video highlights the importance of exploratory data analysis and data visualization in building a robust regression model, and demonstrates how to use popular libraries like pandas and scikit-learn to implement these steps.
