Data Preparation with Sci-kit learn and Pandas | End-to-End ML Project Tutorial - Part 2

Harshit Tyagi · Beginner ·📐 ML Fundamentals ·5y ago

Key Takeaways

This video tutorial demonstrates data preparation with scikit-learn and pandas, covering one-hot encoding, missing-value imputation, a custom attribute transformer, and pipeline creation for an end-to-end machine learning project.

Full Transcript

Hello everyone, welcome back to the channel. This is the second part of the end-to-end machine learning project series. Up until now, we have seen exploratory analysis and which features were important. In this video we are going to create a pipeline. First, we are going to see the SimpleImputer class to handle missing values. Then we are going to look at one-hot encoding of the categorical variables. The third thing we are going to do is set up our transformations in a pipeline using the Pipeline class of scikit-learn. Then we are going to automate all of these steps and use the base class to write a custom transformation that adds the new variables that turned out to be really important for us, like acceleration on cylinder and acceleration on power. So we are going to look at all of these transformations and how to automate the whole process, so that at the end you pass just the data frame to the transformation pipeline and get back the prepared data, ready to go into your machine learning model to train it and get predictions out of it. So let's get started. [Music]

In the second part we are focusing on data preparation. There are mainly four steps that we have to perform. The first one is to handle categorical attributes with one-hot encoding; scikit-learn has a OneHotEncoder class that we can use directly. Then we are going to focus on data cleaning using imputation, with the SimpleImputer class, again from scikit-learn. We are using pre-built classes from libraries like scikit-learn because that makes it easier to automate all those steps, and we don't have to tweak things ourselves to make them compatible. So focus on using these classes
that are available for us to take care of all the transformations we saw in part one, such as imputing values with medians. All of those things are available as classes in scikit-learn, so we are going to see how we can use them. The third step is attribute addition. Here we will build a custom class on top of the base class of the scikit-learn library, using the TransformerMixin class, to add the two attributes that we wanted to add, if you remember. The fourth step is setting up the data transformation pipeline for numerical and categorical columns. First we'll do the numerical and the categorical attribute transformations separately, and then we are going to wrap them up in one pipeline, so that all of those tasks are automated. All we have to do is pass the data, and it will transform both kinds of attributes and give us the prepared data.

Starting off, I have first imported the general-use libraries. This is part two, and you can get hands-on with this notebook by cloning the repository; the link is in the description. So let's run this. I am reading the data again, and right after reading the data I am creating a training and test split, so that right at the beginning I set aside my testing data and don't look at it. So I have my training and testing sets separate. Another thing I have to do, since the target variable and the feature variables are all in one data frame, is segregate them. In my data variable I'll have only the features, so I drop MPG, the target variable, from it, and in the data_labels variable I am storing my
target variable. So let's run this. If you look at data now, this is what it looks like without the MPG column. We have segregated the target variable, and if you count what we are left with, one, two, three, four, five, six, seven, we have seven attributes to work with; the target variable was the one we dropped.

The first thing we have to do next is preprocess the origin column. It has the classes one, two and three, and as we saw in part one we are going to rename them to India, USA and Germany. I have written this preprocess function, which you should focus on now. It works on the origin column with the map function, which takes a dictionary and, for each row, maps the keys one, two and three to the values India, USA and Germany. It does that for all the rows and gives us the new data frame. I have passed my data, the feature variables, to this function; the origin column was already present. If we run this, you see that the origin column has been preprocessed and the classes one, two and three have been renamed. I've called the result data_tr, for data transformed, that is, the preprocessed data; you should keep an eye on the variables that I'm using. If I now look at data.info(), which gives us information about the data frame, you see 314 rows here, so there are some missing values; as we saw, we have six missing values. The origin column is now converted to the object data type, which tells us that this is a categorical column, a qualitative variable that we'll have to deal with. So that's all okay, that's
fine. Now we are going to isolate the origin column. Since this is the only categorical variable, what you do is segregate all the categorical variables; here I have set this categorical variable aside. If you take a look, data_cat is my separate data frame, which contains only one variable, origin.

Now we are going to one-hot encode this categorical attribute. You could have used the get_dummies function from pandas, but since we are building the pipeline in scikit-learn, it's better to use the OneHotEncoder class, which gives us a sparse matrix. First I've imported the OneHotEncoder class from the preprocessing module. The next step is to instantiate the class; I have created an object called cat_encoder. This cat_encoder object has a fit_transform method, which takes the categorical variable, computes all the classes that are available, here India, USA and Germany, and creates one-hot vectors for us, returning a sparse matrix. Why a sparse matrix? We have only three classes in this data set, but if you had, let's say, 100 classes, it would have created 100 columns, and with thousands or tens of thousands of classes it would have created that many columns, with mostly zeros and only a few ones. That takes up a lot of memory. The sparse matrix compresses that data by storing only the non-zero entries and gives you a compact matrix to work with. If you want to convert this into an array
to look at it, a NumPy array that is a little easier to compute with and to visualize, you invoke the toarray method on this SciPy sparse matrix. If I run this, you see just the top five rows, but it has done this for all the 318 rows in the training set. This is the one-hot sparse matrix for the categorical data; I have invoked the toarray method and it has converted it into a 2D array for us. The categories of the cat_encoder object tell you the classes: Germany, India and USA. These are the three classes that it has converted into one-hot vectors.

So we have transformed our categorical variable, but we are going to see how this step becomes a single statement in a pipeline. I am just showing you what is going to happen in the back end; we are going to build and automate this soon.

The next step is handling the missing values using SimpleImputer. We saw that we have 314 rows in the horsepower column, which has six rows missing. First we are going to segregate the numerical columns. I've created this numerical data using data.iloc, segregating all the variables except the last one. I've used iloc to capture all the rows: the first slice is for the rows, and the second slice says that I don't want the last column, I want all the ones before my origin column. There is another, better method to capture all the numerical columns; we're going to see that in a bit.
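A minimal sketch of the origin mapping and one-hot encoding steps described above. The toy rows and the `Origin` column name are assumptions made for illustration; the notebook in the linked repository may differ.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for the tutorial's training set.
data = pd.DataFrame({"Origin": [1, 2, 3, 1, 2]})

# Map the numeric codes to the country names used in the video.
data["Origin"] = data["Origin"].map({1: "India", 2: "USA", 3: "Germany"})

# Keep the categorical column as a one-column data frame (2D input).
data_cat = data[["Origin"]]

cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat)  # SciPy sparse matrix

# Convert to a dense NumPy array only to inspect it.
dense = data_cat_1hot.toarray()

# Categories are stored in sorted order, one array per encoded column.
categories = list(cat_encoder.categories_[0])
print(categories)
print(dense[:5])
```

Each row of `dense` has a single 1 in the column of its category, and the column order follows `categories`.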
So for now, let's look at what we get. You see that from the original data we have segregated only the numerical columns. We have six columns here: cylinders, displacement, horsepower, weight, acceleration and model year.

Now, to impute the missing values, we first import the SimpleImputer class from the impute module of the scikit-learn library. Which value should we use to impute? The SimpleImputer class gives you a few strategies: you can use the mean, you can use the median, and there are other options. One thing I highly recommend is that you read the documentation and the examples given on the scikit-learn page; again, I'll add the link in the description. These are really well documented classes and functions that you should go through every now and then; I keep doing that myself.

All right, so we have created the imputer by instantiating SimpleImputer with the strategy set to median, as we saw in the first part. Now all I need to do is invoke the fit method and provide the data; num_data contains all the numerical attributes. What it gives you is a SimpleImputer object. If we look at its statistics, you see the medians of all six columns. If we compare them with the pandas method of calculating the median, you'll see that both are the same. So you could simply have used the median method that we saw in part one; that is also fine for imputing the values. Finally, to fill those missing values, you invoke the transform method and pass the numerical
data, which is the numerical data frame, and I have stored the result in X. What it gives you is a 2D NumPy array, an ndarray object, and we can see that the values are all imputed. Because data frames are easier to look at, we create a data frame from the transformed numerical attributes, passing the same column names and the same index as in the numerical data frame, but this time we'll have no missing values. You see that horsepower now has 318 rows as well. We had 314 earlier and now we have 318 values, so four were missing; I think two of the missing values are in the testing data set now. So that's good. When we test the data we walk it through the same preprocessing, and we'll see how we are going to do that. All right, so the missing values are now filled in.

The next thing we have to do is add the attributes that we decided on in part one: acceleration on power and acceleration on cylinder. For that we are going to use the BaseEstimator class. The scikit-learn library has this BaseEstimator class on which you can define your own methods; you can override the transform and fit_transform methods. The TransformerMixin class is a mixin class that allows you to build all sorts of transformers that you would want to add. I have the head of the numerical data here, and we see that the indices of the columns in the numerical data are 0, 1, 2, 3, 4.
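The imputation step described above can be sketched as follows. The toy values, the index labels, and the reduced two-column `num_data` are assumptions for illustration; the tutorial works with six numerical columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one missing horsepower value.
num_data = pd.DataFrame(
    {
        "Horsepower": [130.0, np.nan, 150.0, 95.0],
        "Weight": [3504.0, 3693.0, 3436.0, 2372.0],
    },
    index=[10, 11, 12, 13],
)

imputer = SimpleImputer(strategy="median")
imputer.fit(num_data)

# The learned medians match what pandas computes; both skip NaN values.
print(imputer.statistics_)
print(num_data.median().values)

# transform returns a 2D NumPy array with the missing values filled in.
X = imputer.transform(num_data)

# Rebuild a data frame with the original column names and index.
data_tr = pd.DataFrame(X, columns=num_data.columns, index=num_data.index)
print(data_tr)
```

Fitting on the training data only and reusing the fitted imputer on the test data keeps the test set from influencing the medians.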
So index four is acceleration, index zero is cylinders, and we need horsepower for acceleration on power, so horsepower is index two. In this class, I have imported the BaseEstimator class and the TransformerMixin class, which we are going to inherit from to create our own class, and then we will override a few methods. I have the acceleration index, which is four, the horsepower index, which is two, and the cylinder index, which is zero, the first column.

One thing we have to take care of is that we will be providing data frames, but the BaseEstimator and the TransformerMixin operate on ndarrays. The data frame that we provide gets converted into an ndarray, and the output is also an ndarray, so we get back a 2D array rather than a data frame. We can feed that same ndarray into our machine learning model.

Now, if you see, I have defined this custom attribute adder class, which inherits from the BaseEstimator class and the TransformerMixin class. In the constructor, acceleration on cylinder is what we always need, but acceleration on power I have passed as a flag set to True. This gives you the flexibility of not adding that attribute: if you pass False to the class while instantiating it, it will not add the acceleration on power attribute. Moving on, I have the fit function here, where we do not have to change anything. The main work is in the transform function. Here we first need to divide the acceleration by the cylinder values. We have this 2D array, X; X is the 2D
array that the numerical data has been converted into, and it is what we provide to this function. Once you pass it in, it takes all the rows of the acceleration column and divides them by the cylinder column values. If the acceleration on power flag is true, we also calculate acceleration on power, which is the acceleration column divided by the horsepower values. So we have our acceleration on power values, the result of that division, we have acceleration on cylinder, and we have our original 2D array. To combine these into one 2D array, we use np.c_ from the NumPy library. It concatenates the two arrays that have just been created, acceleration on power and acceleration on cylinder, along the second axis of the X 2D array, adding those two columns at the end. We end up with a complete array with the two attributes added. If acceleration on power is not true, then only acceleration on cylinder is added to the main array. Now if we run this: we had six columns, and after adding these two we should have eight values, and you can see that we have eight values here in our numerical data. These last two values, 4.75 and 0.3, are the acceleration on power and acceleration on cylinder attributes that have been added.

The next task is to create a pipeline out of it. So far we have seen all the individual elements, for the categorical variable and for the numerical variables. Now it's time to combine them and
create an entire pipeline, so that we pass the data frame and it gives us the prepared data. For that we have the Pipeline class from the scikit-learn pipeline module, and the other class we are going to add is StandardScaler; it's a best practice to always scale your numerical values. First we segregate our numerical columns. This is the better method to segregate all the numerical columns that I was talking about: here we have selected all the numerical attributes into the numerical data, and the categorical attribute is kept separate. So this is my numerical data frame. Next I create the numerical pipeline. What are the three main tasks it has to perform? First I have to impute the missing values, so I have passed SimpleImputer with the median strategy; you pass a name along with each step. Then I have the attribute adder, which is the custom attribute adder class that I created, and then I have the StandardScaler object. So I provided all the objects that the data has to pass through in this pipeline. Then the transform method of this pipeline is called with the numerical data frame, and it generates the transformed numerical data for us. You see that all of the values are now scaled; these are all eight values, including the two new attributes that we added. So we have SimpleImputer, the custom attribute adder and StandardScaler, all cascaded in one pipeline. This is our numerical pipeline. The next thing that we do is we
can also add the categorical processing. If you have, let's say, multiple categorical variables, you might have to create a separate pipeline for them as well. Here, all we need to do is create a list of numerical columns, and for categorical attributes we only have origin. To create the full pipeline we are using ColumnTransformer, a class from the compose module of the scikit-learn library that transforms different columns or subsets of columns. Again, look at the documentation of these classes so you get a better understanding of how they take their arguments and process them one by one. I have instantiated ColumnTransformer as my full pipeline. This full pipeline takes two types of columns, numerical and categorical. For "num" I call the numerical pipeline that I have defined over here, and I pass the numerical columns of my data frame. Similarly, for the categorical part all I need to do is one-hot encode those values, so I have provided my categorical attributes over here. This will transform both my numerical and my categorical attributes. All I need to do is call the fit_transform method on my data and it will give us the results. If we look at the first row of this prepared data, you see the one-hot encoded values of the categorical column along with the eight numerical values, so we have 11 attributes over here. These are the values of the prepared data, which is now ready to go into our machine learning model.

So that was all about data preparation and how we prepared data using a few scikit-learn classes. We learned how to create pipelines, and how to create custom transformers using the BaseEstimator
class and the TransformerMixin class. Then we learned how to combine the numerical and the categorical transformations in just one pipeline using the ColumnTransformer class, and at the end we saw that we can simply provide our raw data and it will do all the preprocessing and generate prepared data that is ready to use for our machine learning models. In the next video we're going to deep dive into machine learning. We're going to create some functions for the pipeline steps that we have created over here, and then we're going to deep dive into a few algorithms: train them, look at hyperparameter tuning, look at cross-validation, and at the end finalize one model that will be ready to deploy and monitor. So I'll catch you in my next video.
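Putting the pieces from the walkthrough together, here is a minimal sketch of the custom attribute adder, the numerical pipeline, and the full ColumnTransformer. The class name `CustomAttrAdder`, the step names, the column names, the toy rows, and the use of `select_dtypes` to pick numerical columns are all assumptions for illustration and may differ from the notebook. The mixin is listed before `BaseEstimator`, which is the inheritance order the scikit-learn documentation recommends.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column positions in the numerical array, as described in the video.
CYL_IX, HPOWER_IX, ACC_IX = 0, 2, 4


class CustomAttrAdder(TransformerMixin, BaseEstimator):
    """Append acceleration/cylinders and, optionally, acceleration/horsepower."""

    def __init__(self, acc_on_power=True):
        self.acc_on_power = acc_on_power

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        acc_on_cyl = X[:, ACC_IX] / X[:, CYL_IX]
        if self.acc_on_power:
            acc_on_power = X[:, ACC_IX] / X[:, HPOWER_IX]
            return np.c_[X, acc_on_power, acc_on_cyl]
        return np.c_[X, acc_on_cyl]


# Hypothetical toy rows with one missing horsepower value.
data = pd.DataFrame({
    "Cylinders": [8, 4, 6, 4],
    "Displacement": [307.0, 97.0, 225.0, 140.0],
    "Horsepower": [130.0, np.nan, 100.0, 90.0],
    "Weight": [3504.0, 2130.0, 3233.0, 2264.0],
    "Acceleration": [12.0, 14.5, 15.4, 15.5],
    "Model Year": [70, 70, 76, 71],
    "Origin": ["India", "USA", "Germany", "India"],
})

num_attrs = list(data.select_dtypes(include="number").columns)
cat_attrs = ["Origin"]

# Impute, add the ratio attributes, then scale the numerical columns.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attrs_adder", CustomAttrAdder()),
    ("std_scaler", StandardScaler()),
])

# Route numerical and categorical columns to their own transformers.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attrs),
    ("cat", OneHotEncoder(), cat_attrs),
])

prepared_data = full_pipeline.fit_transform(data)
print(prepared_data.shape)  # 6 numerical + 2 added + 3 one-hot columns
```

The imputer runs before the attribute adder so that the divisions never see a missing horsepower value, and the adder receives a NumPy array, which is why it indexes columns by position.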

Original Description

Part 2 of the End-to-End ML Project Series. The series will cover everything from Data Collection to Model Deployment using Flask Web framework on Heroku!

- Link to the Dataset: http://archive.ics.uci.edu/ml/datasets/Auto+MPG
- GitHub Repository: https://github.com/dswh/fuel-consumption-end-to-end-ml
- My Task CheatSheet: https://towardsdatascience.com/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0
- Video on Data Science Portfolio: https://www.youtube.com/watch?v=_ANbV9lVA-M
- Book for basics of Machine Learning: http://themlbook.com/wiki/doku.php

You can connect with me on:
- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- Medium, where I write: https://medium.com/@harshit_tyagi
- Twitter: https://twitter.com/tyagi_harshit24


This video tutorial covers data preparation techniques using scikit-learn and pandas, including one-hot encoding, missing-value imputation, and pipeline creation. It demonstrates how to apply these techniques to turn a raw data frame into prepared data for machine learning models.

Key Takeaways
  1. Read data from file
  2. Create training and test split
  3. Segregate features and target variables
  4. Drop target variable from features data frame
  5. Store target variable in separate data frame
  6. Apply one-hot encoding to categorical variables
  7. Impute missing values using SimpleImputer
  8. Create pipeline using Pipeline class from scikit-learn
  9. Scale numerical values using StandardScaler
💡 Using scikit-learn and pandas for data preparation lets you wrap the preprocessing steps in one pipeline, so the same transformations can be applied to training and test data.
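The first five steps in the list above can be sketched as follows. The transcript does not say which splitting function the notebook uses, so `train_test_split`, the 80/20 ratio, the `MPG` column name, and the toy rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy rows; the tutorial reads the Auto MPG data from a file.
df = pd.DataFrame({
    "MPG": [18.0, 15.0, 26.0, 24.0, 22.0, 31.0, 27.0, 14.0, 19.0, 35.0],
    "Cylinders": [8, 8, 4, 4, 6, 4, 4, 8, 6, 4],
    "Horsepower": [130.0, 165.0, 46.0, 95.0, 95.0, 65.0, 88.0, 215.0, 100.0, 69.0],
})

# Set the test data aside right at the beginning.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Features only: drop the target column from the training set.
data = train_set.drop("MPG", axis=1)

# Target values, kept aligned with the feature rows by index.
data_labels = train_set["MPG"].copy()

print(data.shape, data_labels.shape)
```

`drop` returns a new data frame, so `train_set` still contains the target column if it is needed later.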
