How do I encode categorical features using scikit-learn?

Data School · Beginner ·📐 ML Fundamentals ·6y ago

Key Takeaways

This video demonstrates how to encode categorical features using scikit-learn's OneHotEncoder and ColumnTransformer, and how to use these tools in a pipeline for machine learning model building and cross-validation.
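
As a rough illustration of the encoding step summarized above, here is a minimal sketch of OneHotEncoder applied to a single column. The four-row DataFrame is made up for illustration (it only borrows the Embarked column name from the Titanic data). Calling toarray() to get a dense array is an assumption chosen to work across scikit-learn versions; it is not necessarily what the video does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the Titanic "Embarked" column (not the real data).
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

ohe = OneHotEncoder()

# fit_transform expects 2-D input, hence the double brackets.
# The default output is a SciPy sparse matrix; toarray() makes it dense
# so the encoded rows are easy to read.
encoded = ohe.fit_transform(df[["Embarked"]]).toarray()

print(ohe.categories_)  # categories learned during fit, in sorted order
print(encoded)          # one column per category, one row per input row
```

Each output column corresponds to one learned category, so the first row ("S") has its 1 in the last column.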

Full Transcript

Next question, from Vishwanatha: "I was wondering if you could help me to understand the process of building a pipeline using the scikit-learn pipeline module. It would be great if you could use scaler and one-hot encoder as part of the tutorial as well." This is a great question.

So what is the point of Pipeline? The point of Pipeline is to chain steps together sequentially. Normally you put preprocessing steps and model-building steps in a pipeline. Now, why should you build a pipeline? There are two main reasons, and these reasons will become more clear after I go through this lesson; at least one of them will.

Number one: it allows you to properly cross-validate a process rather than just a model. In other words, when you're doing cross-validation, like cross_val_score, normally you just pass a model object to it. Well, there are cases when that is not going to give you accurate results, because you're doing the preprocessing outside of the cross-validation. So Pipeline, generally speaking, is useful because you can cross-validate a process that includes preprocessing as well as model building.

The other reason Pipeline is useful is that you can do a grid search or a randomized search of a pipeline, which lets you search both tuning parameters for the model and parameters for the preprocessing steps. Normally when you use grid search, you probably think of it as "I'm going to do a search of parameters for a model," but sometimes you want to search parameters for preprocessing steps in combination with the model. What I mean by that is, let's say I want to check different values of C for logistic regression, and I want to check different values of the strategy for imputing missing values. You can do a grid search that allows you to check both of those at once.

Okay, now I've got a notebook and I'm gonna try to code it in real time. If I get really behind, I'll maybe copy and paste some stuff, but this is
a humongous topic, so I'm going to narrow it down. I am gonna talk about Pipeline, of course. I'm gonna talk about OneHotEncoder. I'm not gonna talk about StandardScaler, though I have a resource I can share on that. But I am gonna talk about ColumnTransformer, because if you're using OneHotEncoder in a pipeline, you probably need to use ColumnTransformer, so I'm going to teach you that. The other thing that's really important is that I'm going to be using scikit-learn 0.20; it's probably 0.20.2. If you're running a version of scikit-learn prior to 0.20, you're not gonna have ColumnTransformer, and OneHotEncoder is gonna work slightly differently, so you won't be able to reuse this code unless you're using at least 0.20.

Okay, so with all of that being said, let me go over to my empty notebook. I am gonna try to do this quickly, and it is hard to type and talk at the same time, but I will do my best. I'm gonna read in an actual dataset, the Titanic dataset, which I know is overused, but it's a useful way to teach this topic. What do I have? I have a DataFrame, and you can follow along if you like; that is a URL you can read from. This is a DataFrame of 891 rows and twelve columns. So what are my columns? Well, they're all these. Let's just take a quick look at null values. So these are the columns that have null values; most have no null values, and a few columns have null values.

Okay, I've picked out a couple of columns I am going to focus on. Like you do with any machine learning problem, you have to select features, and I'm just gonna select a few features for teaching purposes. So here is what I'm going to select. I'm going to use .loc, and for the moment I'm gonna say all rows, but I'll change that in a second. I'm gonna select the Survived column, which is our target, the Pclass column, the Sex column, and the Embarked column. If you've never heard of the Titanic dataset, you're predicting
whether passengers on the Titanic survived or did not survive. So Survived is the target, and then we're going to use these three features: Pclass is passenger class, Sex is male or female, and Embarked is the port they embarked from. Now, everything I'm demoing gets a lot more complicated if I leave in the two rows that have null values, so I'm going to exclude any rows in which Embarked is missing, and you'll see in a second what I mean. Okay, so that is my DataFrame. I'm going to overwrite my existing DataFrame with this. If we take df.shape, you will see I've lost two rows, 891 to 889, and I'm down to four columns. Here are my four columns, and there are now no null values. Let's just take a look at the head of the DataFrame, and here we are. Once again, let me just say: Survived is our target. Passenger class is technically a categorical variable, but for reasons I'm not gonna explain, we're treating it as a numeric variable, and that's actually the best approach. And then I've got two categorical variables.

Okay, now I'm gonna start by cross-validating a model that predicts Survived using only Pclass, and then I will show you how to use Pipeline as well as OneHotEncoder as well as ColumnTransformer to, well, you'll see. The first thing I always do is define my X, so I'm going to select all rows and only the Pclass column. So that's my X, and my y is df.Survived. X.shape is 889 by 1; y.shape is 889 by nothing. Just remember, even if you have only one feature in your X, in your training matrix, it needs to be two-dimensional. It can't be one-dimensional, for reasons that take a while to explain, but that is on purpose.

All right, let's say I'm building a model. We're gonna just start with logistic regression, because it's a classification problem and I love logistic regression: from sklearn.linear_model import LogisticRegression. And then I tend to name my models like
this: logreg equals LogisticRegression. It's gonna throw me a warning if I don't specify a solver, and I could explain why, but it's not super important. You can read the documentation if you want to decide which solver to use, but this one will work just fine. They all have different limitations, strengths, and weaknesses, and the scikit-learn documentation talks about it.

So anyway, let's evaluate our model. From sklearn.model_selection, which is new, which is how they reorganized in like 0.18 maybe, because this used to be in sklearn.cross_val, I think, or cross_validation. So from sklearn.model_selection import cross_val_score, and I want to cross-validate my logistic regression model using cross_val_score, so I'm passing it... sorry, I have to stop talking for a second so I can remember what to type... scoring equals 'accuracy'. All right, so let me run this, and we'll just talk about it for one second. I'm cross-validating a logistic regression model with one feature, which is passenger class, using five-fold cross-validation and checking the accuracy. The accuracy is 67.8%: the mean accuracy, so the mean of the five folds of cross-validation. And I always like to check how that compares to the null accuracy, and the null accuracy is 61%. The null accuracy is the accuracy you would get by always predicting the most frequent class. So, you know, you generally want to beat null accuracy. You don't actually have to in all cases, and that's another complicated topic, but anyway, the point of all this so far is to quickly build my basic cross-validated model.

Okay, now here is the motivating question. You might say, "I want to add more features to my model and cross-validate it. How do I do that?" The answer is Pipeline, but first we have to talk about encoding the Sex column and the Embarked column. So let's go back and show the head of the DataFrame. For encoding categorical
features, if they are unordered, usually the best approach is called dummy encoding, which is also known as one-hot encoding. scikit-learn calls it one-hot encoding, pandas calls it dummy encoding; it's the same thing. Now, we're gonna do it in scikit-learn, and there's a bunch of reasons for that, which I will explain at the end of this lesson.

All right, so if you want to use dummy encoding, here's how you do it: from sklearn.preprocessing import OneHotEncoder. This is a funny name, but it has a reason for it, and I'll save that for another time. Then you instantiate a OneHotEncoder just like you instantiate a model; you make an instance of it. Okay, there's my OneHotEncoder, and for teaching purposes I need to make it not sparse, to make it dense. Don't worry about it; you basically never have to write that in the real world. So OneHotEncoder, like any scikit-learn transformer, has a fit and a transform method, and a fit_transform that allows you to do both at the same time. So if I pass it a DataFrame column, the OneHotEncoder is going to one-hot encode this Sex column.

Now let's look at these first three rows. What it has done is create a NumPy array with two columns. The first column represents female, and the second column represents male. In other words, this is saying male, female, female, and you can confirm that with the first three rows of the DataFrame: male, female, female. So this is the dummy encoding of the Sex column. Now, what I'm writing right now is code you don't have to write; this is teaching code. The import and the instantiation are what you would need to do for real. So this is just teaching code, lines 20 and 21.

Okay, now if I one-hot encode the Embarked column, let's look at the categories. The three categories of Embarked are C, Q, and S; as such, it creates three columns. The first column represents C, the second column represents Q, the third column represents S. So we can tell this
is S, C, S for the first three rows, and you can see here S, C, S, so you can confirm that it did properly one-hot encode that data.

Now, generally speaking, the way I've taught people to do dummy encoding or one-hot encoding is in pandas. I taught that because up until version 0.20 of scikit-learn, it was painful to do in scikit-learn, in my opinion. It had way too many steps. Now it's easier, and it's better for a bunch of reasons that I will explain. In the old days, I would dummy encode in the DataFrame. My DataFrame right now is four columns; I would add two more columns for Sex (I might drop one of them, we don't have to get into that) and I would add three columns for Embarked. So my DataFrame would keep getting wider and wider, but I would do that in pandas, and then I would select out Pclass and all of the dummied columns, as they're called, to become my X, and then pass them to cross-validation. So that's how I would do things previously, but we are instead going to do it with Pipeline.

So here's what we're gonna do. Previously I defined X as one feature, but now I'm gonna define it as three features, so I'm actually going to do df.drop('Survived', axis='columns'). And you'll notice that I've defined my feature matrix, capital X, and it's just these three columns.

Now the next thing I need to show you is ColumnTransformer, so let me import it: from sklearn.compose import make_column_transformer. But before I write this, here is the use case for ColumnTransformer: you use ColumnTransformer anytime you have features in your DataFrame that need different preprocessing. What do I mean? Well, dummy encoding, or one-hot encoding, is a preprocessing step. I want to apply it to Embarked and Sex, but I do not want to apply it to Pclass, because
we're treating that as a numeric variable, not a categorical variable. So I am going to create a column transformer that accomplishes that objective. Let me just type it and then I will explain what I'm doing. You have to get used to how it's done, but I think it'll make sense in a second: Embarked, and then, wait, comma... no comma after... and then remainder equals 'passthrough'. Okay, here is my column transformer. I make a column transformer which says I want to apply a OneHotEncoder to these two columns in my DataFrame, and the remainder of the columns I want to pass through.

Let me show you what that means. With a column transformer, you will do a fit_transform, and we'll pass it our training data, and here's what comes out. You'll notice the first two columns are the one-hot encoded Sex, the next three columns are the one-hot encoded Embarked, and the final column is a passthrough of the Pclass column, because I didn't want to encode it. So, to be very clear, if I had a bunch more columns that I wanted to apply other preprocessing steps to, I would have added them to the column transformer. I could say I want to one-hot encode Sex and Embarked, I want to apply a SimpleImputer to some other column, I want to do something else to some other column, and then the remainder you can either pass through or, I think, ignore. So I'm using ColumnTransformer to do my preprocessing on all of my columns at the same time, without doing it in pandas.

Okay, we are finally at the pipeline step: from sklearn.pipeline import make_pipeline. There's both Pipeline and make_pipeline, and for reasons I won't get into, I strongly prefer make_pipeline, but they're functionally equivalent. So here is my pipeline. I'm gonna make a pipeline of my column transformer and my logistic regression model. Remember, Pipeline is for chaining steps together, so I'm
creating a pipeline that does the following things: it takes the data that I pass it, it transforms the columns, which is my preprocessing step, and then it builds my model, which is logistic regression, on the result. So what am I doing with all of this, if you've gotten lost? I am now going to pass the entire pipeline to cross_val_score: X, y, cv equals 5, scoring equals 'accuracy' (I guess you can't autocomplete that), dot mean. All right, there you go: our accuracy went up to 0.77, meaning that adding these two features improved our model from 0.67 to 0.77, which is great.

I want to do a few things, just so you know where this lesson is going. Number one, I'm gonna explain what just happened. Number two, I'm gonna show you how to make predictions on new data. Number three, I'm gonna show you a quick recap of all the code in case you got lost. And number four, I'm gonna comment at a high level on why we're doing what we're doing. So those are my steps; let's keep this moving.

What happens when I run this line of code? It means I am cross-validating my entire pipeline. In other words, I am not cross-validating a model; I am cross-validating a pipeline of steps that includes preprocessing of data and model building. cross_val_score is going to do my five-fold split of the data, and then after it splits the data, it will run the pipeline.

The point of cross-validation, remember, is to evaluate your model so that you can decide whether you're building a good model, and then you can use it to make predictions on new data. So let's go ahead and make up some new data to pass to the model. I'm gonna make something called X_new, and as a kind of lazy way of doing this, I'm gonna sample five rows from X. I know technically I shouldn't pull from training data to make my out-of-sample data, because it's not out-of-sample data, but it's the fastest way to
create a good DataFrame with five rows. In the real world I would be pulling actual out-of-sample data for this, but I just need some data to make predictions on. Okay, so there's my data. Now, normally, if you've built your model and evaluated it and you want to make predictions, what do you do? You do model.fit. Well, I don't have a model; I have a pipeline that includes a model. So I do pipe.fit, training it on X and y, and then I do pipe.predict on X_new. So pipe.fit is like model.fit, except it runs the preprocessing as well as the model fitting, and pipe.predict is just like model.predict, except it runs the preprocessing on X_new. Think about that for a second: this is X_new, and it has strings in there, so this only works because the pipeline is doing the dummy encoding of the new data, of the out-of-sample data, and then making predictions. So it's actually quite amazing what it's accomplishing in that one line of code.

Okay, moving on. The next thing I want to show you is a recap of what we just did, in case you got lost. It's actually very little code. This is all of the, quote, important code that we just wrote, meaning that I've eliminated the exploratory code and the teaching code, and this is going to be in the notebook that I will share with you after the webcast. I'll summarize it briefly. Here are my imports. Here's where I read in my DataFrame and selected my columns. I defined my X and y here. I made my column transformer, made up of a OneHotEncoder, and passed through the remaining columns. Here's my model. Here's my pipeline: that's a column transformer and a model. Here's cross-validation of the pipeline. Here is building my X_new DataFrame, here is fitting, and here's making predictions on new data. So this is everything we just did; if you got lost, this is the workflow.

All right, I'm gonna leave that up on screen while I comment on this for the next couple of minutes. What we just
did is use OneHotEncoder, ColumnTransformer, and Pipeline. Here's the question: why would we not use get_dummies instead? Because we could have used pandas get_dummies, appended those columns to the DataFrame, and then defined X based on the pandas DataFrame. That's what I used to teach, as I was saying. So why is the approach that I just showed you better? It is better in four important ways.

Number one, you don't have to create a gigantic DataFrame. You'll notice that OneHotEncoder does not affect our DataFrame, so our DataFrame stays three or four columns and that's it, and that's easier to explore and easier to manage.

Number two, when new data comes in, you don't have to use get_dummies on it. If you are using get_dummies on all of your training data, then when out-of-sample data comes in, you still have to use get_dummies on it. Plus, you're gonna have problems if your out-of-sample data has different categories than your in-sample data. Let's say our in-sample data had C, Q, and S, but our out-of-sample data only had C and Q. Well, get_dummies is not going to produce the correctly shaped DataFrame, and this is going to cause problems.

Number three, you can do a grid search, as I was mentioning, with both model parameters and preprocessing parameters.

And finally, reason number four: in some cases, preprocessing outside of scikit-learn can make cross-validation scores less reliable. This gets complicated, but basically, if you're using a StandardScaler, if you're doing missing value imputation, if you're using text data, and in a variety of other circumstances, if you do your preprocessing before scikit-learn, your cross-validation scores are possibly going to be unreliable.

So those are four huge reasons why you should ultimately use the process I've laid out rather than get_dummies in pandas. I hope this video was helpful to you. If you'd like to join my monthly webcasts and ask your own
question, sign up for my membership program at the $5 level by going to patreon.com/dataschool. There's a link in the description below, or you can click the box on your screen. Thank you so much for watching, and I'll see you again soon.
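
The workflow recapped in the transcript can be sketched roughly as follows. This is not the notebook from the video: the twelve-row DataFrame is made up and only mimics the Survived, Pclass, Sex, and Embarked columns, the solver name "lbfgs" is an assumed choice (the transcript does not say which one was typed), and cv=3 is used instead of five folds only because the toy data is so small.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that only mimic the Titanic columns used in the lesson.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 3, 2, 2, 1, 3, 2, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male",
                 "male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "C", "Q", "S", "S", "C", "Q", "Q", "S"],
})

# Feature matrix (three columns) and target.
X = df.drop("Survived", axis="columns")
y = df["Survived"]

# One-hot encode Sex and Embarked; pass Pclass through untouched.
column_trans = make_column_transformer(
    (OneHotEncoder(), ["Sex", "Embarked"]),
    remainder="passthrough",
)

logreg = LogisticRegression(solver="lbfgs")  # assumed solver choice
pipe = make_pipeline(column_trans, logreg)

# Cross-validate the whole process: encoding is re-fit inside each fold.
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean())

# Fit on all the data, then predict on "new" rows that still contain strings.
# Sampling from X is the same shortcut the transcript admits to taking.
X_new = X.sample(5, random_state=1)
pipe.fit(X, y)
predictions = pipe.predict(X_new)
print(predictions)
```

Note that remainder defaults to "drop" in scikit-learn, so leaving out remainder="passthrough" would silently discard Pclass.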

Original Description

In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn? In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.

AGENDA:
0:00 Introduction
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?

CODE FROM THIS VIDEO: https://github.com/justmarkham/scikit-learn-videos/blob/master/10_categorical_features.ipynb

WANT TO JOIN MY NEXT LIVE WEBCAST? Become a member ($5/month): https://www.patreon.com/dataschool

=== RELATED RESOURCES ===
OneHotEncoder documentation: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
ColumnTransformer documentation: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline
My video on cross-validation: https://www.youtube.com/watch?v=6dbrR-WymjI&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=7
My video on grid search: https://www.youtube.com/watch?v=Gol_qOgRqfA&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=8
My lesson notebook on StandardScaler: https://nbviewer.jupyter.org/github/j

This video teaches how to encode categorical features using scikit-learn's OneHotEncoder and ColumnTransformer, and how to use these tools in a pipeline for machine learning model building and cross-validation. It covers the importance of categorical feature encoding, how to apply one-hot encoding, and how to evaluate model performance using cross-validation.
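
To make the evaluation step concrete, here is a small sketch of the baseline described in the transcript: cross-validating a logistic regression on a single feature and comparing the result with null accuracy (the accuracy from always predicting the most frequent class). The data is made up and only borrows the Survived and Pclass column names, and cv=3 and solver="lbfgs" are assumptions chosen for the toy example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up rows; only the column names come from the Titanic data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 2, 3, 3, 1, 3],
})

# X must be two-dimensional even with one feature, hence the double brackets.
X = df[["Pclass"]]
y = df["Survived"]
print(X.shape, y.shape)

logreg = LogisticRegression(solver="lbfgs")  # assumed solver choice
mean_accuracy = cross_val_score(logreg, X, y, cv=3, scoring="accuracy").mean()

# Null accuracy: the share of the most frequent class in y.
null_accuracy = y.value_counts(normalize=True).max()

print(mean_accuracy, null_accuracy)
```

A model is usually expected to beat the null accuracy, although, as the transcript notes, that is not a hard rule in every case.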

Key Takeaways
  1. Read in a dataset
  2. Select features for the machine learning problem
  3. Use a pipeline to chain pre-processing steps and model building steps
  4. Perform cross-validation of the pipeline
  5. Use grid search or randomized search of both model and pre-processing parameters
  6. Instantiate OneHotEncoder
  7. Fit and transform the OneHotEncoder on a data frame column
  8. Define X as the feature matrix
  9. Use a pipeline when adding features that need encoding to the model
  10. Create a ColumnTransformer to apply one-hot encoding
💡 Using a pipeline with OneHotEncoder and ColumnTransformer allows for more flexibility and control over the encoding process, and enables proper cross-validation of the model building process.
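
The grid search takeaway can be sketched as below, following the example given in the transcript of tuning C for logistic regression together with the imputation strategy. Everything specific here is an assumption for illustration: the data is made up, the Age column with missing values is not one of the features selected in the video, and the candidate values and cv=3 are arbitrary. The parameter names rely on the lowercase step names that make_pipeline and make_column_transformer generate automatically.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data; "Age" is a hypothetical column with missing values.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "Sex":      ["male", "female", "female", "male", "female", "male",
                 "male", "female", "female", "male", "female", "male"],
    "Age":      [22, 38, np.nan, 35, 54, np.nan, 2, 27, 14, np.nan, 58, 20],
})
X = df[["Sex", "Age"]]
y = df["Survived"]

# Different preprocessing per column: encode Sex, impute Age.
column_trans = make_column_transformer(
    (OneHotEncoder(), ["Sex"]),
    (SimpleImputer(), ["Age"]),
)
pipe = make_pipeline(column_trans, LogisticRegression(solver="lbfgs"))

# One grid covering a preprocessing parameter and a model parameter.
# Names follow the pattern <pipeline step>__<transformer>__<parameter>.
param_grid = {
    "columntransformer__simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

Calling pipe.get_params().keys() lists every tunable name if the generated step names are unclear.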

Chapters (12)

0:00 Introduction
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
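
One of the reasons given in the final chapter is that get_dummies can produce a differently shaped result when new data is missing a category. The sketch below illustrates that point with two made-up frames (the values C, Q, and S match the Embarked categories mentioned in the transcript, but the rows are invented). Using toarray() for a dense result is an assumption for readability.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up in-sample and out-of-sample data; the new data has no "S".
train = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
new = pd.DataFrame({"Embarked": ["C", "Q"]})

# pandas builds columns from whatever categories each frame happens to contain.
train_dummies = pd.get_dummies(train["Embarked"])
new_dummies = pd.get_dummies(new["Embarked"])
print(train_dummies.shape, new_dummies.shape)  # 3 columns vs 2 columns

# A fitted OneHotEncoder remembers the categories seen during fit,
# so the new data is encoded with the same three columns.
ohe = OneHotEncoder().fit(train[["Embarked"]])
new_encoded = ohe.transform(new[["Embarked"]]).toarray()
print(new_encoded.shape)
```

A model trained on three dummy columns cannot accept the two-column frame directly, whereas the encoder's output keeps the training layout.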