How do I use pandas with scikit-learn to create Kaggle submissions?
Key Takeaways
The video demonstrates how to use pandas with scikit-learn to create a Kaggle submission, using logistic regression on the Titanic dataset. It covers selecting feature columns with pandas, fitting a scikit-learn model, predicting on the test set, and writing the predictions to a CSV file, plus a bonus on saving a DataFrame to disk with to_pickle.
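The modeling part of the workflow can be sketched roughly as follows. This is a minimal sketch, not the video's exact notebook: the video reads the real Titanic files with pd.read_csv, while the two small DataFrames here are made-up stand-ins so the example runs without a download. The names feature_cols, X, y, logreg, X_new and new_pred_class are written the way the transcript appears to spell them out.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the Titanic training data (assumption: the real
# file has many more rows and columns; only the ones used here are included).
train = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6, 7, 8],
    'Pclass':      [3, 1, 3, 1, 3, 2, 1, 3],
    'Parch':       [0, 0, 0, 0, 0, 1, 2, 0],
    'Survived':    [0, 1, 0, 1, 0, 1, 1, 0],
})

# Hypothetical stand-in for the test data: same columns, but no 'Survived'.
test = pd.DataFrame({
    'PassengerId': [9, 10, 11],
    'Pclass':      [3, 2, 1],
    'Parch':       [2, 0, 0],
})

# Feature matrix: all rows, only the selected feature columns (a DataFrame).
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]

# Response vector: the column to predict (a one-dimensional Series).
y = train.Survived

# scikit-learn accepts the pandas objects directly, since they are numeric.
logreg = LogisticRegression()
logreg.fit(X, y)

# Build the same feature columns from the test data and predict a class per row.
X_new = test.loc[:, feature_cols]
new_pred_class = logreg.predict(X_new)

print(X.shape, y.shape, X_new.shape)
print(new_pred_class)
```

With the real data, the two stand-in DataFrames would be replaced by the pd.read_csv calls described in the transcript; the rest of the code would stay the same.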
Full Transcript
Hello and welcome back to my Q&A video series about the pandas library in Python. The question for today comes from an email I received from rai ed, and he says: "Can you please do a short video on creating pandas DataFrame objects from NumPy arrays and then writing to a CSV file? I had problems submitting the Titanic problem on Kaggle, but later found some code that I copied to get the job done." Okay, excellent question.

So let's start at the beginning here: what is Kaggle? Kaggle is a popular platform for doing competitive machine learning. And what is machine learning? In a sentence, I would say that machine learning is the semi-automated extraction of knowledge from data. Now, I'm not going to focus in this video on talking about machine learning. I have a video called "What is machine learning, and how does it work?" and I'll link to it in the description below if you're interested in learning more. In this video, I am going to focus on how to use pandas in partnership with scikit-learn for machine learning.

As always, we need an example dataset, and I'm going to use the dataset from Kaggle's Titanic competition. You can follow along, and by the end of this video you will have a submission file for Kaggle. So we're going to import pandas as pd, and then I'm going to create a DataFrame called train, and it'll be pd.read_csv, and I'm going to pass it the URL bit.ly/kaggletrain. It's called train because this is our training data: it's the data that our model is going to learn from.

Let's look at the head. Each row represents a passenger aboard the Titanic, and the Titanic was a ship that sank a long time ago. For each passenger, it has some attributes of that passenger as well as this column Survived, which is 1 if the person survived and 0 if the person did not. So the goal of this competition is, for the test set, which we'll get to later, to predict survival based upon other characteristics of the passenger.

The first thing we need to do is create our feature matrix, X. These are the features, the columns that our model is going to learn from. I'm going to create a Python list called feature_cols, and it's going to be a list of strings, and I've selected Pclass and Parch. Parch stands for the number of parents and children, and Pclass is for passenger class. I'm just creating a Python list, and these are the two columns I've selected as my features. So I need to create X, my feature matrix, and we're just going to say train.loc, which is how I select rows and columns from the DataFrame. I want all rows and I want the feature_cols columns. So this is a pandas DataFrame I've created, and we can check the shape, and it is 891 rows by 2 columns. So there are my features.

Now I want to create my response vector, also known as the target vector, the thing you are trying to predict, and that's the Survived Series. So I'm just going to say y = train.Survived. This, as I said, is a pandas Series, and we'll check the shape, and it's 891 by nothing: it's only got one dimension. So X and y are pandas objects. There's no need to actually convert them to NumPy arrays, and in fact scikit-learn will understand the X and y objects as long as they're the right shapes and they are fully numeric. So no need to convert them to different object types.

All right, now we're ready to build our scikit-learn model, and I'm not going to explain this code because it would take quite a while. I actually have a video series, which I'll link below, called "Introduction to machine learning with scikit-learn." It's 4 hours long, and I would love for you to check it out if you're interested in learning about machine learning. So here's my quick code to create a classification model. I'm going to say from sklearn.linear_model import LogisticRegression, which is a classification model. Then I'm going to instantiate it: logreg = LogisticRegression(). And then I'm going to fit the model to my training data: logreg.fit(X, y). So we have now fitted our machine learning model.

I'm going to return to some pandas code, because we need to read in our test data. The test data is the data we're going to make predictions on, and so I'm going to say test = pd.read_csv and pass it bit.ly/kaggletest. If we check out the head, we will see that it looks very similar, except it is missing the Survived column. The reason for that is that we don't know the Survived column for these people: that is what we are predicting. So I need to create a new X, and I'm going to call it X_new, from the testing data. I'll say test.loc, all rows and these two feature columns, and we'll check the shape. Whoops: X_new.shape, and it is 418 rows by 2 columns. So there are 418 observations in the testing data, and we need to make 418 predictions.

So we're ready to make our predictions, with one more line of scikit-learn code. I'm going to say new_pred_class, meaning for this new data, what are the predicted classes? It's a classification problem. And I'll just say logreg.predict(X_new).

Now I've actually got everything I need; I just need to put it in a CSV file. Kaggle asks for a CSV file with two columns. The first column is going to be the passenger ID from the testing set, so test.PassengerId, and it's just these numbers; it's a pandas Series. The second thing it wants is the predicted classes, so new_pred_class, and it's just a bunch of zeros and ones, our 418 predictions.

So how do I create a CSV file with these two columns? I'm actually just going to use the DataFrame constructor, so pd.DataFrame. There are lots of different types of objects you can pass to a DataFrame, and it will figure out how to create the DataFrame. You could pass it a NumPy array, you could pass it a list of lists; I will cover that in a future video. But for today I'm going to pass it a dictionary. I'm going to say 'PassengerId': test.PassengerId, and then I'm going to say 'Survived': new_pred_class. So what am I doing? I'm saying I want two columns that should be called PassengerId and Survived. test.PassengerId are the values that I'll put in the first series, new_pred_class are the values that I'll put in the second series, and pandas will automatically align them next to one another. So we create that, and here's what it looks like.

Now, one thing I need to make sure of is that PassengerId is the first column. Dictionaries are unordered, so it's impossible to guarantee that PassengerId will come first. So actually the easiest way to ensure a particular column is the first one in the DataFrame is to set it as the index, so I'm going to just say .set_index('PassengerId'). All I was trying to do is make sure it is the very first column. There are other ways to do it; I just like this way.

The final step is that we're going to use a DataFrame method called to_csv, and all you have to do with to_csv is pass it the name of a file that you want to create. So I'm going to just say sub.csv, which is short for submission, a Kaggle submission. When I run that line of code, it creates that CSV in my working directory. It automatically includes the index of the DataFrame as the first column. If you ever want to exclude the index, that's just a parameter to to_csv that you can use. So if you've been following along, you could submit this file to Kaggle right now. It's not going to perform very well, but my goal in this video is to show you the workflow for using pandas and scikit-learn together.

As always, I'm going to end with a bonus, and here's the bonus: have you ever wanted to save a Python object, such as a DataFrame, to disk? There's one way you can do that just in Python for any Python object, but with DataFrames we are going to use a method called to_pickle, because the objects are called pickle objects. So we'll just say train.to_pickle, and all you have to pass it is the name of the file you want to create, and I'll just say train.pkl. So I'm just taking the train DataFrame and saving it to disk. I could put it on a flash drive, bring it to a different computer or something like that, and load it right up without going through all the steps I had to do to create the DataFrame. So I'll pickle it, that's what it's called. Then when you want to read it into pandas, we're just going to say pd.read_pickle, and again we just pass it that same file name, and it will look for that file in the working directory and read it in, and we've now got our DataFrame back.

So that is it for today. Thank you so much for joining me. As always, please click Subscribe if you'd like to see more videos like this, and please leave me a question or a comment in the comment section below. I'd love to hear what you have to say. And again, that's it, so thank you so much for joining me, and I hope to see you again soon.
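The step of building the submission file can be sketched as below. The passenger IDs and predictions are made-up stand-ins for test.PassengerId and new_pred_class, and the file is written to a temporary directory (an assumption for this sketch; the video writes sub.csv to the working directory). The second variant, selecting the columns explicitly and passing index=False to to_csv, is not shown in the video; it illustrates one of the "other ways" the transcript alludes to. As far as I know, on current Python versions dictionaries preserve insertion order and pandas builds the columns in that order, so the ordering concern raised in the transcript is less pressing than it was when the video was recorded.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for test.PassengerId and new_pred_class.
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
new_pred_class = [0, 1, 0]

# Pass a dictionary to the DataFrame constructor: keys become column names.
submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': new_pred_class})

out_dir = tempfile.mkdtemp()

# Approach from the video: make PassengerId the index, which to_csv writes
# as the first column of the file.
path_a = os.path.join(out_dir, 'sub.csv')
submission.set_index('PassengerId').to_csv(path_a)

# Alternative (not from the video): fix the column order explicitly and
# leave the default integer index out of the file.
path_b = os.path.join(out_dir, 'sub_no_index.csv')
submission[['PassengerId', 'Survived']].to_csv(path_b, index=False)

with open(path_a) as f:
    text_a = f.read()
with open(path_b) as f:
    text_b = f.read()

print(text_a)
```

Both variants should produce the same two-column file; which one to use is largely a matter of taste.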
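The pickle bonus can be sketched as a simple round trip. The small DataFrame is a made-up stand-in for train, and the file goes to a temporary directory rather than the working directory (both are assumptions for this sketch).

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the train DataFrame.
train = pd.DataFrame({
    'Pclass':   [3, 1, 3],
    'Parch':    [0, 0, 1],
    'Survived': [0, 1, 1],
})

# Save the DataFrame to disk as a pickle file.
path = os.path.join(tempfile.mkdtemp(), 'train.pkl')
train.to_pickle(path)

# Read it back; values, column names, dtypes and index are restored.
restored = pd.read_pickle(path)

print(restored.equals(train))
```

One caveat worth keeping in mind: loading a pickle file can execute arbitrary code, so read_pickle should only be used on files from a trusted source.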
Original Description
Have you been using scikit-learn for machine learning, and wondering whether pandas could help you to prepare your data and export your predictions? In this video, I'll demonstrate the simplest way to integrate pandas into your machine learning workflow, and will create a submission for Kaggle's Titanic competition in just a few lines of code!
VIDEO: What is machine learning, and how does it work? https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1
VIDEO SERIES: Introduction to machine learning with scikit-learn: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
== RESOURCES ==
GitHub repository for the series: https://github.com/justmarkham/pandas-videos
Kaggle's Titanic competition: https://www.kaggle.com/c/titanic
"loc" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
"DataFrame" constructor documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
"to_csv" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
"to_pickle" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html
"read_pickle" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html