How do I handle missing values in pandas?

Data School · Beginner ·📰 AI News & Updates ·10y ago

Key Takeaways

The video covers handling missing values in pandas, including detecting, dropping, and filling missing values using methods such as isnull(), dropna(), and fillna(). The pandas library is utilized for data manipulation and analysis.

Full Transcript

hello and welcome back to my Q&A video series about the pandas library in Python and the question for today is how do I handle missing values in pandas okay so uh let's get an example data set so we can figure out what to do um we'll import pandas as PD as always and then our example data set for today is uh a data set of UFO reported sightings so UFO equals pd. read CSV and then uh bit. lefo reports okay so we'll read in that data into a data frame and this time let's take a look at the tail the last five rows okay so we see this value n n and what does that mean is that a string like it's a color or what is that it actually uh stands for not a number n an but what it means conceptually is a missing value now uh what does that mean well it means that when building the data frame uh read CSV detected what it thought was a missing value and it flagged it with this special Nan and you'll see why in a bit now let's take a look at the source file for a second this is on GitHub and uh here are the last five rows and for instance in this very last row we've got ebore comma comma oval and this is a CSV file so normally you'd have something between there but it's missing and if we compare that to the data frame we see eore Nan meaning missing and then oval okay so why is it missing I don't know I'd probably guess that there was a web form where people submit these UFO reports and color um where some of the fields are optional and some are required okay so uh let's take a look at some methods for working with missing values uh the first method I want to show you is called is null and it's a data frame method so we'll say ufo. is null. tail okay and the way is null works is it uh shows a false if something's not null if it's something's not missing uh and it shows true if it is missing so you'll see City for these last five rows it's not missing at all so it's false false false false false false for colors reported it's Mo it's true true true false true because uh these ones are missing now it's only because pandas uses this special value na that is null can detect them and produce trues and falses as a result now there's actually a uh inverse method called not null so ufo. not null. tail and you'll see that we just get the exact opposite now why are these methods useful is null and not null uh let me show you some examples so one is a really cool pandis trick which is to say ufo. isnull dosm and what that tells us is the number of missing values in each of the columns now how did that work I'm going to break it down for you because there's a couple things you have to understand so um let's create a pretend Panda Series so pd. series and I'll pass it a list of booleans so I'll say true false true okay so that's I've just created this series of three booleans okay and if I do do sum you're thinking how do you sum booleans well it turns out that when you do that if you ask pandas to do a mathematical operation with booleans it will convert TRS to one and falses to zero so it shows you the number of truths okay so that is hinting you towards why this worked because I'm summing booleans but you can tell that I'm summing them at the column level and why is that well if you think back to a previous video about the axis parameter um sum is doing axis equals 0 by default and if you remember how the axis Works uh for a math matical operation axis Z is saying I want you to do the operation in the direction of the zero axis the row axis so I want you to do a sum across the rows which means Down The Columns okay so in summary when we say ufo. is null. sum ufo. is null is creating this data frame of trues and falses and then the sum is doing column sums of the trues and falses converting the ones to the trues to one and the falses to zero okay so that's why that worked um this is something I do on basically every data frame uh for every data set I read in because I want to know where the missing values are okay now I want to show you another way to use uh this um isnull method and um isnull it turns out is also a series method so I can say like ufo. city.is null okay now I'm actually going to pass that series of booleans to this using this bracket notation we've seen many times I'm passing it a condition that will be used to filter the data frame and when I do that we know that there are only 2 five missing values in all of this in in all of this city series we can see the 25 rows where city is missing okay so that's how we examine a subset of the data frame by only looking at a uh a portion of the data frame okay all right so that's some basic functionality but I haven't answered the question what should you do about missing values and there's not one right or wrong answer to that I can't tell you what to do I can just give you some options and pandas for how to do it and you can decide okay so it really depends upon the data set and the analysis you're doing or problem you're trying to solve so one common thing to do is to drop missing values okay so let's see some options there I'm going to say ufo. shape just to uh see what is the shape of this data frame 18,24 one rows five columns now there's a data frame method called drop Na and if I say ufo. dropna how equals any. shape now this is the default so you don't have to put it but I want to emphasize it drop in a how equals any that means I want you to drop a row if any of its values are um missing okay so if I do that we lose almost all of the rows except for about 2,500 rows okay so this is all that it's left that all that's left if we drop a row rows in which any of the five columns have a missing value okay now I want to mention for a second that um there is an in place parameter here for drop Na and that is false by default which is why if I if I do ufo. shape right now it hasn't changed from the original this is just um because it's in place equals false it just temporarily drops it and then they come right back okay all right now let's see how we might modify drop na okay so now let's try ufo. dropna how equals all. shape and that says only drop a row if all of its values are missing and we knew that wasn't going to drop any rows because up here we saw that state and time don't have any missing values so how equals all is not going to make any changes um in this case Okay so one more option for drop na is called subset so ufo. dropna subset equals and I'm going to pass it a list of columns to consider so I'm going to say City and shape reported and I'm going to uh say how equals any okay so this is saying I want to drop uh a row if either City or shape reported are missing if any of those two are missing for a given row I want you to drop drop it I want you to drop the row okay and we're left with about 15,000 rows okay now you can change this again you can change this to how equals all and that means only drop a row if all of these columns have a missing value for that row and this time only four rows gets dropped because there were only four rows where both city is missing and shape reported is missing okay all right as always we're going to end with a bonus and we're going to talk about filling missing values and um so let's take a look at our our old good old friend value counts so UFO bracket uh shape reported do value counts and value counts is saying how many times did this value occur in in the shape reported Series so the most common was light we've got dis triangle Circle cigar rectangle egg lots of fun values in here okay so you might be thinking wait a minute that doesn't add up to 18,000 and you're right because by default missing values are excluded okay so we if we say drop na equals false we can see that shape reported has those 2644 missing values okay and we saw that uh up here those 2644 missing values Okay so what if I wanted to F fill those in maybe I thought and and what to do here depends a lot upon an understanding of your data but you might think well maybe I should fill in the Nan's with various that kind of makes sense for for it being missing so to do that I'm going to use another method called fill na so I'm going to say UFO shape reported fillna value equals various okay and I'm going to say in place equals true which means make the change in place okay in in the UFO data frame and I'll run that and if I rerun the value counts you'll see that various now has the 333 that were already there and then all of the missing values were converted to various okay so that's it for today uh thanks so much for joining me as always if you want to see more videos like this please click subscribe and then let me know in the comment section below if you have a tip for others or a comment or a question and that's it so I hope to see you again soon

Original Description

Most datasets contain "missing values", meaning that the data is incomplete. Deciding how to handle missing values can be challenging! In this video, I'll cover all of the basics: how missing values are represented in pandas, how to locate them, and options for how to drop them or fill them in. SUBSCRIBE to learn data science with Python: https://www.youtube.com/dataschool?sub_confirmation=1 JOIN the "Data School Insiders" community and receive exclusive rewards: https://www.patreon.com/dataschool == RESOURCES == GitHub repository for the series: https://github.com/justmarkham/pandas-videos "read_csv" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html "isnull" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html "notnull" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html "dropna" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html "value_counts" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html "fillna" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html Working with missing data: http://pandas.pydata.org/pandas-docs/stable/missing_data.html == LET'S CONNECT! == Newsletter: https://www.dataschool.io/subscribe/ Twitter: https://twitter.com/justmarkham Facebook: https://www.facebook.com/DataScienceSchool/ LinkedIn: https://www.linkedin.com/in/justmarkham/
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Data School · Data School · 37 of 60

1 Setting up Git and GitHub
Setting up Git and GitHub
Data School
2 Navigating a GitHub Repository - Part 1
Navigating a GitHub Repository - Part 1
Data School
3 Forking a GitHub Repository
Forking a GitHub Repository
Data School
4 Creating a New GitHub Repository
Creating a New GitHub Repository
Data School
5 Copying a GitHub Repository to Your Local Computer
Copying a GitHub Repository to Your Local Computer
Data School
6 Committing Changes in Git and Pushing to a GitHub Repository
Committing Changes in Git and Pushing to a GitHub Repository
Data School
7 Syncing Your GitHub Fork
Syncing Your GitHub Fork
Data School
8 Allstate Purchase Prediction Challenge on Kaggle
Allstate Purchase Prediction Challenge on Kaggle
Data School
9 Troubleshooting: Updates Rejected When Pushing to GitHub
Troubleshooting: Updates Rejected When Pushing to GitHub
Data School
10 Hands-on dplyr tutorial for faster data manipulation in R
Hands-on dplyr tutorial for faster data manipulation in R
Data School
11 ROC Curves and Area Under the Curve (AUC) Explained
ROC Curves and Area Under the Curve (AUC) Explained
Data School
12 Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Data School
13 What is machine learning, and how does it work?
What is machine learning, and how does it work?
Data School
14 Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Data School
15 Getting started in scikit-learn with the famous iris dataset
Getting started in scikit-learn with the famous iris dataset
Data School
16 Training a machine learning model with scikit-learn
Training a machine learning model with scikit-learn
Data School
17 Comparing machine learning models in scikit-learn
Comparing machine learning models in scikit-learn
Data School
18 Data science in Python: pandas, seaborn, scikit-learn
Data science in Python: pandas, seaborn, scikit-learn
Data School
19 Selecting the best model in scikit-learn using cross-validation
Selecting the best model in scikit-learn using cross-validation
Data School
20 How to find the best model parameters in scikit-learn
How to find the best model parameters in scikit-learn
Data School
21 How to evaluate a classifier in scikit-learn
How to evaluate a classifier in scikit-learn
Data School
22 What is pandas? (Introduction to the Q&A series)
What is pandas? (Introduction to the Q&A series)
Data School
23 How do I read a tabular data file into pandas?
How do I read a tabular data file into pandas?
Data School
24 How do I select a pandas Series from a DataFrame?
How do I select a pandas Series from a DataFrame?
Data School
25 Why do some pandas commands end with parentheses (and others don't)?
Why do some pandas commands end with parentheses (and others don't)?
Data School
26 How do I rename columns in a pandas DataFrame?
How do I rename columns in a pandas DataFrame?
Data School
27 How do I remove columns from a pandas DataFrame?
How do I remove columns from a pandas DataFrame?
Data School
28 How do I sort a pandas DataFrame or a Series?
How do I sort a pandas DataFrame or a Series?
Data School
29 How do I filter rows of a pandas DataFrame by column value?
How do I filter rows of a pandas DataFrame by column value?
Data School
30 How do I apply multiple filter criteria to a pandas DataFrame?
How do I apply multiple filter criteria to a pandas DataFrame?
Data School
31 Your pandas questions answered!
Your pandas questions answered!
Data School
32 How do I use the "axis" parameter in pandas?
How do I use the "axis" parameter in pandas?
Data School
33 How do I use string methods in pandas?
How do I use string methods in pandas?
Data School
34 How do I change the data type of a pandas Series?
How do I change the data type of a pandas Series?
Data School
35 When should I use a "groupby" in pandas?
When should I use a "groupby" in pandas?
Data School
36 How do I explore a pandas Series?
How do I explore a pandas Series?
Data School
How do I handle missing values in pandas?
How do I handle missing values in pandas?
Data School
38 What do I need to know about the pandas index? (Part 1)
What do I need to know about the pandas index? (Part 1)
Data School
39 What do I need to know about the pandas index? (Part 2)
What do I need to know about the pandas index? (Part 2)
Data School
40 How do I select multiple rows and columns from a pandas DataFrame?
How do I select multiple rows and columns from a pandas DataFrame?
Data School
41 Machine Learning with Text in scikit-learn (PyCon 2016)
Machine Learning with Text in scikit-learn (PyCon 2016)
Data School
42 When should I use the "inplace" parameter in pandas?
When should I use the "inplace" parameter in pandas?
Data School
43 How do I make my pandas DataFrame smaller and faster?
How do I make my pandas DataFrame smaller and faster?
Data School
44 How do I use pandas with scikit-learn to create Kaggle submissions?
How do I use pandas with scikit-learn to create Kaggle submissions?
Data School
45 More of your pandas questions answered!
More of your pandas questions answered!
Data School
46 How do I create dummy variables in pandas?
How do I create dummy variables in pandas?
Data School
47 How do I work with dates and times in pandas?
How do I work with dates and times in pandas?
Data School
48 How do I find and remove duplicate rows in pandas?
How do I find and remove duplicate rows in pandas?
Data School
49 How do I avoid a SettingWithCopyWarning in pandas?
How do I avoid a SettingWithCopyWarning in pandas?
Data School
50 How do I change display options in pandas?
How do I change display options in pandas?
Data School
51 How do I create a pandas DataFrame from another object?
How do I create a pandas DataFrame from another object?
Data School
52 How do I apply a function to a pandas Series or DataFrame?
How do I apply a function to a pandas Series or DataFrame?
Data School
53 Getting started with machine learning in Python (webcast)
Getting started with machine learning in Python (webcast)
Data School
54 Q&A about Machine Learning with Text (online course)
Q&A about Machine Learning with Text (online course)
Data School
55 Your pandas questions answered! (webcast)
Your pandas questions answered! (webcast)
Data School
56 Machine Learning with Text in scikit-learn (PyData DC 2016)
Machine Learning with Text in scikit-learn (PyData DC 2016)
Data School
57 Write Pythonic Code for Better Data Science (webcast)
Write Pythonic Code for Better Data Science (webcast)
Data School
58 Web scraping in Python (Part 1): Getting started
Web scraping in Python (Part 1): Getting started
Data School
59 Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Data School
60 Web scraping in Python (Part 3): Building a dataset
Web scraping in Python (Part 3): Building a dataset
Data School

This video teaches how to handle missing values in pandas, including detecting, dropping, and filling missing values. It covers the basics of missing value representation in pandas and provides options for handling them. By the end of this video, viewers will be able to detect, drop, and fill missing values in their datasets.

Key Takeaways
  1. Import pandas library
  2. Read in a CSV file into a pandas data frame
  3. Use isnull() method to detect missing values
  4. Sum isnull() method over columns to count missing values
  5. Use dropna() method to drop rows with missing values
  6. Use fillna() method to fill missing values with specified values
💡 The pandas library provides efficient methods for handling missing values, including detecting, dropping, and filling them, which is essential for data preprocessing and preparation for machine learning models.

Related AI Lessons

The AI Moat Paradox: The Better Models Become, the Less Models Matter
The AI moat paradox suggests that as AI models improve, their importance may decrease, and understanding this concept is crucial for AI professionals and businesses.
Medium · AI
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Discover the biggest AI research shifts of 2026 based on 170,927 papers, and learn how to apply these trends to your work
Medium · Machine Learning
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Discover the major research shifts in AI from 170,927 papers published in the first half of 2026, and learn how to analyze trends in AI research
Medium · Data Science
[PoV] When Everyone Is Smart, No One Is
In a world where AI makes everyone smart, the value of intelligence decreases, and new challenges arise
Medium · AI
Up next
‘ENOUGH IS ENOUGH’: Lebanon is STANDING UP to Iran, expert says
Fox Business
Watch →