How do I make my pandas DataFrame smaller and faster?

Data School · Beginner ·📰 AI News & Updates ·10y ago

Key Takeaways

The video demonstrates how to reduce the memory footprint and improve the efficiency of a pandas DataFrame by converting object columns to category type, introduced in pandas 0.15, and shows how to use the category data type to store strings as integers, reducing memory usage and speeding up computations.

Full Transcript

Hello and welcome back to my Q&A video series about the pandas library in Python. And the question for today comes from a YouTube viewer who says, "This is something I've read in pbython.com and the author uses type category like this. Can you explain to me what does the type category do and how to properly use it?" Okay, excellent question. So, as always, uh we will start uh by importing pandas as pd and then uh we need an example data set. So, we'll say drinks equals pdread csv and it's um bit.ly/drinks by country and this is a data set of um alcohol consumption by country. Okay. So, uh, drinks.head and you can see it's it's one country per row. And let's use the info data frame method which we have not seen before to get some more detail on this data frame. Okay. So, uh it tells us about the index. There are six columns. Uh how many non-null meaning nonm missing values are there in each column? and it tells us the data types. So we've got int, float, and object. So, uh, object usually means a string is being stored. But, um, you may not know that you can actually create a panda series of Python lists or a panda series of Python dictionaries. In other words, you can store arbitrary Python objects in a Pandanda series and pandas basically just stores a reference to that object. Okay. And the it just calls it type object. Okay. The thing I want to focus on is this memory usage at the bottom and it says 10.6 plus kilobyt. Okay. Now, why is there a plus there? Well, this is telling us that the data frame takes at least 10 kilobytes of memory. Pretty small data frame, but why the plus? And here's the reason. Um, because object columns are references to other objects, pandas wants this info method to run fast. So, it doesn't actually go out and look at the objects and figure out how much space they take. It actually just uh figures out how much space the references to those objects takes. So, it's saying it's at least 10.6 kilob, but it might be a lot more depending upon what's in those object columns. And again, in this case, they're just the strings for country and continent. Okay, so turns out you can actually force pandas to count the true memory usage. Okay, so we just say drinks.info and uh memory usage equals deep. Okay, and you'll see the same stuff except it will tell us pandas actually looked at the object columns and figured out how much space they were taking. And uh it says 25.8 kilobytes. So more than double uh the actual estimate. Now this takes a little bit longer to run but it is accurate. Okay. Now you might look at that and think well how much space is each column taking? Okay, each series. And you can actually do that. So uh let's say we actually just use the method drinks.memory usage. Okay. And uh this time it will tell us the memory usage in bytes. So 1544 bytes, not kilobytes. Now the thing this doesn't tell us is just like drinks.info, it does not kind of inspect those object columns by default. So if you want the real numbers, you have to say deep equals true. Okay. And now we've got the actual size in bytes uh used by each of those series. Okay. So uh and and one final thing I want to mention about this uh memory usage um is because uh this outputs a pandas series right um this is just the index of the series and these are the values. we can actually say just dotsum and we'll get about 25 kilobytes which is what we saw up here. Okay. So, uh the bottom line of trying to what I'm trying to tell you is that object columns can take up a lot of space. So, the question is uh especially if this was a really large data set and you know um we were storing too much uh the the data frame was growing too too large. How can I reduce that? How can I be more space efficient, especially with object columns? Well, think about this. What if we were able to store our strings as integers? Because integers are more space efficient than strings. Now, uh what do I mean by this? Well, think about this, okay? Um there are only six unique values in the continent series, for example. So drinks.cont continent do unique series method and I'm going to sort it. That's just a Python function. And you can see that here are the six unique values of continent. So let's pretend that instead of storing strings, we said okay I will store a zero to mean Africa, a one to mean Asia, a two to mean Europe, a three to mean North America and so on. Okay. Now by doing that we're only storing integers. So uh just to make sure this is very clear let's just do drinks.continent continent continent. So, uh if this is stored as strings, but if I wanted to store them in as integers, I would just say something like, uh, okay, Asia, that's one, this is two, this is zero, this is two, this is zero. Okay? And so on. Now, if we did this, if we invented this system, then you'd still have to store like a little lookup table that said when I say zero, I mean Africa. So, um uh we would basically we would still have to store the strings, but we would only have to store them once. Okay. Now if you like this idea um thankfully we don't have to implement this ourselves because pandas has created this system for us. There's a category data type and this was introduced in pandas 0.15 and I'll show you how to convert a column an object column to the category type. So we're going to say drinks bracket continent equals drinks.continent dot as type category. Okay, that's all. Okay, so let's check out the dtypes to see what we've got now. And you can see that country is still object but continent is now type category. Okay. And uh if you it still looks the same. So if you say drinks.cont continent.the head still looks the same here. But now notice you've got this thing down here. It says categories six. And it lists them. Here are the six categories. Okay. And I'm telling you that under the hood um it is storing these strings as integers. Okay. And here's kind of like a representation of that lookup table. Now I can prove to you uh the storage. So um I'm going to say drinks.continent.cat.codes.head. Okay. Let's deconstruct this continent series. And then you'll remember from a previous video how when you're like using a string method for a series, you would say like drinks.cont continent stircontains or something like that. Well, with categorical stuff, it's the same way. I say drinks.continent.cat and then there's some other stuff in there. So I can say dot and I hit tab and there's a bunch of things you can do. Okay, so I'm saying.cat.code. codes and let's just go ahead and look at the head. Okay. And you can see just like I was saying 1 12 020 that is exactly how pandas is now representing the continent series as integers. Okay. Now we talked about doing this to reduce our memory usage. Well let's take a look. uh drinks.memory usage deep equals true. Okay, look at that. Before continent was over 9,000 bytes, now it's less than 500 bytes. Okay. Now, to be very clear, um, instead of storing 193 strings, we're now storing 193 integers that point to a lookup table of six strings that says zero is Africa, one is Asia. So, the strings only have to be stored once, the rest is integer storage, thus much more space efficient. Okay. Now let's try to repeat this for the other uh um object column country. So let's just say drinks bracket uh country equals drinks.count as type category. Okay. And uh let's check the memory usage. Drinks domemory usage deep equals true. And you might have sensed this wasn't going to go very well. When we look at country, it has now gotten larger. Why is that? Well, if you remember the country series, every country was a different string. Okay? So, we created 193 categories. Okay? So, if we say drinks.count country.cat.categories, it's an attribute. You'll see there are 193 categories. So, previously uh before we converted it to a category, it was strings and we were storing 193 strings. Now, we're storing 193 integers, which are very small, but it points to a lookup table of 193 strings. So we're actually spending more memory than before to store the same thing. So the bottom line here is you use the category data type when you have when you have an object column of strings that only has a few different values. Okay, not a ton of different values. Now besides just reducing your memory usage when uh for continent at least okay it actually speeds up your computations. So if you are working with an object column of strings and you're doing computations with it, say you're you're doing a group by with that column, it will actually speed up your operations um uh if you use the category data type. So simply converting a column to a category um will not only save you space as long as there aren't too many values uh unique values and it will speed up computations. It is such an easy way uh to make your data frame uh smaller and faster. Okay. Now, uh, as always, um, I want to end with a bonus. And, um, to do this bonus, I need to create just a very small dataf frame. And so, I'm just going to say df equals pd.dataframe. And we'll talk more about uh, how to use pd.dataframe in a future video. Um but I'm actually going to pass it a dictionary in which the key is the column name and the values are uh it's just a list of values I want to put in the data frame. So uh and you'll see how this looks in a second three and uh we're going to get quality. Okay. uh list of strings. Good. Uh very good. Uh good. And excellent. Okay. Create my data frame and let's print it out. Okay. So, I've created a data frame with two columns and four rows. Don't worry so much about this code. Uh this is just an example data frame. And you can pretend that these the ID column are like ID numbers for some sort of uh item. Um so it's a ring or a stone or who knows. And the quality column is a rating of the quality of that item. Okay, kind of makes sense. Uh what if and you'll see in a moment why we're doing this. What if I said um df.sort sort values quality meaning I want to sort this data frame by the quality series. Well, how does it sort? Well, it seems kind of obvious. It sorts in alphabetical order. It says excellent, good, good, very good. So, E, G, G, V. Okay. Now, here's the thing. there's a logical ordering to these categories, right? Um, how can we tell pandas that there is actually a logical ordering? Well, here's what we're going to do. We're going to use the category data type and we're going to define ordered categories. Okay? So, df bracket quality equals df.quality quality dot as type category and then we're going to say two more things. So we're converting it to type category. We're going to say categories equals and we're going to say good, very good, very good, excellent. Okay. So, we're telling it the ordering the logical ordering of the categories from like lowest to highest kind of that's how it that's how we specify and then we say ordered equals true. Okay. So, if you look at the series, it still looks the same. Okay. Remember, we didn't we didn't change the sorting in place. So, DF.quality Quality is still good, very good, good, excellent. Just like here. Okay. However, it now tells you for the categories, good is less than very good is less than excellent. This is now ordered categories. And here's the cool part. When you've defined ordered categories, if you sort things, sort values, uh, quality, it now sorts them in logical order. So from good, good, very good, excellent. So it's not sorting by letter, it's sorting by um the ordering we defined for the categories. And probably the coolest thing about this in my opinion is that you can now use boolean conditions uh with this such as df.lo and what I mean is let's say I want to see all the rows where the quality is better than good. What rows do I want? df df.quality is greater than good. and I want to see all columns. And check that out. We only see the very good and excellent. And it did that. You can now use uh these comparison operators with the string because the categories are ordered and it understands that very good and excellent are greater than good. Okay. So, uh, that's it for today. Um, as always, uh, please click subscribe if you'd like to see more videos like this. Uh, let me know a question or a comment in the comment section below and maybe I will make a video answering your question. Um, but again, that's it. Uh, thanks so much for joining me and I hope to see you again soon.

Original Description

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it. SUBSCRIBE to learn data science with Python: https://www.youtube.com/dataschool?sub_confirmation=1 JOIN the "Data School Insiders" community and receive exclusive rewards: https://www.patreon.com/dataschool == RESOURCES == GitHub repository for the series: https://github.com/justmarkham/pandas-videos "info" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html "memory_usage" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html "astype" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html Overview of categorical data in pandas: http://pandas.pydata.org/pandas-docs/stable/categorical.html API reference for categorical methods: http://pandas.pydata.org/pandas-docs/stable/api.html#categorical == LET'S CONNECT! == Newsletter: https://www.dataschool.io/subscribe/ Twitter: https://twitter.com/justmarkham Facebook: https://www.facebook.com/DataScienceSchool/ LinkedIn: https://www.linkedin.com/in/justmarkham/
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Data School · Data School · 43 of 60

1 Setting up Git and GitHub
Setting up Git and GitHub
Data School
2 Navigating a GitHub Repository - Part 1
Navigating a GitHub Repository - Part 1
Data School
3 Forking a GitHub Repository
Forking a GitHub Repository
Data School
4 Creating a New GitHub Repository
Creating a New GitHub Repository
Data School
5 Copying a GitHub Repository to Your Local Computer
Copying a GitHub Repository to Your Local Computer
Data School
6 Committing Changes in Git and Pushing to a GitHub Repository
Committing Changes in Git and Pushing to a GitHub Repository
Data School
7 Syncing Your GitHub Fork
Syncing Your GitHub Fork
Data School
8 Allstate Purchase Prediction Challenge on Kaggle
Allstate Purchase Prediction Challenge on Kaggle
Data School
9 Troubleshooting: Updates Rejected When Pushing to GitHub
Troubleshooting: Updates Rejected When Pushing to GitHub
Data School
10 Hands-on dplyr tutorial for faster data manipulation in R
Hands-on dplyr tutorial for faster data manipulation in R
Data School
11 ROC Curves and Area Under the Curve (AUC) Explained
ROC Curves and Area Under the Curve (AUC) Explained
Data School
12 Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Data School
13 What is machine learning, and how does it work?
What is machine learning, and how does it work?
Data School
14 Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Data School
15 Getting started in scikit-learn with the famous iris dataset
Getting started in scikit-learn with the famous iris dataset
Data School
16 Training a machine learning model with scikit-learn
Training a machine learning model with scikit-learn
Data School
17 Comparing machine learning models in scikit-learn
Comparing machine learning models in scikit-learn
Data School
18 Data science in Python: pandas, seaborn, scikit-learn
Data science in Python: pandas, seaborn, scikit-learn
Data School
19 Selecting the best model in scikit-learn using cross-validation
Selecting the best model in scikit-learn using cross-validation
Data School
20 How to find the best model parameters in scikit-learn
How to find the best model parameters in scikit-learn
Data School
21 How to evaluate a classifier in scikit-learn
How to evaluate a classifier in scikit-learn
Data School
22 What is pandas? (Introduction to the Q&A series)
What is pandas? (Introduction to the Q&A series)
Data School
23 How do I read a tabular data file into pandas?
How do I read a tabular data file into pandas?
Data School
24 How do I select a pandas Series from a DataFrame?
How do I select a pandas Series from a DataFrame?
Data School
25 Why do some pandas commands end with parentheses (and others don't)?
Why do some pandas commands end with parentheses (and others don't)?
Data School
26 How do I rename columns in a pandas DataFrame?
How do I rename columns in a pandas DataFrame?
Data School
27 How do I remove columns from a pandas DataFrame?
How do I remove columns from a pandas DataFrame?
Data School
28 How do I sort a pandas DataFrame or a Series?
How do I sort a pandas DataFrame or a Series?
Data School
29 How do I filter rows of a pandas DataFrame by column value?
How do I filter rows of a pandas DataFrame by column value?
Data School
30 How do I apply multiple filter criteria to a pandas DataFrame?
How do I apply multiple filter criteria to a pandas DataFrame?
Data School
31 Your pandas questions answered!
Your pandas questions answered!
Data School
32 How do I use the "axis" parameter in pandas?
How do I use the "axis" parameter in pandas?
Data School
33 How do I use string methods in pandas?
How do I use string methods in pandas?
Data School
34 How do I change the data type of a pandas Series?
How do I change the data type of a pandas Series?
Data School
35 When should I use a "groupby" in pandas?
When should I use a "groupby" in pandas?
Data School
36 How do I explore a pandas Series?
How do I explore a pandas Series?
Data School
37 How do I handle missing values in pandas?
How do I handle missing values in pandas?
Data School
38 What do I need to know about the pandas index? (Part 1)
What do I need to know about the pandas index? (Part 1)
Data School
39 What do I need to know about the pandas index? (Part 2)
What do I need to know about the pandas index? (Part 2)
Data School
40 How do I select multiple rows and columns from a pandas DataFrame?
How do I select multiple rows and columns from a pandas DataFrame?
Data School
41 Machine Learning with Text in scikit-learn (PyCon 2016)
Machine Learning with Text in scikit-learn (PyCon 2016)
Data School
42 When should I use the "inplace" parameter in pandas?
When should I use the "inplace" parameter in pandas?
Data School
How do I make my pandas DataFrame smaller and faster?
How do I make my pandas DataFrame smaller and faster?
Data School
44 How do I use pandas with scikit-learn to create Kaggle submissions?
How do I use pandas with scikit-learn to create Kaggle submissions?
Data School
45 More of your pandas questions answered!
More of your pandas questions answered!
Data School
46 How do I create dummy variables in pandas?
How do I create dummy variables in pandas?
Data School
47 How do I work with dates and times in pandas?
How do I work with dates and times in pandas?
Data School
48 How do I find and remove duplicate rows in pandas?
How do I find and remove duplicate rows in pandas?
Data School
49 How do I avoid a SettingWithCopyWarning in pandas?
How do I avoid a SettingWithCopyWarning in pandas?
Data School
50 How do I change display options in pandas?
How do I change display options in pandas?
Data School
51 How do I create a pandas DataFrame from another object?
How do I create a pandas DataFrame from another object?
Data School
52 How do I apply a function to a pandas Series or DataFrame?
How do I apply a function to a pandas Series or DataFrame?
Data School
53 Getting started with machine learning in Python (webcast)
Getting started with machine learning in Python (webcast)
Data School
54 Q&A about Machine Learning with Text (online course)
Q&A about Machine Learning with Text (online course)
Data School
55 Your pandas questions answered! (webcast)
Your pandas questions answered! (webcast)
Data School
56 Machine Learning with Text in scikit-learn (PyData DC 2016)
Machine Learning with Text in scikit-learn (PyData DC 2016)
Data School
57 Write Pythonic Code for Better Data Science (webcast)
Write Pythonic Code for Better Data Science (webcast)
Data School
58 Web scraping in Python (Part 1): Getting started
Web scraping in Python (Part 1): Getting started
Data School
59 Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Data School
60 Web scraping in Python (Part 3): Building a dataset
Web scraping in Python (Part 3): Building a dataset
Data School

This video teaches how to reduce the memory footprint and improve the efficiency of a pandas DataFrame by converting object columns to category type, and demonstrates how to use ordered categories for logical sorting and filtering. By applying these techniques, you can significantly improve the performance of your pandas DataFrames.

Key Takeaways
  1. Use the memory_usage method with the deep parameter to estimate the actual memory usage of each column
  2. Convert object columns to category type using the as_type method
  3. Define ordered categories to specify a logical ordering of categories
  4. Sort data in a logical order using ordered categories
  5. Use boolean conditions with ordered categories for filtering
💡 Converting object columns to category type can significantly reduce memory usage and improve computation efficiency in pandas DataFrames

Related AI Lessons

The AI Moat Paradox: The Better Models Become, the Less Models Matter
The AI moat paradox suggests that as AI models improve, their importance may decrease, and understanding this concept is crucial for AI professionals and businesses.
Medium · AI
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Discover the biggest AI research shifts of 2026 based on 170,927 papers, and learn how to apply these trends to your work
Medium · Machine Learning
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Discover the major research shifts in AI from 170,927 papers published in the first half of 2026, and learn how to analyze trends in AI research
Medium · Data Science
[PoV] When Everyone Is Smart, No One Is
In a world where AI makes everyone smart, the value of intelligence decreases, and new challenges arise
Medium · AI
Up next
‘ENOUGH IS ENOUGH’: Lebanon is STANDING UP to Iran, expert says
Fox Business
Watch →