How do I make my pandas DataFrame smaller and faster?
Skills:
ML Maths Basics80%
Key Takeaways
The video demonstrates how to reduce the memory footprint and improve the efficiency of a pandas DataFrame by converting object columns to category type, introduced in pandas 0.15, and shows how to use the category data type to store strings as integers, reducing memory usage and speeding up computations.
Full Transcript
Hello and welcome back to my Q&A video series about the pandas library in Python. And the question for today comes from a YouTube viewer who says, "This is something I've read in pbython.com and the author uses type category like this. Can you explain to me what does the type category do and how to properly use it?" Okay, excellent question. So, as always, uh we will start uh by importing pandas as pd and then uh we need an example data set. So, we'll say drinks equals pdread csv and it's um bit.ly/drinks by country and this is a data set of um alcohol consumption by country. Okay. So, uh, drinks.head and you can see it's it's one country per row. And let's use the info data frame method which we have not seen before to get some more detail on this data frame. Okay. So, uh it tells us about the index. There are six columns. Uh how many non-null meaning nonm missing values are there in each column? and it tells us the data types. So we've got int, float, and object. So, uh, object usually means a string is being stored. But, um, you may not know that you can actually create a panda series of Python lists or a panda series of Python dictionaries. In other words, you can store arbitrary Python objects in a Pandanda series and pandas basically just stores a reference to that object. Okay. And the it just calls it type object. Okay. The thing I want to focus on is this memory usage at the bottom and it says 10.6 plus kilobyt. Okay. Now, why is there a plus there? Well, this is telling us that the data frame takes at least 10 kilobytes of memory. Pretty small data frame, but why the plus? And here's the reason. Um, because object columns are references to other objects, pandas wants this info method to run fast. So, it doesn't actually go out and look at the objects and figure out how much space they take. It actually just uh figures out how much space the references to those objects takes. So, it's saying it's at least 10.6 kilob, but it might be a lot more depending upon what's in those object columns. And again, in this case, they're just the strings for country and continent. Okay, so turns out you can actually force pandas to count the true memory usage. Okay, so we just say drinks.info and uh memory usage equals deep. Okay, and you'll see the same stuff except it will tell us pandas actually looked at the object columns and figured out how much space they were taking. And uh it says 25.8 kilobytes. So more than double uh the actual estimate. Now this takes a little bit longer to run but it is accurate. Okay. Now you might look at that and think well how much space is each column taking? Okay, each series. And you can actually do that. So uh let's say we actually just use the method drinks.memory usage. Okay. And uh this time it will tell us the memory usage in bytes. So 1544 bytes, not kilobytes. Now the thing this doesn't tell us is just like drinks.info, it does not kind of inspect those object columns by default. So if you want the real numbers, you have to say deep equals true. Okay. And now we've got the actual size in bytes uh used by each of those series. Okay. So uh and and one final thing I want to mention about this uh memory usage um is because uh this outputs a pandas series right um this is just the index of the series and these are the values. we can actually say just dotsum and we'll get about 25 kilobytes which is what we saw up here. Okay. So, uh the bottom line of trying to what I'm trying to tell you is that object columns can take up a lot of space. So, the question is uh especially if this was a really large data set and you know um we were storing too much uh the the data frame was growing too too large. How can I reduce that? How can I be more space efficient, especially with object columns? Well, think about this. What if we were able to store our strings as integers? Because integers are more space efficient than strings. Now, uh what do I mean by this? Well, think about this, okay? Um there are only six unique values in the continent series, for example. So drinks.cont continent do unique series method and I'm going to sort it. That's just a Python function. And you can see that here are the six unique values of continent. So let's pretend that instead of storing strings, we said okay I will store a zero to mean Africa, a one to mean Asia, a two to mean Europe, a three to mean North America and so on. Okay. Now by doing that we're only storing integers. So uh just to make sure this is very clear let's just do drinks.continent continent continent. So, uh if this is stored as strings, but if I wanted to store them in as integers, I would just say something like, uh, okay, Asia, that's one, this is two, this is zero, this is two, this is zero. Okay? And so on. Now, if we did this, if we invented this system, then you'd still have to store like a little lookup table that said when I say zero, I mean Africa. So, um uh we would basically we would still have to store the strings, but we would only have to store them once. Okay. Now if you like this idea um thankfully we don't have to implement this ourselves because pandas has created this system for us. There's a category data type and this was introduced in pandas 0.15 and I'll show you how to convert a column an object column to the category type. So we're going to say drinks bracket continent equals drinks.continent dot as type category. Okay, that's all. Okay, so let's check out the dtypes to see what we've got now. And you can see that country is still object but continent is now type category. Okay. And uh if you it still looks the same. So if you say drinks.cont continent.the head still looks the same here. But now notice you've got this thing down here. It says categories six. And it lists them. Here are the six categories. Okay. And I'm telling you that under the hood um it is storing these strings as integers. Okay. And here's kind of like a representation of that lookup table. Now I can prove to you uh the storage. So um I'm going to say drinks.continent.cat.codes.head. Okay. Let's deconstruct this continent series. And then you'll remember from a previous video how when you're like using a string method for a series, you would say like drinks.cont continent stircontains or something like that. Well, with categorical stuff, it's the same way. I say drinks.continent.cat and then there's some other stuff in there. So I can say dot and I hit tab and there's a bunch of things you can do. Okay, so I'm saying.cat.code. codes and let's just go ahead and look at the head. Okay. And you can see just like I was saying 1 12 020 that is exactly how pandas is now representing the continent series as integers. Okay. Now we talked about doing this to reduce our memory usage. Well let's take a look. uh drinks.memory usage deep equals true. Okay, look at that. Before continent was over 9,000 bytes, now it's less than 500 bytes. Okay. Now, to be very clear, um, instead of storing 193 strings, we're now storing 193 integers that point to a lookup table of six strings that says zero is Africa, one is Asia. So, the strings only have to be stored once, the rest is integer storage, thus much more space efficient. Okay. Now let's try to repeat this for the other uh um object column country. So let's just say drinks bracket uh country equals drinks.count as type category. Okay. And uh let's check the memory usage. Drinks domemory usage deep equals true. And you might have sensed this wasn't going to go very well. When we look at country, it has now gotten larger. Why is that? Well, if you remember the country series, every country was a different string. Okay? So, we created 193 categories. Okay? So, if we say drinks.count country.cat.categories, it's an attribute. You'll see there are 193 categories. So, previously uh before we converted it to a category, it was strings and we were storing 193 strings. Now, we're storing 193 integers, which are very small, but it points to a lookup table of 193 strings. So we're actually spending more memory than before to store the same thing. So the bottom line here is you use the category data type when you have when you have an object column of strings that only has a few different values. Okay, not a ton of different values. Now besides just reducing your memory usage when uh for continent at least okay it actually speeds up your computations. So if you are working with an object column of strings and you're doing computations with it, say you're you're doing a group by with that column, it will actually speed up your operations um uh if you use the category data type. So simply converting a column to a category um will not only save you space as long as there aren't too many values uh unique values and it will speed up computations. It is such an easy way uh to make your data frame uh smaller and faster. Okay. Now, uh, as always, um, I want to end with a bonus. And, um, to do this bonus, I need to create just a very small dataf frame. And so, I'm just going to say df equals pd.dataframe. And we'll talk more about uh, how to use pd.dataframe in a future video. Um but I'm actually going to pass it a dictionary in which the key is the column name and the values are uh it's just a list of values I want to put in the data frame. So uh and you'll see how this looks in a second three and uh we're going to get quality. Okay. uh list of strings. Good. Uh very good. Uh good. And excellent. Okay. Create my data frame and let's print it out. Okay. So, I've created a data frame with two columns and four rows. Don't worry so much about this code. Uh this is just an example data frame. And you can pretend that these the ID column are like ID numbers for some sort of uh item. Um so it's a ring or a stone or who knows. And the quality column is a rating of the quality of that item. Okay, kind of makes sense. Uh what if and you'll see in a moment why we're doing this. What if I said um df.sort sort values quality meaning I want to sort this data frame by the quality series. Well, how does it sort? Well, it seems kind of obvious. It sorts in alphabetical order. It says excellent, good, good, very good. So, E, G, G, V. Okay. Now, here's the thing. there's a logical ordering to these categories, right? Um, how can we tell pandas that there is actually a logical ordering? Well, here's what we're going to do. We're going to use the category data type and we're going to define ordered categories. Okay? So, df bracket quality equals df.quality quality dot as type category and then we're going to say two more things. So we're converting it to type category. We're going to say categories equals and we're going to say good, very good, very good, excellent. Okay. So, we're telling it the ordering the logical ordering of the categories from like lowest to highest kind of that's how it that's how we specify and then we say ordered equals true. Okay. So, if you look at the series, it still looks the same. Okay. Remember, we didn't we didn't change the sorting in place. So, DF.quality Quality is still good, very good, good, excellent. Just like here. Okay. However, it now tells you for the categories, good is less than very good is less than excellent. This is now ordered categories. And here's the cool part. When you've defined ordered categories, if you sort things, sort values, uh, quality, it now sorts them in logical order. So from good, good, very good, excellent. So it's not sorting by letter, it's sorting by um the ordering we defined for the categories. And probably the coolest thing about this in my opinion is that you can now use boolean conditions uh with this such as df.lo and what I mean is let's say I want to see all the rows where the quality is better than good. What rows do I want? df df.quality is greater than good. and I want to see all columns. And check that out. We only see the very good and excellent. And it did that. You can now use uh these comparison operators with the string because the categories are ordered and it understands that very good and excellent are greater than good. Okay. So, uh, that's it for today. Um, as always, uh, please click subscribe if you'd like to see more videos like this. Uh, let me know a question or a comment in the comment section below and maybe I will make a video answering your question. Um, but again, that's it. Uh, thanks so much for joining me and I hope to see you again soon.
Original Description
Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.
SUBSCRIBE to learn data science with Python:
https://www.youtube.com/dataschool?sub_confirmation=1
JOIN the "Data School Insiders" community and receive exclusive rewards:
https://www.patreon.com/dataschool
== RESOURCES ==
GitHub repository for the series: https://github.com/justmarkham/pandas-videos
"info" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
"memory_usage" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html
"astype" documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html
Overview of categorical data in pandas: http://pandas.pydata.org/pandas-docs/stable/categorical.html
API reference for categorical methods: http://pandas.pydata.org/pandas-docs/stable/api.html#categorical
== LET'S CONNECT! ==
Newsletter: https://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
LinkedIn: https://www.linkedin.com/in/justmarkham/
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Data School · Data School · 43 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
▶
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Setting up Git and GitHub
Data School
Navigating a GitHub Repository - Part 1
Data School
Forking a GitHub Repository
Data School
Creating a New GitHub Repository
Data School
Copying a GitHub Repository to Your Local Computer
Data School
Committing Changes in Git and Pushing to a GitHub Repository
Data School
Syncing Your GitHub Fork
Data School
Allstate Purchase Prediction Challenge on Kaggle
Data School
Troubleshooting: Updates Rejected When Pushing to GitHub
Data School
Hands-on dplyr tutorial for faster data manipulation in R
Data School
ROC Curves and Area Under the Curve (AUC) Explained
Data School
Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Data School
What is machine learning, and how does it work?
Data School
Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Data School
Getting started in scikit-learn with the famous iris dataset
Data School
Training a machine learning model with scikit-learn
Data School
Comparing machine learning models in scikit-learn
Data School
Data science in Python: pandas, seaborn, scikit-learn
Data School
Selecting the best model in scikit-learn using cross-validation
Data School
How to find the best model parameters in scikit-learn
Data School
How to evaluate a classifier in scikit-learn
Data School
What is pandas? (Introduction to the Q&A series)
Data School
How do I read a tabular data file into pandas?
Data School
How do I select a pandas Series from a DataFrame?
Data School
Why do some pandas commands end with parentheses (and others don't)?
Data School
How do I rename columns in a pandas DataFrame?
Data School
How do I remove columns from a pandas DataFrame?
Data School
How do I sort a pandas DataFrame or a Series?
Data School
How do I filter rows of a pandas DataFrame by column value?
Data School
How do I apply multiple filter criteria to a pandas DataFrame?
Data School
Your pandas questions answered!
Data School
How do I use the "axis" parameter in pandas?
Data School
How do I use string methods in pandas?
Data School
How do I change the data type of a pandas Series?
Data School
When should I use a "groupby" in pandas?
Data School
How do I explore a pandas Series?
Data School
How do I handle missing values in pandas?
Data School
What do I need to know about the pandas index? (Part 1)
Data School
What do I need to know about the pandas index? (Part 2)
Data School
How do I select multiple rows and columns from a pandas DataFrame?
Data School
Machine Learning with Text in scikit-learn (PyCon 2016)
Data School
When should I use the "inplace" parameter in pandas?
Data School
How do I make my pandas DataFrame smaller and faster?
Data School
How do I use pandas with scikit-learn to create Kaggle submissions?
Data School
More of your pandas questions answered!
Data School
How do I create dummy variables in pandas?
Data School
How do I work with dates and times in pandas?
Data School
How do I find and remove duplicate rows in pandas?
Data School
How do I avoid a SettingWithCopyWarning in pandas?
Data School
How do I change display options in pandas?
Data School
How do I create a pandas DataFrame from another object?
Data School
How do I apply a function to a pandas Series or DataFrame?
Data School
Getting started with machine learning in Python (webcast)
Data School
Q&A about Machine Learning with Text (online course)
Data School
Your pandas questions answered! (webcast)
Data School
Machine Learning with Text in scikit-learn (PyData DC 2016)
Data School
Write Pythonic Code for Better Data Science (webcast)
Data School
Web scraping in Python (Part 1): Getting started
Data School
Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Data School
Web scraping in Python (Part 3): Building a dataset
Data School
More on: ML Maths Basics
View skill →Related AI Lessons
🎓
Tutor Explanation
DeepCamp AI