In this article you’ll find some tips to reduce the amount of RAM used when working with pandas, the fundamental Python library for data analysis and data manipulation.
When dealing with large(ish) datasets, reducing memory usage is something you need to consider when you’re stretching towards the limits of a single machine. For example, when you try to load data from a big CSV file, you want to avoid your program crashing with a MemoryError. These tips can also help speed up some downstream analytical queries.
The overall strategy boils down to choosing the right data types and loading only what you need. In this article you’ll learn about:
- Finding out how much memory is used
- Saving memory using categories
- Saving memory using smaller number representations
- Saving memory using sparse data (when you have a lot of NaN)
- Choosing the right dtypes when loading the data
- Loading only the columns you need
- Loading only a subset of rows
Finding out how much memory is used
First, let’s look into some simple steps to observe how much memory is taken by a pandas DataFrame.
For the examples I’m using a dataset about Olympic history from Kaggle. The dataset is in CSV format and takes roughly 40 MB on disk.
>>> import pandas as pd
>>> athletes = pd.read_csv('athlete_events.csv')
>>> athletes.shape
(271116, 15)
There are ~271K records with 15 columns.
For a breakdown of the memory usage, column by column, we can use memory_usage() on the whole DataFrame. The memory is reported in bytes:
>>> athletes.memory_usage(deep=True)
Index 128
ID 2168928
Name 20697535
Sex 15724728
Age 2168928
Height 2168928
Weight 2168928
Team 17734961
NOC 16266960
Games 18435888
Year 2168928
Season 17080308
City 17563109
Sport 18031019
Event 24146495
Medal 9882241
dtype: int64
The function also works for a single column:
>>> athletes['Name'].memory_usage(deep=True)
20697663
The difference between the two outputs is due to the memory taken by the index: when calling the function on the whole DataFrame, the Index has its own entry (128 bytes), while for a single column (i.e. a pandas Series) the memory used by the index is included in the single figure.
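To see this in isolation, memory_usage() on a Series also accepts index=False, which excludes the index bytes from the figure. A small sketch, using a synthetic Series rather than the Kaggle dataset:

```python
import pandas as pd

names = pd.Series(["Smith", "Jones", "Taylor"] * 1000)

with_index = names.memory_usage(deep=True)               # index included (the default)
data_only = names.memory_usage(deep=True, index=False)   # data bytes only

# the gap is exactly the memory taken by the RangeIndex
assert with_index - data_only == names.index.memory_usage()
```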
For an aggregated figure on the whole table, we can simply sum:
>>> athletes.memory_usage(deep=True).sum()
186408012 # roughly 178 MB
Why do we need deep=True? This flag will introspect the data deeply, reporting the actual system-level memory consumption. Without setting this flag, the function returns an estimate which could be quite far from the actual number, for example:
>>> athletes.memory_usage()
Index 128
ID 2168928
Name 2168928
Sex 2168928
Age 2168928
Height 2168928
Weight 2168928
Team 2168928
NOC 2168928
Games 2168928
Year 2168928
Season 2168928
City 2168928
Sport 2168928
Event 2168928
Medal 2168928
dtype: int64
>>> athletes['Name'].memory_usage()
2169056
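The gap comes from how object columns are stored: the column itself only holds 8-byte pointers (which is all the shallow estimate counts), while the string objects they point to live elsewhere on the heap and are only counted with deep=True. A quick sketch on synthetic data:

```python
import pandas as pd

s = pd.Series(["Gold", "Silver", "Bronze"] * 1000)

shallow = s.memory_usage()        # counts only the 8-byte pointers in the column
deep = s.memory_usage(deep=True)  # also counts the string objects themselves

print(shallow, deep)
```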
Another way of getting the overall memory consumption is through the function info(), which is going to be useful because it also gives us information on the data types (dtype) used by the DataFrame. Notice again the use of deep introspection for the memory usage:
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null int64
1 Name 271116 non-null object
2 Sex 271116 non-null object
3 Age 261642 non-null float64
4 Height 210945 non-null float64
5 Weight 208241 non-null float64
6 Team 271116 non-null object
7 NOC 271116 non-null object
8 Games 271116 non-null object
9 Year 271116 non-null int64
10 Season 271116 non-null object
11 City 271116 non-null object
12 Sport 271116 non-null object
13 Event 271116 non-null object
14 Medal 39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 177.8 MB
Notice how all the string fields are loaded as object, while all the numerical fields use a 64-bit representation, the default on most modern (64-bit) machines, even when the values would fit in a much smaller type.
Saving memory using categories
Some of the variables in our dataset are categorical, meaning they only have a handful of possible values. Rather than using a generic object for these variables, when appropriate we can use the more relevant Categorical dtype in pandas. For example, good candidates for this data type include the variables Medal, Season, or Team, amongst others.
If you don’t have a full description of the data, you can decide which columns to treat as categorical by looking at the number of unique values and confirming it is much smaller than the dataset size:
>>> athletes['Medal'].unique()
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)
There are only three different values, plus the null value nan.
Observe the difference in memory consumption between using object and using categories:
>>> athletes['Medal'].memory_usage(deep=True)
9882369 # roughly 9.4 MB
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539 # roughly 0.26 MB
Besides saving memory, another advantage of using categorical data in pandas is that we can include a notion of logical order between the values, different from the lexical order.
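For instance, we can declare that Bronze < Silver < Gold by building an ordered CategoricalDtype. A minimal sketch, on a synthetic medal Series:

```python
import pandas as pd

medal = pd.Series(["Gold", "Bronze", None, "Silver", "Bronze"])

# declare a logical order: Bronze < Silver < Gold
medal_dtype = pd.CategoricalDtype(categories=["Bronze", "Silver", "Gold"], ordered=True)
medal_cat = medal.astype(medal_dtype)

# ordered categories support comparisons and a meaningful min/max
print(medal_cat.max())        # 'Gold'
print(medal_cat > "Bronze")   # element-wise comparison against a category
```

With a plain object column, min/max would fall back to lexical order (Bronze < Gold < Silver), which is not what we want here.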
Saving memory using smaller number representations
Let’s look at some numerical variables, for example ID (int64), Height (float64) and Year (int64).
You can observe their range by checking the minimum and maximum values:
>>> athletes['ID'].min(), athletes['ID'].max()
(1, 135571)
The int64 dtype can hold numbers over a much broader range, at the price of a much bigger memory footprint:
>>> import numpy as np
>>> np.iinfo('int64') # integer info
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
Using int32 for the column ID is enough to store its values and it will save us half of the memory space:
>>> athletes['ID'].memory_usage(deep=True)
2169056
>>> athletes['ID'].astype('int32').memory_usage(deep=True)
1084592
We can do the same with floats:
>>> athletes['Height'].min(), athletes['Height'].max()
(127.0, 226.0)
In this case, a float16 is enough, and costs a quarter of the memory price:
>>> athletes['Height'].memory_usage(deep=True)
2169056
>>> athletes['Height'].astype('float16').memory_usage(deep=True)
542360
Finally, let’s look at the variable Year:
>>> athletes['Year'].min(), athletes['Year'].max()
(1896, 2016)
In this case, it looks like an int16 would be enough. On a closer look though, we can consider the Year column to be categorical as well: there’s only a handful of possible values. Let’s check the difference:
>>> athletes['Year'].memory_usage(deep=True)
2169056
>>> athletes['Year'].astype('int16').memory_usage(deep=True)
542360
>>> athletes['Year'].astype('category').memory_usage(deep=True)
272596
For this particular situation, it makes more sense to use categories rather than numbers, unless we plan on performing arithmetic operations on this column (you cannot sum or multiply two categories).
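Rather than working out the smallest dtype by hand for every column, we can also let pandas do it: pd.to_numeric() with the downcast argument picks the smallest representation that fits the observed min/max. A sketch, using a synthetic Series with the same 1..135571 range as the ID column:

```python
import pandas as pd

ids = pd.Series(range(1, 135572), dtype="int64")

# let pandas pick the smallest integer dtype that fits the values
small = pd.to_numeric(ids, downcast="integer")
print(small.dtype)  # int32: 135571 overflows int16, fits int32
```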
Saving memory using sparse data (when you have a lot of NaN)
The sparse dtypes in pandas are useful when dealing with columns that have a lot of null values. Depending on your variables, you may want to consider representing your data as sparse. The info() function used earlier tells us how many non-null records we have for each column, so if that number is much lower than the size of the dataset, it means we have a lot of null values.
This is exactly the case of the Medal column that we treated as categorical earlier:
>>> athletes['Medal'].memory_usage(deep=True)
9882369
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539
>>> athletes['Medal'].astype('Sparse[category]').memory_usage(deep=True)
199067
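The saving comes from the fact that a sparse column only stores the non-fill values (plus their positions), not the NaN entries. You can check how sparse a column really is through the density attribute of the sparse accessor. A small sketch, on a synthetic numeric column that is 96% NaN:

```python
import numpy as np
import pandas as pd

# a column where 96% of the values are missing, as with Medal
s = pd.Series([1.0] * 4 + [np.nan] * 96)
sparse = s.astype("Sparse[float64]")

# density is the fraction of entries that are actually stored
print(sparse.sparse.density)   # 0.04
```

The lower the density, the bigger the saving compared to the dense representation.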
Choosing the right dtypes when loading the data
So far we have looked at the memory usage of different dtypes, converting the columns after the dataset was loaded.
Once we have chosen the desired dtypes, we can make sure they are used when loading the data, by passing the schema as a dictionary to the read_csv() function:
>>> schema = {
... 'ID': 'int32',
... 'Height': 'float16',
... # add all your Column->dtype mappings
... }
>>> athletes = pd.read_csv('athlete_events.csv', dtype=schema)
Note: it’s not possible to use the Sparse dtype when loading the data in this way; we still need to convert the sparse columns after the dataset is loaded.
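The resulting two-step pattern looks like this. For a self-contained sketch, the snippet below reads from a small inline CSV standing in for athlete_events.csv:

```python
import io

import pandas as pd

# a tiny inline stand-in for the real CSV file
csv = io.StringIO("ID,Weight\n1,64.0\n2,\n3,\n4,80.0\n")

# read_csv accepts the regular dtypes in the schema...
df = pd.read_csv(csv, dtype={"ID": "int32", "Weight": "float32"})

# ...while the sparse conversion has to happen as a second step
df["Weight"] = df["Weight"].astype("Sparse[float32]")
print(df.dtypes)
```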
Loading only the columns you need
Depending on the application, we often don’t need the full set of columns in memory.
In our medal example, let’s say we simply want to compute the overall count of the medals per nation. For this specific use case, we only need to look at the columns Medal and NOC (National Olympic Committee).
We can pass the argument usecols to the read_csv() function:
>>> athletes = pd.read_csv('athlete_events.csv', usecols=['NOC', 'Medal'], dtype=schema)
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NOC 271116 non-null category
1 Medal 39783 non-null category
dtypes: category(2)
memory usage: 816.3 KB
If we only want to count the medals per NOC, we can use a groupby operation on the above DataFrame:
>>> athletes.groupby('NOC')['Medal'].count()
# (long output omitted)
It’s also worth noting that for this groupby operation there is a significant speed-up when using the categorical data types.
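On a toy DataFrame, the medal count per committee looks like this; note that count() skips the NaN entries, so only actual medals are tallied (observed=True limits the output to categories that appear in the data):

```python
import pandas as pd

df = pd.DataFrame({
    "NOC": pd.Categorical(["USA", "USA", "ITA", "FRA"]),
    "Medal": pd.Categorical(["Gold", None, "Bronze", None]),
})

# count() ignores NaN, so this tallies actual medals per committee
medal_counts = df.groupby("NOC", observed=True)["Medal"].count()
print(medal_counts)
```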
Loading only a subset of rows
To complete the picture, the read_csv() function also offers options to limit the number of rows we’re loading. This can be useful in a few circumstances, for example when we just want to take a peek at the data without looking at the whole dataset, or when the dataset is big enough that we can answer our analytical questions with a sample.
The first case, just taking a peek at the data, is straightforward:
>>> athletes = pd.read_csv('athlete_events.csv', nrows=1000)
>>> len(athletes)
1000
Using the nrows argument, we’ll load the first N records (1000 in the example above) into the DataFrame. This can often be enough to get a first feel for the data before digging into further analysis.
If we want to implement some random sampling instead, the read_csv() function also offers the skiprows argument. Specifically, if we pass a custom function to this argument, we can implement our sampling logic. The function takes one argument (the row number) and should return True if you want to skip that row.
In the example, we want to keep the first row because it has the column names, and we load only ~10% of the data, using the function random() which returns a random float in the [0, 1) range (if this number is greater than 0.1, we skip the row):
>>> from random import random
>>> def skip_record(row_number):
... return random() > 0.1 and row_number > 0
...
>>> athletes = pd.read_csv('athlete_events.csv', skiprows=skip_record)
>>> len(athletes)
27176 # 27K rows in the sample, ~271K in the full dataset
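Note that this sample changes on every run. If you want a reproducible sample, one option is to draw from a seeded random generator instead of the module-level random(). A sketch, reading from a small inline CSV standing in for the real file (the seed value 0 is an arbitrary choice):

```python
import io
from random import Random

import pandas as pd

# inline stand-in for the CSV: a header row plus 100 data rows
csv = io.StringIO("ID\n" + "\n".join(str(i) for i in range(100)))

rng = Random(0)  # fixing the seed makes the sample reproducible

def skip_record(row_number):
    # always keep row 0 (the header), then keep ~10% of the data rows
    return row_number > 0 and rng.random() > 0.1

sample = pd.read_csv(csv, skiprows=skip_record)
print(len(sample))
```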
Summary
In this article we have discussed some options to save memory with pandas, by choosing the most appropriate data types and by loading only the data we need for our analysis.
Do you need to upskill your team in pandas? Marco runs public and private training courses on Effective pandas and other Python topics, please get in touch to know more.
Sign up to the newsletter Musings on Data to receive periodic updates and recommendations from Marco on Data Science.
Follow Marco on Twitter.