Tips for saving memory with pandas

In this article you’ll find some tips to reduce the amount of RAM used when working with pandas, the fundamental Python library for data analysis and data manipulation.

When dealing with large(ish) datasets, reducing the memory usage is something you need to consider if you’re stretching the limits of a single machine. For example, when you try to load some data from a big CSV file, you want to avoid your program crashing with a MemoryError. These tips can also help speed up some downstream analytical queries.

The overall strategy boils down to choosing the right data types and loading only what you need. In this article you’ll learn about:

  • Finding out how much memory is used
  • Saving memory using categories
  • Saving memory using smaller number representations
  • Saving memory using sparse data (when you have a lot of NaN)
  • Choosing the right dtypes when loading the data
  • Loading only the columns you need
  • Loading only a subset of rows

Finding out how much memory is used

First, let’s look into some simple steps to observe how much memory is taken by a pandas DataFrame.

For the examples I’m using a dataset about Olympic history from Kaggle. The dataset is in CSV format and takes roughly 40 MB on disk.

>>> import pandas as pd
>>> athletes = pd.read_csv('athlete_events.csv')
>>> athletes.shape
(271116, 15)

There are ~271K records with 15 columns.

For a breakdown of the memory usage, column by column, we can use memory_usage() on the whole DataFrame. The memory is reported in bytes:

>>> athletes.memory_usage(deep=True)
Index          128
ID         2168928
Name      20697535
Sex       15724728
Age        2168928
Height     2168928
Weight     2168928
Team      17734961
NOC       16266960
Games     18435888
Year       2168928
Season    17080308
City      17563109
Sport     18031019
Event     24146495
Medal      9882241
dtype: int64

The function also works for a single column:

>>> athletes['Name'].memory_usage(deep=True)
20697663

The difference between the two outputs is due to the memory taken by the index: when calling the function on the whole DataFrame, the Index has its own entry (128 bytes), while for a single column (i.e. a pandas Series) the memory used by the index is included in the reported figure.
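
If you want the per-column figure without the index contribution, the Series version of memory_usage() also accepts an index flag; a quick check (the number should match the Name entry in the DataFrame breakdown above):

>>> athletes['Name'].memory_usage(deep=True, index=False)
20697535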

For an aggregated figure on the whole table, we can simply sum:

>>> athletes.memory_usage(deep=True).sum()
186408012  # roughly 178 MB
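
If you prefer a figure already expressed in megabytes, you can divide by 1024**2; a small sketch (rounded to one decimal, matching the info() output we’ll see below):

>>> round(athletes.memory_usage(deep=True).sum() / 1024**2, 1)
177.8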

Why do we need deep=True? This flag introspects the data deeply, reporting the actual system-level memory consumption. Without it, the function returns an estimate based on the column dtype and the number of rows, which for object columns can be quite far from the actual figure, for example:

>>> athletes.memory_usage()
Index         128
ID        2168928
Name      2168928
Sex       2168928
Age       2168928
Height    2168928
Weight    2168928
Team      2168928
NOC       2168928
Games     2168928
Year      2168928
Season    2168928
City      2168928
Sport     2168928
Event     2168928
Medal     2168928
dtype: int64

>>> athletes['Name'].memory_usage()
2169056

Another way of getting the overall memory consumption is through the function info(), which is also useful because it reports the data type (dtype) of each column. Notice again the use of deep introspection for the memory usage:

>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 177.8 MB

Notice how all the string fields are loaded as object, while all the numerical fields default to a 64-bit representation, which is much wider than these particular values need.

Saving memory using categories

Some of the variables in our dataset are categorical, meaning they only have a handful of possible values. Rather than using a generic object for these variables, when appropriate we can use the more relevant Categorical dtype in pandas. For example, good candidates for this data type include the variables Medal, Season, or Team, amongst others.

If you don’t have a full description of the data, you can decide which columns to treat as categorical by looking at the number of unique values and checking that it is much smaller than the number of rows:

>>> athletes['Medal'].unique()
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)

There are only three different values, plus the null value nan.
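
To scan all the columns at once, nunique() counts the distinct values per column (nulls excluded); for example:

>>> athletes['Medal'].nunique()
3
>>> athletes.nunique()
# (long output omitted)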

Observe the difference in memory consumption between using object and using categories:

>>> athletes['Medal'].memory_usage(deep=True)
9882369  # roughly 9.4 MB
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539  # roughly 0.26 MB

Besides saving memory, another advantage of using categorical data in pandas is that we can include a notion of logical order between the values, different from the lexical order.
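
For instance, here is a minimal sketch of an ordered categorical for the Medal column, assuming the usual Bronze < Silver < Gold ranking:

>>> medal_type = pd.CategoricalDtype(categories=['Bronze', 'Silver', 'Gold'], ordered=True)
>>> medals = athletes['Medal'].astype(medal_type)
>>> medals.max()
'Gold'
>>> golds = medals[medals > 'Silver']  # comparisons follow the logical order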

Saving memory using smaller number representations

Let’s look at some numerical variables, for example ID (int64), Height (float64) and Year (int64).

You can observe their range by checking the minimum and maximum values:

>>> athletes['ID'].min(), athletes['ID'].max()
(1, 135571)

The int64 dtype is able to hold numbers on a much broader range, at the price of a much bigger memory footprint:

>>> import numpy as np
>>> np.iinfo('int64')  # integer info
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
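
For comparison, the int32 range is more than enough for IDs that only go up to 135571:

>>> np.iinfo('int32')
iinfo(min=-2147483648, max=2147483647, dtype=int32)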

Using int32 for the column ID is enough to store its values and it will save us half of the memory space:

>>> athletes['ID'].memory_usage(deep=True)
2169056
>>> athletes['ID'].astype('int32').memory_usage(deep=True)
1084592
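
If you’d rather not pick the width by hand, pd.to_numeric() can downcast to the smallest suitable type; a quick sketch (for ID this should come out as int32, since the values don’t fit in an int16):

>>> pd.to_numeric(athletes['ID'], downcast='integer').dtype
dtype('int32')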

We can do the same with floats:

>>> athletes['Height'].min(), athletes['Height'].max()
(127.0, 226.0)
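
As with the integers, numpy can report the range of the smaller float types; for example:

>>> float(np.finfo('float16').max)  # float info
65504.0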

In this case, a float16 is enough (bearing in mind that smaller floats also trade away some precision), and costs a quarter of the memory price:

>>> athletes['Height'].memory_usage(deep=True)
2169056
>>> athletes['Height'].astype('float16').memory_usage(deep=True)
542360

Finally, let’s look at the variable Year:

>>> athletes['Year'].min(), athletes['Year'].max()
(1896, 2016)

In this case, it looks like an int16 would be enough. On a closer look, though, we can consider the Year column to be categorical as well: there is only a handful of possible values. Let’s check the difference:

>>> athletes['Year'].memory_usage(deep=True)
2169056
>>> athletes['Year'].astype('int16').memory_usage(deep=True)
542360
>>> athletes['Year'].astype('category').memory_usage(deep=True)
272596

For this particular situation, it makes more sense to use categories rather than numbers, unless we plan on performing arithmetic operations on this column (you cannot sum or multiply two categories).

Saving memory using sparse data (when you have a lot of NaN)

The sparse dtypes in pandas are useful when dealing with columns that have a lot of null values. Depending on your variables, you may want to consider representing your data as sparse. The info() function used earlier tells us how many non-null records we have for each column: if that number is much lower than the size of the dataset, it means we have a lot of null values.
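
Equivalently, you can compute the fraction of null values per column directly:

>>> athletes.isna().mean()
# (long output omitted)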

This is exactly the case of the Medal column that we treated as categorical earlier:

>>> athletes['Medal'].memory_usage(deep=True)
9882369
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539
>>> athletes['Medal'].astype('Sparse[category]').memory_usage(deep=True)
199067

Choosing the right dtypes when loading the data

So far we have looked at the memory usage of different dtypes, converting the columns after the dataset was loaded.

Once we have chosen the desired dtypes, we can make sure they are used when loading the data, by passing the schema as a dictionary to the read_csv() function:

>>> schema = {
...     'ID': 'int32',
...     'Height': 'float16',
...     'Medal': 'category',
...     'NOC': 'category',
...     # add all your Column->dtype mappings
... }
>>> athletes = pd.read_csv('athlete_events.csv', dtype=schema)

Note: it’s not possible to use the Sparse dtype when loading the data this way; we still need to convert the sparse columns after the dataset is loaded.
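
For example, reusing the astype() call from the previous section on a freshly loaded DataFrame:

>>> athletes = pd.read_csv('athlete_events.csv')
>>> athletes['Medal'] = athletes['Medal'].astype('Sparse[category]')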

Loading only the columns you need

Depending on the application, we often don’t need the full set of columns in memory.

In our medal example, let’s say we simply want to compute the overall count of the medals per nation. For this specific use case, we only need to look at the columns Medal and NOC (National Olympic Committee).

We can pass the argument usecols to the read_csv() function:

>>> athletes = pd.read_csv('athlete_events.csv', usecols=['NOC', 'Medal'], dtype=schema)
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   NOC     271116 non-null  category
 1   Medal   39783 non-null   category
dtypes: category(2)
memory usage: 816.3 KB

If we only want to count the medals per NOC, we can use a groupby operation on the above DataFrame:

>>> athletes.groupby('NOC')['Medal'].count()
# (long output omitted)

It’s also worth noting that for this groupby operation there is a significant speed-up when using the categorical data types.

Loading only a subset of rows

To complete the picture, the read_csv() function also offers options to limit the number of rows we’re loading. This can be useful in a few circumstances, for example when we just want to take a peek at the data without looking at the whole dataset, or when the dataset is big enough that we can answer our analytical questions with a sample.

The first case, just taking a peek at the data, is straightforward:

>>> athletes = pd.read_csv('athlete_events.csv', nrows=1000)
>>> len(athletes)
1000

Using the nrows argument, we’ll load the first N records (1000 in the example above) into the DataFrame. This can often be enough to get a first feel for the data before digging into further analysis.

If we want to implement some random sampling instead, the read_csv() function also offers the skiprows argument. Specifically, if we pass a custom function to this argument, we can implement our sampling logic. The function takes one argument (the row number) and should return True if you want to skip that row.

In the example below we keep the first row, because it contains the column names, and we load only ~10% of the data using the function random(), which returns a random float in the [0, 1) range; if this number is greater than 0.1, we skip the row:

>>> from random import random
>>> def skip_record(row_number):
...     return random() > 0.1 and row_number > 0
... 
>>> athletes = pd.read_csv('athlete_events.csv', skiprows=skip_record)
>>> len(athletes)
27176  # 27K rows in the sample, ~271K in the full dataset
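
If you need the sample to be reproducible across runs, one option is to use a seeded random generator; a small sketch along the same lines (the seed value 42 is arbitrary):

>>> from random import Random
>>> rng = Random(42)
>>> def skip_record(row_number):
...     return rng.random() > 0.1 and row_number > 0
... 
>>> athletes = pd.read_csv('athlete_events.csv', skiprows=skip_record)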

Summary

In this article we have discussed some options to save memory with pandas by choosing the most appropriate data types and loading only the data we need for our analysis.

Do you need to upskill your team in pandas? Marco runs public and private training courses on Effective pandas and other Python topics; please get in touch to know more.
