Best Practices – Marco Bonzanini

In this article you’ll find some tips to reduce the amount of RAM used when working with pandas, the fundamental Python library for data analysis and data manipulation.

When dealing with large(ish) datasets, reducing the memory usage is something you need to consider if you’re stretching to the limits of using a single machine. For example, when you try to load some data from a big CSV file, you want to avoid your program crashing with a MemoryError. These tips can also help speeding up some downstream analytical queries.

The overall strategy boils down to choosing the right data types and loading only what you need. In this article you’ll learn about:

Finding out how much memory is used
Saving memory using categories
Saving memory using smaller number representations
Saving memory using sparse data (when you have a lot of NaN)
Choosing the right dtypes when loading the data
Loading only the columns you need
Loading only a subset of rows

Finding out how much memory is used

First, let’s look into some simple steps to observe how much memory is taken by a pandas DataFrame.

For the examples I’m using a dataset about Olympic history from Kaggle. The dataset is in CSV format and takes roughly 40Mb on disk.

>>> import pandas as pd
>>> athletes = pd.read_csv('athlete_events.csv')
>>> athletes.shape
(271116, 15)

There are ~271K records with 15 columns

For a breakdown of the memory usage, column by column, we can use memory_usage() on the whole DataFrame. The memory is reported in bytes:

>>> athletes.memory_usage(deep=True)
Index          128
ID         2168928
Name      20697535
Sex       15724728
Age        2168928
Height     2168928
Weight     2168928
Team      17734961
NOC       16266960
Games     18435888
Year       2168928
Season    17080308
City      17563109
Sport     18031019
Event     24146495
Medal      9882241
dtype: int64

The function also works for a single column:

>>> athletes['Name'].memory_usage(deep=True)
20697663

The difference between the two outputs is due to the memory taken by the index: when calling the function on the whole DataFrame, the Index has its own entry (128 bytes), while for a single column (i.e. a pandas Series) the memory used by the index is aggregated.

For an aggregated figure on the whole table, we can simply sum:

>>> athletes.memory_usage(deep=True).sum()
186408012  # roughly 178Mb

Why do we need deep=True? This flag will introspect the data deeply, reporting the actual system-level memory consumption. Without setting this flag, the function returns an estimate which could be quite far from the actual number, for example:

>>> athletes.memory_usage()
Index         128
ID        2168928
Name      2168928
Sex       2168928
Age       2168928
Height    2168928
Weight    2168928
Team      2168928
NOC       2168928
Games     2168928
Year      2168928
Season    2168928
City      2168928
Sport     2168928
Event     2168928
Medal     2168928
dtype: int64

>>> athletes['Name'].memory_usage()
2169056

Another way of getting the overall memory consumption is through the function info(), which is going to be useful because it also gives us information on the data types (dtype) used by the DataFrame. Notice again the use of deep introspection for the memory usage:

>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 177.8 MB

Notice how all the string fields are loaded as object, while all the numerical fields use a 64-bit representation, due to the architecture of the local machine — it could be 32-bit with older hardware.

Saving memory using categories

Some of the variables in our dataset are categorical, meaning they only have a handful of possible values. Rather than using a generic object for these variables, when appropriate we can use the more relevant Categorical dtype in pandas. For example, good candidates for this data type include the variables Medal, Season, or Team, amongst others.

If you don’t have a full description of the data, in order to decide which columns should be treated as categorical, you can simply observe the number of unique values to confirm this is much smaller than the dataset size:

>>> athletes['Medal'].unique()
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)

There are only three different values, plus the null value nan.

Observe the difference in memory consumption between using object and using categories:

>>> athletes['Medal'].memory_usage(deep=True)
9882369  # 9.4+ Mb
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539  # 0.25 Mb

Besides saving memory, another advantage of using categorical data in pandas is that we can include a notion of logical order between the values, different from the lexical order.

Saving memory using smaller number representations.

Let’s look at some numerical variables, for example ID (int64), Height (float64) and Year (int64).

You can observe their range by checking the minimum and maximum values:

>>> athletes['ID'].min(), athletes['ID'].max()
(1, 135571)

The int64 dtype is able to hold numbers on a much broader range, at the price of a much bigger memory footprint:

>>> import numpy as np
>>> np.iinfo('int64')  # integer info
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

Using int32 for the column ID is enough to store its values and it will save us half of the memory space:

>>> athletes['ID'].memory_usage(deep=True)
2169056
>>> athletes['ID'].astype('int32').memory_usage(deep=True)
1084592

We can do the same with floats:

>>> athletes['Height'].min(), athletes['Height'].max()
(127.0, 226.0)

In this case, a float16 is enough, and costs a quarter of the memory price:

>>> athletes['Height'].memory_usage(deep=True)
2169056
>>> athletes['Height'].astype('float16').memory_usage(deep=True)
542360

Finally, let’s look at the variable Year:

>>> athletes['Year'].min(), athletes['Year'].max()
(1896, 2016)

In this case, it looks like an int16 would be enough. On a closer look though, we can consider the Year column to be categorical as well: there’s only a handful of possible values. Let’s check the difference:

>>> athletes['Year'].memory_usage(deep=True)
2169056
>>> athletes['Year'].astype('int16').memory_usage(deep=True)
542360
>>> athletes['Year'].astype('category').memory_usage(deep=True)
272596

For this particular situation, it makes more sense to use categories rather than numbers, unless we plan on performing arithmetic operations on this column (you cannot sum or multiply two categories).

Saving memory using sparse data (when you have a lot of NaN)

The sparse dtypes in pandas are useful wen dealing with columns that have a lot of null values. Depending on your variables, you may want to consider representing your data as sparse. The info() function used earlier tells us how many non-null records we have for each column, so if that number is much lower than the size of the dataset, it means we have a lot of null values.

This is exactly the case of the Medal column that we treated as categorical earlier:

>>> athletes['Medal'].memory_usage(deep=True)
9882369
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539
>>> athletes['Medal'].astype('Sparse[category]').memory_usage(deep=True)
199067

Choosing the right dtypes when loading the data

So far we have looked at the memory usage of different dtypes, converting the columns after the dataset was loaded.

Once we have chosen the desired dtypes, we can make sure they are used when loading the data, by passing the schema as a dictionary to the read_csv() function:

>>> schema = {
... 'ID': 'int32',
... 'Height': 'float16',
... # add all your Column->dtype mappings
... }
>>> athletes = pd.read_csv('athlete_events.csv', dtype=schema)

Note: it’s not possible to use the Sparse dtype when loading the data in this way, we still need to convert the sparse columns after the dataset is loaded.

Loading only the columns you need

Depending on the application, we often don’t need the full set of columns in memory.

In our medal example, let’s say we simply want to compute the overall count of the medals per nation. For this specific use case, we only need to look at the columns Medal and NOC (National Olympic Committee).

We can pass the argument usecols to the read_csv() function:

>>> athletes = pd.read_csv('athlete_events.csv', usecols=['NOC', 'Medal'], dtype=schema)
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   NOC     271116 non-null  category
 1   Medal   39783 non-null   category
dtypes: category(2)
memory usage: 816.3 KB

If we only want to count the medals per NOC, we can use a groupby operation on the above DataFrame:

>>> athletes.groupby('NOC')['Medal'].count()
# (long output omitted)

It’s also worth noting that for this groupby operation there is a significant speed-up when using the categorical data types.

Loading only a subset of rows

To complete the picture, the read_csv() function also offers options to limit the number of rows we’re loading. This can be useful in a few circumstances, for example when we just want to take a peek at the data without looking at the whole dataset, or when the dataset is big enough that we can answer our analytical questions with a sample.

The first case, just taking a peek at the data, is straightforward:

>>> athletes = pd.read_csv('athlete_events.csv', nrows=1000)
>>> len(athletes)
1000

Using the nrows argument, we’ll load the first N records (1000 in the example above) into the DataFrame. This can often be enough to have a first feeling for the data before digging into further analysis.

If we want to implement some random sampling instead, the read_csv() function also offers the skiprows argument. Specifically, if we pass a custom function to this argument, we can implement our sampling logic. The function takes one argument (the row number) and should return True if you want to skip that row.

In the example, we want to keep the first row because it has the column names, and we load only ~10% of the data, using the function random() which returns a random float in the [0, 1) range (if this number is greater than 0.1, we skip the row):

>>> from random import random
>>> def skip_record(row_number):
...     return random() > 0.1 and row_number > 0
... 
>>> athletes = pd.read_csv('athlete_events.csv', skiprows=skip_record)
>>> len(athletes)
27176  # 27K rows in the sample, ~271K in the full dataset

Summary

In this article we have discussed some options to save memory with pandas choosing the most appropriate data types and loading only the data that we need for our analysis.

Do you need to upskill your team in pandas? Marco runs public and private training courses on Effective pandas and other Python topics, please get in touch to know more.

Follow Marco on Twitter.

This article contains some notes about the development of Python modules and packages, as well as brief overview on how to distribute a package in order to make it easy to install via pip.

Modules vs Packages in Python

Firstly, let’s start from the distinction between modules and packages, which is something sligthly different from language to language.

In Python, a simple source file containing the definitions of functions, classes and variables is a module. Once your application grows, you can organise your code into different files (modules) so that you can keep your sources tidy and clean, and you can re-use some of the functionalities in other applications.

On the other side, a package is a folder containing a __init__.py file, as well as other different Python source files. Typically a package contains several modules and sub-packages.

For example, you could have a foobar.py file where you declare a hello() function. You can re-use the function in different ways:

# import whole module and use its namespace
import foobar
foobar.hello()
# import specific function in local namespace
from foobar import hello
hello()
# import specific function in local namespace, create an alias
from foobar import hello as hi
hi()
# import all module declarations in local namespace
from foobar import *
hello()

The last option is usually considered sub-optimal, because you’re going to pollute the local namespace causing potential name conflicts. For example, assuming you imported some maths libraries and you’re using the log() function, is it coming from math.log() or numpy.log()? I usually aim for clarity when I choose which option is more suitable for a particular case.

Similarly, you can import a package, a particular definition, a sub-package, etc.

Notice: the import command will look for modules and packages in the working directory as well as folders declared in the Python path. You can find out where your libraries are stored by looking at:

import sys
print(sys.path)

The Python path can be extended with user-specific folders by overriding the $PYTHONPATH environment variable.

This means that if you want to make a particular module/package available to an application, it must either be in the working directory or in one of the folders dedicated to Python libraries. The latter option is usually achieved via the creation of an installation script.

Setup Tools and setup.py

As part of the Python Standard Library, the main component to develop installation scripts is distutils. However, to overcome its limitations, setuptools is now the recommended options.

By creating a setup.py script in the parent folder of your package, you can make it easy to install if you share it via Github or if you make it available for pip.

The basic structure of setup.py looks like:

from setuptools import setup

long_description = 'Looong description of your package, e.g. a README file'

setup(name='yourpackage', # name your package
      packages=['yourpackage'], # same name as above
      version='1.0.0', 
      description='Short description of your package',
      long_description=long_description,
      url='http://example.org/yourpackage',
      author='Your Name',
      author_email='your.name@example.org',
      license='MIT') # choose the appropriate license

The source code of the package should be put into a folder names with the package name itself, while the setup script should be in the parent directory together with the documentation. This is an example of source structure:

.
├── LICENSE
├── README.rst
├── setup.py
└── yourpackage
    ├── __init__.py
    ├── some_module.py
    ├── other_module.py
    └── sub_package
        ├── __init__.py
        └── more_modules.py

The LICENSE and README.rst files are documentation, the setup.py file is the installation script as above, while the whole source code of the package with its components is under the yourpackage folder.

You could install the package and make it available for any of your Python apps with:

python setup.py install

If you publish the above structure on a public repository, e.g. on Gibhub, anyone could easily install it with:

git clone https://www.github.com/yourname/yourpackage
cd yourpackage
python setup.py install

PyPI as Public Repo

PyPI, the Python Package Index, also known as the CheeseShop, is where developers can publish their Python packages to make them available for easy installation via pip.

Once your package is ready to be published, you’ll need to register your account on PyPI. You should also register your new package on PyPI: you can do so using the web form on the PyPI website.

Once your account is ready, create a file called .pypirc in your home folder:

$ cat ~/.pypirc 

[distutils]
index-servers=pypi

[pypi]
repository = https://pypi.python.org/pypi
username = your-username
password = your-password

Now you’re ready to push your package to the publish index:

python setup.py sdist upload

The sdist command will create the package to distribute, while the upload command will push it to the public repository using the information that you stored in ~/.pypirc.

At this point, you can install your brand new Python package on any machine by typing:

pip install yourpackage

Conclusion

Organising your code into modules and packages will help keeping your codebase clean. In particular, packing your code into meaningful packages will improve code re-use. There are only a few simply steps to follow in order to create a Python package that can be easily distributed, and if you decide to do so, the Python Package Index is the obvious choice.

Category: Best Practices

Tips for saving memory with pandas