How to speed up Pandas?

Pandas runs most operations on a single CPU core, so there is a lot of room to speed it up even if you keep running the code on a single machine. This blog post lists three libraries worth trying when you need your Pandas code to run faster.

Modin

Modin speeds up Pandas operations by running them on all available CPU cores. It re-implements (almost) all of the Pandas functions so that they work on partitions of the DataFrame, and it distributes that work across the cores. Because the API does not change, all we need to do is replace the import:

import modin.pandas as pd

Of course, some Pandas functions are not implemented yet, and for those Modin falls back to the regular Pandas implementation. The authors promise around 90% API coverage.
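
To illustrate the drop-in idea, here is a minimal sketch. The column names and values are made up for the example, and the fallback to plain Pandas when Modin is not installed is my own addition so that the snippet runs anywhere. Apart from the import, the code is ordinary Pandas code.

```python
try:
    import modin.pandas as pd  # parallel, Pandas-compatible API
except ImportError:
    # Assumption for this sketch only: fall back to plain Pandas
    # so the example still runs when Modin is not installed.
    import pandas as pd

# Hypothetical sample data; in practice this would be a large DataFrame.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Berlin", "Paris", "Berlin"],
        "sales": [10, 20, 30, 40, 50],
    }
)

# The same call works with either import.
sales_per_city = df.groupby("city")["sales"].sum()
print(sales_per_city)
```

With a DataFrame this small, Modin will not be faster than Pandas. The point is only that the rest of the code stays unchanged.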

Swifter

Swifter improves only one Pandas function, apply, but it makes a huge difference when you rely on that function. Instead of always looping over the content of the DataFrame, Swifter first tries to run the function in a vectorized way. If that is not possible, it chooses between parallel processing with Dask and a plain Pandas apply, whichever it estimates to be faster. It can also work on top of Modin DataFrames.

The setup is quite simple:

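import pandas as pd
# or, to combine Swifter with Modin:
import modin.pandas as pd

# importing swifter registers the .swifter accessor on DataFrames and Series
import swifter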
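
After the import, apply is called through the swifter accessor instead of directly on the Series or DataFrame. The sketch below shows the idea. The to_fahrenheit function and the column names are hypothetical, and the fallback to a plain apply when Swifter is not installed is only there to keep the snippet runnable.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (the import registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    # Assumption for this sketch only: use a plain apply without Swifter.
    HAVE_SWIFTER = False


def to_fahrenheit(celsius):
    """Hypothetical function that we want to apply to every value."""
    return celsius * 9 / 5 + 32


df = pd.DataFrame({"celsius": [0.0, 25.0, 100.0]})

if HAVE_SWIFTER:
    # Swifter picks the execution strategy for us.
    df["fahrenheit"] = df["celsius"].swifter.apply(to_fahrenheit)
else:
    df["fahrenheit"] = df["celsius"].apply(to_fahrenheit)

print(df)
```

A function built from plain arithmetic, like this one, is a good candidate for vectorization, so Swifter should not need to parallelize it at all.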

Dask

Finally, we can run a separate cluster to execute the code. With Dask, the setup is no longer trivial because it requires setting up the cluster and making a few modifications to the application code. However, it may be worth the effort because we can always scale up the cluster to get better results.

Of course, if you are trying to speed up the processing of a small amount of data (small meaning it fits in memory on a laptop), Dask will not help you. The overhead of parallelizing the tasks will most likely make the processing take longer than running the same code with plain Pandas on a laptop.
