How to speed up Pandas?

By default, Pandas runs its operations on a single CPU core, so there is a tremendous opportunity to speed it up even if you keep running the code on a single machine. This blog post lists three libraries you may want to try when you need your Pandas code to run faster.

Modin

Modin speeds up Pandas operations by running them on all available CPU cores. Modin re-implements (almost) all of the Pandas functions so that it can split a DataFrame into partitions and process them in parallel across the cores. Because of that, the API does not change, and all we need to do is replace the import:

import modin.pandas as pd

Of course, some Pandas functions are not implemented yet, but the authors promise around 90% API coverage.
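As a minimal sketch of the drop-in idea, the snippet below swaps only the import and leaves the rest of the code untouched. The fallback to plain Pandas, the column names, and the sample data are assumptions added for illustration, so the snippet also runs where Modin is not installed.

```python
# Prefer Modin when it is available; otherwise fall back to plain Pandas.
# The fallback is an assumption made for this sketch, not a Modin requirement.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Made-up sample data; the code below is the same for both libraries.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Rome", "Oslo", "Rome"],
        "sales": [10, 20, 30, 40],
    }
)

# A regular Pandas-style aggregation, written exactly as it would be in Pandas.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

On a DataFrame this small, Modin will not be faster than Pandas; the point is only that the code after the import stays the same.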

Swifter

Swifter improves only one Pandas function, apply, but it makes a huge difference when you use that function. Instead of always looping over the content of the DataFrame, Swifter first tries to run the function in a vectorized way. When that is not possible, it chooses between parallel processing with Dask and a plain Pandas apply, depending on which one it estimates to be faster. It also works with Modin DataFrames.

The setup is quite simple:

import pandas as pd
# or
import modin.pandas as pd

import swifter
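After the import, Swifter is used through the `swifter` accessor on a Series or DataFrame. The sketch below shows one way this might look. The `add_one` function and the sample data are hypothetical, and the fallback to the regular `apply` is an assumption added so the snippet still runs without Swifter installed.

```python
import pandas as pd

# Importing swifter registers the `.swifter` accessor on Pandas objects.
try:
    import swifter  # noqa: F401
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False  # fallback added only for this sketch

df = pd.DataFrame({"x": range(1000)})


def add_one(value):
    """Hypothetical function to apply; works on a scalar or a whole Series."""
    return value + 1


if HAVE_SWIFTER:
    # Swifter decides how to run the function (vectorized, Dask, or plain apply).
    result = df["x"].swifter.apply(add_one)
else:
    # Plain Pandas apply, called once per element.
    result = df["x"].apply(add_one)

print(result.head())
```

The only change in the calling code is the extra `.swifter` before `apply`.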

Dask

Finally, we can run a separate cluster to execute the code. With Dask, the setup is no longer trivial because it requires setting up the cluster and making a few modifications to the application code. However, it may be worth the effort because we can always add more workers to the cluster to get better results.

Of course, if you are trying to speed up the processing of a small amount of data (small = fits in memory on a laptop), Dask will not help you. The overhead of scheduling and parallelizing the tasks will most likely make the processing take longer than running the same code with plain Pandas on a laptop.

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz