How to speed up Pandas?

Is the Pandas library too slow? Here are three libraries that can speed it up!

Bartosz Mikulski 21 May 2021 – 1 min read

By default, the Pandas library runs its operations on a single CPU core, so there is a lot of room to speed it up even if you keep running the code on a single machine. This blog post lists three libraries you may want to try when you need your Pandas code to run faster.

Modin

Modin speeds up Pandas operations by running them on all available CPU cores. Modin re-implements (almost) all of the Pandas functions so that the data is split into partitions and the work is distributed across the cores. Because of that, the API does not change, and all we need to do is replace the import:

import modin.pandas as pd

Of course, some Pandas functions are not implemented yet, but the authors promise around 90% API coverage.
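To make the "API does not change" point concrete, here is a minimal sketch of the import swap. The fallback to plain Pandas and the sample data are assumptions added for illustration, so the snippet also runs where Modin is not installed. Note that Modin needs an execution engine (for example Ray or Dask) installed next to it.

```python
# Drop-in swap: try Modin first, fall back to plain Pandas if it is missing.
# The fallback is an illustrative assumption, not something Modin requires.
try:
    import modin.pandas as pd  # needs an execution engine such as Ray or Dask
except ImportError:
    import pandas as pd

# Hypothetical data, used only to show that the calls stay the same.
df = pd.DataFrame({"city": ["A", "B", "A", "B"], "sales": [10, 20, 30, 40]})

# The same groupby call works with either import.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

On a DataFrame this small, Modin will not be faster than Pandas. The benefit should show up only when there is enough data to keep several cores busy.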

Swifter

Swifter improves only one Pandas function, apply, but it makes a huge difference when you rely on that function. Instead of always looping over the content of the DataFrame, it looks for a faster way to run the function: it can vectorize the operation, parallelize it with Dask, or work on top of a Modin DataFrame.

The setup is quite simple:

import pandas as pd
# or
import modin.pandas as pd

import swifter
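After the import, the only change in the code is calling apply through the swifter accessor. The sketch below assumes a hypothetical price_band function and sample data. The fallback to a plain apply is also an assumption, added so the snippet runs where Swifter is not installed.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (the import registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def price_band(price):
    """Hypothetical per-value function with a branch, so it cannot be
    applied to a whole column at once."""
    return "high" if price > 15 else "low"


df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

if HAVE_SWIFTER:
    # Same call as a regular apply, routed through Swifter.
    df["band"] = df["price"].swifter.apply(price_band)
else:
    # Plain Pandas fallback for environments without Swifter.
    df["band"] = df["price"].apply(price_band)

print(df)
```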

Dask

Finally, we can run a separate cluster to execute the code. With Dask, the setup is no longer trivial because it requires setting up the cluster and making a few modifications to the application code. However, it may be worth the effort because we can always scale up the cluster to get better results.

Of course, if you try to speed up processing a small amount of data (small = fits in memory on a laptop), Dask will not help you. The overhead of parallelizing the tasks will most likely lead to a longer processing time than running the same code on a laptop.
