How to determine the partition size in Apache Spark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (42/100)

Tuning the number of partitions and their size is one of the most important aspects of configuring Apache Spark. There is no easy answer: the right values depend on your data and your cluster.

If you choose to have many small partitions, the task distribution overhead will make everything painfully slow because your cluster will spend more time coordinating tasks and sending data between workers than doing the actual work.

On the other hand, if you use a few large partitions, some worker cores may have no tasks at all. You may also end up waiting a long time for the last task to complete if one of the workers is lagging behind the others. And in the case of a worker failure, you will have to reprocess a huge amount of data.

The Apache Spark documentation recommends using 2-3 tasks per CPU core in the cluster. Therefore, if you have eight worker nodes and each node has four CPU cores, you can configure the number of partitions to a value between 64 and 96.
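As a back-of-the-envelope sketch in PySpark (assuming the eight-worker, four-core cluster from the example above; the application name is made up), you could compute that value and apply it when building the session:

    from pyspark.sql import SparkSession

    # Cluster size from the example above (an assumption; adjust to yours)
    workers = 8
    cores_per_worker = 4
    tasks_per_core = 3  # the documentation's 2-3 tasks per core, upper end

    num_partitions = workers * cores_per_worker * tasks_per_core  # 96

    spark = (
        SparkSession.builder
        .appName("partition-tuning")  # hypothetical application name
        # default number of partitions for RDD shuffle operations
        .config("spark.default.parallelism", str(num_partitions))
        # number of partitions produced by DataFrame shuffles (joins, aggregations)
        .config("spark.sql.shuffle.partitions", str(num_partitions))
        .getOrCreate()
    )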

It is common to set the number of partitions so that the average partition size is between 100 MB and 1000 MB. If you have 30 GB of data to process, that means between 30 partitions (1000 MB each) and 300 partitions (100 MB each).
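To turn that rule of thumb into code, here is a minimal sketch; the helper function, the 200 MB target, and the already loaded DataFrame df are all assumptions for illustration:

    def partitions_for(total_size_mb: int, target_partition_mb: int = 200) -> int:
        # Aim for an average partition size inside the 100-1000 MB band
        return max(1, total_size_mb // target_partition_mb)

    # 30 GB of input data, as in the example above
    num_partitions = partitions_for(30 * 1024)  # 153

    # df is assumed to be an existing DataFrame; repartition shuffles it
    df = df.repartition(num_partitions)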

How to configure the default parallelism in Apache Spark?

In my article about speeding up Apache Spark, I recommended having many small tasks to avoid wasting processing time when a worker node has nothing to do.

It seems that the best method of configuring the parallelism parameter is to keep increasing it until you see that the distribution overhead starts slowing the whole Spark job down. If the cluster spends more time on I/O and shuffling data than on executing the tasks, you should reduce the parallelism.
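While experimenting, it helps to verify how the data is actually partitioned. A small PySpark sketch, again assuming an existing DataFrame df:

    # How many partitions does the DataFrame have right now?
    print(df.rdd.getNumPartitions())

    # Increase the parallelism: a full shuffle into 96 partitions
    df = df.repartition(96)
    print(df.rdd.getNumPartitions())  # 96

    # If the overhead gets too high, coalesce merges partitions
    # without performing a full shuffle
    df = df.coalesce(32)
    print(df.rdd.getNumPartitions())  # 32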

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
