How to determine the partition size in Apache Spark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (42/100)

Tuning the number of partitions and their size is one of the most important aspects of configuring Apache Spark. There is no easy answer.

If you choose to have many small partitions, the task distribution overhead will make everything painfully slow because your cluster will spend more time coordinating tasks and sending data between workers than doing the actual work.

On the other hand, if you use a few large partitions, some worker cores may have no tasks. You may end up waiting very long to complete the one last task if one of the workers is lagging behind the other. In the case of a worker failure, you will have to reprocess a huge amount of data.

The Apache Spark documentation recommends using 2-3 tasks per CPU core in the cluster. Therefore, if you have eight worker nodes and each node has four CPU cores, you can configure the number of partitions to a value between 64 and 96.

It is common to set the number of partitions so that the average partition size is between 100-1000 MBs. If you have 30 GB of data to process, you should have between 30 and 300 partitions.

Subscribe to the newsletter and join the free email course.

How to configure the default parallelism in Apache Spark?

In my article about speeding up Apache Spark, I recommended having many small tasks to avoid wasting processing time when a worker node has nothing to do.

It seems that the best method to configure the parallelism parameter is to keep increasing it until you see that the distribution overhead starts slowing the whole Spark job down. If you see that the Spark cluster spends more time doing IO and reshuffling than executing the tasks, you should reduce the parallelism.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.