How to determine the partition size in Apache Spark

Tuning the number of partitions and their size is one of the most important aspects of configuring Apache Spark. There is no easy answer.

If you choose to have many small partitions, the task distribution overhead will make everything painfully slow because your cluster will spend more time coordinating tasks and sending data between workers than doing the actual work.

On the other hand, if you use a few large partitions, some worker cores may have no tasks at all. You may also end up waiting a very long time for the last task to finish if one of the workers lags behind the others. And in the case of a worker failure, you will have to reprocess a huge amount of data.

The Apache Spark documentation recommends running 2-3 tasks per CPU core in the cluster. Therefore, if you have eight worker nodes and each node has four CPU cores (32 cores in total), you can configure the number of partitions to a value between 64 and 96.
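As a rough sketch, you can turn that rule of thumb into configuration when you build the session. The worker and core counts below are just the example numbers from this article, and the application name is made up; in practice, you would plug in your own cluster size.

```python
from pyspark.sql import SparkSession

# Example numbers from the text, not values Spark detects for you.
workers = 8
cores_per_worker = 4
total_cores = workers * cores_per_worker        # 32 cores
target_partitions = total_cores * 3             # 2-3 tasks per core -> 64-96 partitions

spark = (
    SparkSession.builder
    .appName("partition-sizing-example")
    # default number of partitions for RDD operations
    .config("spark.default.parallelism", target_partitions)
    # number of partitions produced by DataFrame shuffles (joins, groupBy, etc.)
    .config("spark.sql.shuffle.partitions", target_partitions)
    .getOrCreate()
)
```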

It is common to set the number of partitions so that the average partition size is between 100 MB and 1000 MB. If you have 30 GB of data to process, you should therefore have between 30 and 300 partitions.
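Here is a minimal sketch of that size-based rule of thumb. The 30 GB figure and the target of roughly 200 MB per partition are assumptions for illustration, and the input path is hypothetical; Spark will not measure the input size for you here.

```python
# Rule-of-thumb calculation: total data size divided by the desired partition size.
data_size_mb = 30 * 1024                # ~30 GB of input data (assumed)
target_partition_mb = 200               # anywhere between 100 and 1000 MB is reasonable
num_partitions = max(1, data_size_mb // target_partition_mb)   # ~150 partitions

df = spark.read.parquet("s3://example-bucket/input/")   # hypothetical input location
df = df.repartition(num_partitions)
```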

How to configure the default parallelism in Apache Spark?

In my article about speeding up Apache Spark, I recommended having many small tasks to avoid wasting processing time when a worker node has nothing to do.

It seems that the best method to configure the parallelism is to keep increasing it until you see the distribution overhead start to slow the whole Spark job down. If the cluster spends more time on I/O and shuffling than on executing the tasks, you should reduce the parallelism.
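One way to run that experiment is to rerun the same job with a few different settings and compare the stage timings in the Spark UI. This is only a sketch; the candidate values, the input and output paths, and the `customer_id` column are made up for the example.

```python
# Try increasing parallelism until the scheduling/shuffle overhead starts to dominate,
# then back off to the previous value.
for shuffle_partitions in (64, 128, 256, 512):
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    (
        spark.read.parquet("s3://example-bucket/input/")       # hypothetical input
        .groupBy("customer_id")                                # hypothetical column
        .count()
        .write.mode("overwrite")
        .parquet(f"s3://example-bucket/output/run-{shuffle_partitions}/")
    )
```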
