How to determine the partition size in Apache Spark

Tuning the number of partitions and their size is one of the most important aspects of configuring Apache Spark. There is no easy answer.

If you choose to have many small partitions, the task distribution overhead will make everything painfully slow because your cluster will spend more time coordinating tasks and sending data between workers than doing the actual work.

On the other hand, if you use a few large partitions, some worker cores may have no tasks at all. You may also end up waiting a very long time for the last task to finish if one of the workers lags behind the others. And in the case of a worker failure, you will have to reprocess a huge amount of data.

The Apache Spark documentation recommends running 2-3 tasks per CPU core in the cluster. Therefore, if you have eight worker nodes and each node has four CPU cores (32 cores in total), you can configure the number of partitions to a value between 64 and 96.
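As a rough sketch, you can turn that rule of thumb into configuration when you build the session. The worker and core counts below are just the example numbers from this article, and the application name is made up; in practice, you would plug in your own cluster size.

```python
from pyspark.sql import SparkSession

# Example numbers from the text, not values Spark detects for you.
workers = 8
cores_per_worker = 4
total_cores = workers * cores_per_worker        # 32 cores
target_partitions = total_cores * 3             # 2-3 tasks per core -> 64-96 partitions

spark = (
    SparkSession.builder
    .appName("partition-sizing-example")
    # default number of partitions for RDD operations
    .config("spark.default.parallelism", target_partitions)
    # number of partitions produced by DataFrame shuffles (joins, groupBy, etc.)
    .config("spark.sql.shuffle.partitions", target_partitions)
    .getOrCreate()
)
```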

It is common to set the number of partitions so that the average partition size is between 100 MB and 1000 MB. If you have 30 GB of data to process, you should therefore have between 30 and 300 partitions.
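Here is a minimal sketch of that size-based rule of thumb. The 30 GB figure and the target of roughly 200 MB per partition are assumptions for illustration, and the input path is hypothetical; Spark will not measure the input size for you here.

```python
# Rule-of-thumb calculation: total data size divided by the desired partition size.
data_size_mb = 30 * 1024                # ~30 GB of input data (assumed)
target_partition_mb = 200               # anywhere between 100 and 1000 MB is reasonable
num_partitions = max(1, data_size_mb // target_partition_mb)   # ~150 partitions

df = spark.read.parquet("s3://example-bucket/input/")   # hypothetical input location
df = df.repartition(num_partitions)
```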

How to configure the default parallelism in Apache Spark?

In my article about speeding up Apache Spark, I recommended having many small tasks to avoid wasting processing time when a worker node has nothing to do.

It seems that the best method to configure the parallelism is to keep increasing it until you see the distribution overhead start to slow the whole Spark job down. If the cluster spends more time on I/O and shuffling than on executing the tasks, you should reduce the parallelism.
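One way to run that experiment is to rerun the same job with a few different settings and compare the stage timings in the Spark UI. This is only a sketch; the candidate values, the input and output paths, and the `customer_id` column are made up for the example.

```python
# Try increasing parallelism until the scheduling/shuffle overhead starts to dominate,
# then back off to the previous value.
for shuffle_partitions in (64, 128, 256, 512):
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    (
        spark.read.parquet("s3://example-bucket/input/")       # hypothetical input
        .groupBy("customer_id")                                # hypothetical column
        .count()
        .write.mode("overwrite")
        .parquet(f"s3://example-bucket/output/run-{shuffle_partitions}/")
    )
```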
