How to determine the partition size in Apache Spark
Tuning the number of partitions and their size is one of the most important aspects of configuring Apache Spark. There is no easy answer.
If you choose to have many small partitions, the task distribution overhead will make everything painfully slow because your cluster will spend more time coordinating tasks and sending data between workers than doing the actual work.
On the other hand, if you use a few large partitions, some worker cores may have no tasks. You may end up waiting very long to complete the one last task if one of the workers is lagging behind the other. In the case of a worker failure, you will have to reprocess a huge amount of data.
What is the recommended number of partitions?
The Apache Spark documentation recommends using 2-3 tasks per CPU core in the cluster. Therefore, if you have eight worker nodes and each node has four CPU cores, you can configure the number of partitions to a value between 64 and 96.
What is the recommended partition size?
It is common to set the number of partitions so that the average partition size is between 100-1000 MBs. If you have 30 GB of data to process, you should have between 30 and 300 partitions.
How to configure the default parallelism in Apache Spark?
In my article about speeding up Apache Spark, I recommended having many small tasks to avoid wasting processing time when a worker node has nothing to do.
It seems that the best method to configure the parallelism parameter is to keep increasing it until you see that the distribution overhead starts slowing the whole Spark job down. If you see that the Spark cluster spends more time doing IO and reshuffling than executing the tasks, you should reduce the parallelism.
You may also like
- Broadcast variables and broadcast joins in Apache Spark
- How to combine two DataFrames with no common columns in Apache Spark
- How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
- Check-Engine - data quality validation for PySpark 3.0.0
- How to concatenate columns in a PySpark DataFrame