What is the difference between repartition and coalesce in Apache Spark?
In Spark, there are two ways of explicitly changing the number of partitions. We can use either the
repartition function or
coalesce. Both accept a numeric parameter, which tells Spark how many partitions we want to have. I will explain the difference between those two functions and when we should use them.
First of all, the
repartition function that performs operations on Spark
Datasets comes in three versions. There is a function that accepts only the number of partitions we want to get. In this case, Spark performs
RoundRobinPartitioning to uniformly distribute the whole
Dataset across the requested number of partitions.
If we specify both the expression and the number of partitions, Spark does
hashpartitioning, which produces the requested number of partitions in which data is partitioned using the given expression.
Note that, the result will most likely not be uniformly distributed. We may end up with partitions that contain significantly more data if the partitioning value is skewed.
In addition to that, we may use the function that accepts only the partitioning expression and uses the
sql.shuffle.partitions parameter as the number of partitions.
We see that the
repartition function can both increase and decrease the number of partitions. Suppose we do not use the partitioning expression. In that case, we get partitions that distribute the workload equally across all worker nodes (unless we use another operation that implicitly repartitions the data using
Coalesce is a little bit different. It accepts only one parameter - there is no way to use the partitioning expression, and it can only decrease the number of partitions. It works this way because we should use
coalesce only to combine the existing partitions. It merges the data by draining existing partitions into others and removing the empty partitions.
Note that if we specify the number of partitions larger than the current partitioning, coalesce will not do anything.
When to Use Coalesce Instead of Repartition?
According to the documentation, it is better to run
repartition instead of
coalesce when we want to do “drastic coalesce” (for example, merge everything into one partition). When we do it, the upstream calculations execute in parallel, so the entire computation is distributed across a larger number of workers.
I can think of only one use case when I am sure that the coalesce function is a better choice. If I have just used
filter, and I think that some of the partitions may be “almost empty”, I will want to move the data around to have a similar number of rows in every partition.
coalesce will not give me uniformed distribution, but it is going to be faster then
repartition, and it should be good enough in this case.
Is Coalesce Worth Consideration In Practice?
Most likely no. Usually, the Spark jobs are full of the
group by and
join operations, so avoiding full reshuffles using coalesce in a few places will not make a huge difference.
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- How to flatten a struct in a Spark DataFrame?
- How to use the window function to get a single row from each group in Apache Spark
- How to read multiple Parquet files with different schemas in Apache Spark
- What is shuffling in Apache Spark, and when does it happen?
- How to determine the partition size in Apache Spark
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz