What is the difference between repartition and coalesce in Apache Spark?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (3/100)

In Spark, there are two ways of explicitly changing the number of partitions. We can use either the repartition function or coalesce. Both accept a numeric parameter, which tells Spark how many partitions we want to have. I will explain the difference between those two functions and when we should use them.

Repartition

First of all, the repartition function that performs operations on Spark Datasets comes in three versions. There is a function that accepts only the number of partitions we want to get. In this case, Spark performs RoundRobinPartitioning to uniformly distribute the whole Dataset across the requested number of partitions.

If we specify both the expression and the number of partitions, Spark does hashpartitioning, which produces the requested number of partitions in which data is partitioned using the given expression.

Note that, the result will most likely not be uniformly distributed. We may end up with partitions that contain significantly more data if the partitioning value is skewed.

In addition to that, we may use the function that accepts only the partitioning expression and uses the sql.shuffle.partitions parameter as the number of partitions.

We see that the repartition function can both increase and decrease the number of partitions. Suppose we do not use the partitioning expression. In that case, we get partitions that distribute the workload equally across all worker nodes (unless we use another operation that implicitly repartitions the data using hashpartitioning).



Coalesce

Coalesce is a little bit different. It accepts only one parameter - there is no way to use the partitioning expression, and it can only decrease the number of partitions. It works this way because we should use coalesce only to combine the existing partitions. It merges the data by draining existing partitions into others and removing the empty partitions.

Note that if we specify the number of partitions larger than the current partitioning, coalesce will not do anything.

When to Use Coalesce Instead of Repartition?

According to the documentation, it is better to run repartition instead of coalesce when we want to do “drastic coalesce” (for example, merge everything into one partition). When we do it, the upstream calculations execute in parallel, so the entire computation is distributed across a larger number of workers.

I can think of only one use case when I am sure that the coalesce function is a better choice. If I have just used where or filter, and I think that some of the partitions may be “almost empty”, I will want to move the data around to have a similar number of rows in every partition. coalesce will not give me uniformed distribution, but it is going to be faster then repartition, and it should be good enough in this case.

Is Coalesce Worth Consideration In Practice?

Most likely no. Usually, the Spark jobs are full of the group by and join operations, so avoiding full reshuffles using coalesce in a few places will not make a huge difference.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group