Speed up counting the distinct elements in a Spark DataFrame

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (46/100)

Why does counting the unique elements in Spark take so long? Let’s look at the classical example used to demonstrate big data problems: counting words in a book.

When we use Spark to do that, it first finds the unique words within every partition, reshuffles the data using the words as the partitioning keys (so all occurrences of a particular word end up in the same partition), and then counts the deduplicated words to get the final result.

We immediately see that exchanging data between nodes in the cluster slows down the whole operation. It is also quite memory-intensive because Spark must keep track of every distinct word in the text.
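
Here is a minimal sketch of the exact version (the DataFrame df and the word column are placeholder names):

import org.apache.spark.sql.functions.countDistinct

// Exact distinct count: Spark deduplicates within every partition,
// shuffles the unique values, and merges the partial results
df.agg(countDistinct("word")).show()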

Suppose we don’t need the accurate count, and an approximation is good enough. In that case, we can count the unique values using the approx_count_distinct function (there is also a variant that lets us specify the maximum estimation error as a relative standard deviation).

When we use that function, Spark estimates the number of distinct elements using a variant of the HyperLogLog algorithm (HyperLogLog++).

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("column_name"))
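
If the default accuracy is not good enough, we can pass the maximum relative standard deviation as the second argument. This is a sketch with an arbitrary example value:

// allow at most 5% relative standard deviation in the estimate
df.agg(approx_count_distinct("column_name", 0.05))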
