Speed up counting the distinct elements in a Spark DataFrame
Why does counting the unique elements in Spark take so long? Let’s look at the classical example used to demonstrate big data problems: counting words in a book.
When we use Spark to do that, it calculates the number of unique words in every partition, reshuffles the data using the words as the partitioning keys (so all counts of a particular word end up in the same cluster), and sums the partial aggregates to get the final result.
We immediately see that exchanging partial counts between nodes in the cluster will slow down the whole operation. It is also quite memory-intensive because we must store a counter for every word in the text.
Suppose we don’t need the accurate count, and an approximation is good enough. In that case, we can count the unique values using the
approx_count_distinct function (there is also a version that lets you define the maximal approximation error).
When we use that function, Spark counts the distinct elements using a variant of the HyperLogLog algorithm.
1 2 3 import org.apache.spark.sql.functions.approx_count_distinct df.agg(approx_count_distinct("column_name"))
You may also like
- How to derive multiple columns from a single column in a PySpark DataFrame
- How to add dependencies as jar files or Python scripts to PySpark
- Testing data products: BDD for data engineers
- Use regexp_replace to replace a matched string with a value of another column in PySpark
- How to write to a SQL database using JDBC in PySpark