What is the difference between CUBE and ROLLUP and how to use them in Apache Spark?

In this article, I will explain the difference between the cube and rollup operators, and show how to use them in Apache Spark (or PySpark).

Both functions group data by multiple columns and calculate an aggregate of another column. cube computes the aggregate for every possible combination of the listed columns. rollup computes it for a hierarchy of the given columns, starting from the first one and dropping one column at a time from the right.

For example, if I were grouping values by the columns year, month, and day to calculate the number of rows, cube would return the following combinations. In the second column, I show the equivalent GROUP BY statement.

CUBE             | GROUP BY
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month      | SELECT COUNT(*) FROM table GROUP BY year, month
year, day        | SELECT COUNT(*) FROM table GROUP BY year, day
year             | SELECT COUNT(*) FROM table GROUP BY year
month, day       | SELECT COUNT(*) FROM table GROUP BY month, day
month            | SELECT COUNT(*) FROM table GROUP BY month
day              | SELECT COUNT(*) FROM table GROUP BY day
null, null, null | SELECT COUNT(*) FROM table
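
The same grouping can be expressed in Spark SQL. A minimal sketch of the cube variant, assuming an active SparkSession named spark and a hypothetical temporary view called dates that contains the three columns:

spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM dates
    GROUP BY year, month, day WITH CUBE
""").show()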

If I used the rollup function, the grouping would look like this:

ROLLUP           | GROUP BY
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month      | SELECT COUNT(*) FROM table GROUP BY year, month
year             | SELECT COUNT(*) FROM table GROUP BY year
null, null, null | SELECT COUNT(*) FROM table
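
In Spark SQL, the rollup variant differs only in the last clause (same hypothetical dates view as in the sketch above):

spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM dates
    GROUP BY year, month, day WITH ROLLUP
""").show()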

Example in PySpark

Suppose that I have a DataFrame representing sales of financial products that contains the product category, the name of the salesperson, and the number of products sold.

+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+
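
A minimal sketch that builds this DataFrame, assuming a local SparkSession; the rows are copied from the table above:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('insurance', 'Janusz', 0),
        ('savings account', 'Grażyna', 1),
        ('credit card', 'Sebastian', 0),
        ('mortgage', 'Janusz', 2),
        ('term deposit', 'Janusz', 4),
        ('insurance', 'Grażyna', 2),
        ('savings account', 'Janusz', 5),
        ('credit card', 'Sebastian', 2),
        ('mortgage', 'Sebastian', 4),
        ('term deposit', 'Janusz', 9),
        ('insurance', 'Grażyna', 3),
        ('savings account', 'Grażyna', 1),
        ('savings account', 'Sebastian', 0),
        ('savings account', 'Sebastian', 2),
        ('credit card', 'Sebastian', 1),
    ],
    ['category', 'name', 'how_many'],
)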

When I use the cube function to calculate the sum of sales grouped by category and name, the code looks like this:

df.cube('category', 'name').agg(F.sum('how_many')).show()

As a result, I get a DataFrame with all combinations of category and name. It also includes rows in which either (or both) of those columns is undefined and shown as null:

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+
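
The nulls that cube produces mark subtotal rows, which can be confused with genuine nulls in the data. A sketch that uses the grouping function to flag the columns that were aggregated away (the alias names are mine):

df.cube('category', 'name').agg(
    F.sum('how_many'),
    F.grouping('category').alias('category_subtotal'),
    F.grouping('name').alias('name_subtotal'),
).show()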

When I am interested in sales grouped primarily by category, but I also want to know who the best salesperson in each category is, I should use the rollup function:

df.rollup('category', 'name').agg(F.sum('how_many')).show()
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+
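
To read the best salesperson per category directly off that output, I can sort the rollup result. A sketch, assuming the df from above; the total alias is my own name, not part of the original output:

df.rollup('category', 'name') \
    .agg(F.sum('how_many').alias('total')) \
    .orderBy('category', F.desc('total')) \
    .show()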