What is the difference between CUBE and ROLLUP and how to use them in Apache Spark?

In this article, I will explain the difference between the cube and rollup operators, and show how to use them in Apache Spark (or PySpark).

Both functions group data by multiple columns and calculate an aggregate of another column. cube computes the aggregate for every possible combination of the listed columns. rollup computes it for a hierarchy of the given columns, starting from the first one and dropping one column at a time from the right.

For example, if I were grouping values by the columns year, month, and day to calculate the number of rows, cube would return the following combinations. In the second column, I show the equivalent GROUP BY statement.

CUBE             | GROUP BY
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month      | SELECT COUNT(*) FROM table GROUP BY year, month
year, day        | SELECT COUNT(*) FROM table GROUP BY year, day
year             | SELECT COUNT(*) FROM table GROUP BY year
month, day       | SELECT COUNT(*) FROM table GROUP BY month, day
month            | SELECT COUNT(*) FROM table GROUP BY month
day              | SELECT COUNT(*) FROM table GROUP BY day
null, null, null | SELECT COUNT(*) FROM table
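
The same grouping can be expressed in Spark SQL. A minimal sketch of the cube variant, assuming an active SparkSession named spark and a hypothetical temporary view called dates that contains the three columns:

spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM dates
    GROUP BY year, month, day WITH CUBE
""").show()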

If I used the rollup function, the grouping would look like this:

ROLLUP           | GROUP BY
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month      | SELECT COUNT(*) FROM table GROUP BY year, month
year             | SELECT COUNT(*) FROM table GROUP BY year
null, null, null | SELECT COUNT(*) FROM table
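
In Spark SQL, the rollup variant differs only in the last clause (same hypothetical dates view as in the sketch above):

spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM dates
    GROUP BY year, month, day WITH ROLLUP
""").show()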

Example in PySpark

Suppose that I have a DataFrame representing sales of financial products that contains the product category, the name of the salesperson, and the number of products sold.

+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+
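
A minimal sketch that builds this DataFrame, assuming a local SparkSession; the rows are copied from the table above:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('insurance', 'Janusz', 0),
        ('savings account', 'Grażyna', 1),
        ('credit card', 'Sebastian', 0),
        ('mortgage', 'Janusz', 2),
        ('term deposit', 'Janusz', 4),
        ('insurance', 'Grażyna', 2),
        ('savings account', 'Janusz', 5),
        ('credit card', 'Sebastian', 2),
        ('mortgage', 'Sebastian', 4),
        ('term deposit', 'Janusz', 9),
        ('insurance', 'Grażyna', 3),
        ('savings account', 'Grażyna', 1),
        ('savings account', 'Sebastian', 0),
        ('savings account', 'Sebastian', 2),
        ('credit card', 'Sebastian', 1),
    ],
    ['category', 'name', 'how_many'],
)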

When I use the cube function to calculate the sum of sales grouped by category and name, the code looks like this:

df.cube('category', 'name').agg(F.sum('how_many')).show()

As a result, I get a DataFrame with all combinations of category and name. It also includes rows in which either (or both) of those columns is undefined and shown as null:

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+
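
The nulls that cube produces mark subtotal rows, which can be confused with genuine nulls in the data. A sketch that uses the grouping function to flag the columns that were aggregated away (the alias names are mine):

df.cube('category', 'name').agg(
    F.sum('how_many'),
    F.grouping('category').alias('category_subtotal'),
    F.grouping('name').alias('name_subtotal'),
).show()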

When I am interested in sales grouped primarily by category, but I also want to know who the best salesperson in each category is, I should use the rollup function:

df.rollup('category', 'name').agg(F.sum('how_many')).show()
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+
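
To read the best salesperson per category directly off that output, I can sort the rollup result. A sketch, assuming the df from above; the total alias is my own name, not part of the original output:

df.rollup('category', 'name') \
    .agg(F.sum('how_many').alias('total')) \
    .orderBy('category', F.desc('total')) \
    .show()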