What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (9/100)

In this article, I will explain the difference between the cube and rollup operators, and show how to use them in Apache Spark (or PySpark).

Both functions are used to group data by multiple columns and calculate an aggregation of another column. cube creates combinations of all values in all listed columns. rollup returns a hierarchy of values using the given columns starting from the first given column.

For example, if I was grouping values by columns year, month, and day to calculate the number of rows, cube would return the following combinations. In the second column, I explain what the corresponding group by statement is.

CUBE GROUP BY
year, month, day SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month SELECT COUNT(*) FROM table GROUP BY year, month
year, day SELECT COUNT(*) FROM table GROUP BY year, day
year SELECT COUNT(*) FROM table GROUP BY year
month, day SELECT COUNT(*) FROM table GROUP BY month, day
month SELECT COUNT(*) FROM table GROUP BY month
day SELECT COUNT(*) FROM table GROUP BY day
null, null, null SELECT COUNT(*) FROM table

If I used the rollup function, the grouping would look like this:

ROLLUP GROUP BY
year, month, day SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month SELECT COUNT(*) FROM table GROUP BY year, month
year SELECT COUNT(*) FROM table GROUP BY year
null SELECT COUNT(*) FROM table


Example in PySpark

Suppose that I have a DataFrame representing sales of financial products that contain the product category, the name of the salesperson, and the number of products sold.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+

When I use the cube function to calculate the sum of sales grouped by category and name, the code looks like this:

1
df.cube('category', 'name').agg(sum('how_many'))

As a result, I get a DataFrame of all combinations of the category and the name. It also includes the combinations where any (or both) of those columns are not defined:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+

When I am interested in sales grouped primarily by the category, but I also want to know who is the best salesperson in each category, I should use the rollup function:

1
df.rollup('category', 'name').agg(sum('how_many'))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group