How to make a pivot table in AWS Athena or PrestoSQL

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (4/100)

In the previous article, I showed how to make a pivot DataFrame using Apache Spark. This blog post shows the same operation in AWS Athena. Because Athena is built on top of PrestoDB, this method will also work in PrestoDB and PrestoSQL.

Let’s use the same data as in the previous text. We have the following table that describes the sales of financial products. We have the category, the name of the salesperson, and the number of products sold.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+

We want to create a pivot table that contains the category, and separate columns for every person. The rows will contain the sum of products sold by each person in every product category.

There Is No Pivot Function in Athena

The first problem arises because Athena and PrestoSQL don’t have a PIVOT function. We need to use weird workaround. First, we have to group all salespeople into an array and store those arrays inside a map.

1
2
3
SELECT category, multimap_agg(name, how_many) as agg
FROM sales
GROUP BY category

This query creates an output which should look like this:

1
2
3
4
5
6
7
8
9
+---------------+-----------------------------------------------+
|       category|                                           agg |
+---------------+-----------------------------------------------+
|      insurance|                   {Janusz=[0], Grażyna=[2, 3]}|
|savings account|  {Janusz=[5], Grażyna=[1, 1], Sebastian=[0, 2]|
|    credit card|                          {Sebastian=[0, 2, 1]}|
|       mortgage|                    {Janusz=[2], Sebastian=[4]}|
|   term deposit|                                 {Janusz=[4, 9]|
+---------------+-----------------------------------------------+

Now, we have to use the reduce function to calculate the sum of every person’s sales. The previous SELECT statement becomes a subquery of the following SQL:

1
2
3
4
5
6
7
8
9
10
SELECT
  category,
  reduce(agg['Grażyna'], 0.0, (s, x) -> s + x, s -> s) AS "Grażyna",
  reduce(agg['Janusz'], 0.0, (s, x) -> s + x, s -> s) AS "Janusz",
  reduce(agg['Sebastian'], 0.0, (s, x) -> s + x, s -> s) AS "Sebastian"
FROM (
	SELECT category, multimap_agg(name, how_many) as agg
	FROM sales
	GROUP BY category
)

Finally, we get the result:

1
2
3
4
5
6
7
8
9
+---------------+-------+------+---------+
|       category|Grażyna|Janusz|Sebastian|
+---------------+-------+------+---------+
|savings account|      2|     5|        2|
|   term deposit|   null|    13|     null|
|       mortgage|   null|     2|        4|
|    credit card|   null|  null|        3|
|      insurance|      5|     0|     null|
+---------------+-------+------+---------+


Why There Is No Pivot Function

Unfortunately, it is not possible to automatically generate the column names, and we have to list all of the values we want and use the reduce function to extract them. In PrestoSQL, the schema must be available during query planning, so the analyzer can access all the information it needs.

In the case of the pivot function, the schema depends on the data in the table, so PrestoSQL (and Athena) would have to execute half of the query, look at the available values, and rewrite the execution plan of the subsequent operations. This limitation is also the reason why the pivot function does not exist.

For a detailed explanation, please take a look at the discussion on the PrestoSQL GitHub page: https://github.com/prestosql/presto/issues/1206


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group