How to make a pivot table in AWS Athena or PrestoSQL

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (4/100)

In the previous article, I showed how to make a pivot DataFrame using Apache Spark. This blog post shows the same operation in AWS Athena. Because Athena is built on top of PrestoDB, this method will also work in PrestoDB and PrestoSQL.

Let’s use the same data as in the previous text. We have the following table that describes the sales of financial products. We have the category, the name of the salesperson, and the number of products sold.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+

We want to create a pivot table that contains the category, and separate columns for every person. The rows will contain the sum of products sold by each person in every product category.

There Is No Pivot Function in Athena

The first problem arises because Athena and PrestoSQL don’t have a PIVOT function. We need to use weird workaround. First, we have to group all salespeople into an array and store those arrays inside a map.

1
2
3
SELECT category, multimap_agg(name, how_many) as agg
FROM sales
GROUP BY category

This query creates an output which should look like this:

1
2
3
4
5
6
7
8
9
+---------------+-----------------------------------------------+
|       category|                                           agg |
+---------------+-----------------------------------------------+
|      insurance|                   {Janusz=[0], Grażyna=[2, 3]}|
|savings account|  {Janusz=[5], Grażyna=[1, 1], Sebastian=[0, 2]|
|    credit card|                          {Sebastian=[0, 2, 1]}|
|       mortgage|                    {Janusz=[2], Sebastian=[4]}|
|   term deposit|                                 {Janusz=[4, 9]|
+---------------+-----------------------------------------------+

Now, we have to use the reduce function to calculate the sum of every person’s sales. The previous SELECT statement becomes a subquery of the following SQL:

1
2
3
4
5
6
7
8
9
10
SELECT
  category,
  reduce(agg['Grażyna'], 0.0, (s, x) -> s + x, s -> s) AS "Grażyna",
  reduce(agg['Janusz'], 0.0, (s, x) -> s + x, s -> s) AS "Janusz",
  reduce(agg['Sebastian'], 0.0, (s, x) -> s + x, s -> s) AS "Sebastian"
FROM (
	SELECT category, multimap_agg(name, how_many) as agg
	FROM sales
	GROUP BY category
)

Finally, we get the result:

1
2
3
4
5
6
7
8
9
+---------------+-------+------+---------+
|       category|Grażyna|Janusz|Sebastian|
+---------------+-------+------+---------+
|savings account|      2|     5|        2|
|   term deposit|   null|    13|     null|
|       mortgage|   null|     2|        4|
|    credit card|   null|  null|        3|
|      insurance|      5|     0|     null|
+---------------+-------+------+---------+

Why There Is No Pivot Function

Unfortunately, it is not possible to automatically generate the column names, and we have to list all of the values we want and use the reduce function to extract them. In PrestoSQL, the schema must be available during query planning, so the analyzer can access all the information it needs.

In the case of the pivot function, the schema depends on the data in the table, so PrestoSQL (and Athena) would have to execute half of the query, look at the available values, and rewrite the execution plan of the subsequent operations. This limitation is also the reason why the pivot function does not exist.

For a detailed explanation, please take a look at the discussion on the PrestoSQL GitHub page: https://github.com/prestosql/presto/issues/1206

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.