How to make a pivot table in AWS Athena or PrestoSQL

In the previous article, I showed how to make a pivot DataFrame using Apache Spark. This blog post shows the same operation in AWS Athena. Because Athena is built on top of PrestoDB, this method will also work in PrestoDB and PrestoSQL.

Let’s use the same data as in the previous text. We have the following table that describes the sales of financial products. We have the category, the name of the salesperson, and the number of products sold.

+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+

We want to create a pivot table that contains the category, and separate columns for every person. The rows will contain the sum of products sold by each person in every product category.

There Is No Pivot Function in Athena

The first problem arises because Athena and PrestoSQL don’t have a PIVOT function. We need to use weird workaround. First, we have to group all salespeople into an array and store those arrays inside a map.

SELECT category, multimap_agg(name, how_many) as agg
FROM sales
GROUP BY category

This query creates an output which should look like this:

+---------------+-----------------------------------------------+
|       category|                                           agg |
+---------------+-----------------------------------------------+
|      insurance|                   {Janusz=[0], Grażyna=[2, 3]}|
|savings account|  {Janusz=[5], Grażyna=[1, 1], Sebastian=[0, 2]|
|    credit card|                          {Sebastian=[0, 2, 1]}|
|       mortgage|                    {Janusz=[2], Sebastian=[4]}|
|   term deposit|                                 {Janusz=[4, 9]|
+---------------+-----------------------------------------------+

Now, we have to use the reduce function to calculate the sum of every person’s sales. The previous SELECT statement becomes a subquery of the following SQL:

SELECT
  category,
  reduce(agg['Grażyna'], 0.0, (s, x) -> s + x, s -> s) AS "Grażyna",
  reduce(agg['Janusz'], 0.0, (s, x) -> s + x, s -> s) AS "Janusz",
  reduce(agg['Sebastian'], 0.0, (s, x) -> s + x, s -> s) AS "Sebastian"
FROM (
	SELECT category, multimap_agg(name, how_many) as agg
	FROM sales
	GROUP BY category
)

Finally, we get the result:

+---------------+-------+------+---------+
|       category|Grażyna|Janusz|Sebastian|
+---------------+-------+------+---------+
|savings account|      2|     5|        2|
|   term deposit|   null|    13|     null|
|       mortgage|   null|     2|        4|
|    credit card|   null|  null|        3|
|      insurance|      5|     0|     null|
+---------------+-------+------+---------+

Why There Is No Pivot Function

Unfortunately, it is not possible to automatically generate the column names, and we have to list all of the values we want and use the reduce function to extract them. In PrestoSQL, the schema must be available during query planning, so the analyzer can access all the information it needs.

In the case of the pivot function, the schema depends on the data in the table, so PrestoSQL (and Athena) would have to execute half of the query, look at the available values, and rewrite the execution plan of the subsequent operations. This limitation is also the reason why the pivot function does not exist.

For a detailed explanation, please take a look at the discussion on the PrestoSQL GitHub page: https://github.com/prestosql/presto/issues/1206

Older post

What is the difference between repartition and coalesce in Apache Spark?

When should you use coalesce instead of repartition in Apache Spark

Newer post

How to use the window function to get a single row from each group in Apache Spark

How to group values by a key and extract a single row from each group in Apache Spark