How to get an array/bag of elements from the Hive group by operator?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (14/100)

It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.

Imagine that I have the following table with two columns:

1
2
3
4
5
6
7
C1  C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1

and I would like to get the following result:

1
2
3
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]

In Hive, we can achieve that result when we group by the C1 column and use the the collect_list function to combine the data from the C2 column:

1
2
3
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1


Removing duplicates

You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list function with the collect_set function:

1
2
3
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1

In this case, the output will look like this:

1
2
3
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group