How to get an array/bag of elements from the Hive group by operator?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (14/100)

It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.

Imagine that I have the following table with two columns:

1
2
3
4
5
6
7
C1  C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1

and I would like to get the following result:

1
2
3
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]

In Hive, we can achieve that result when we group by the C1 column and use the the collect_list function to combine the data from the C2 column:

1
2
3
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1

Subscribe to the newsletter and join the free email course.

Removing duplicates

You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list function with the collect_set function:

1
2
3
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1

In this case, the output will look like this:

1
2
3
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.