How to get an array/bag of elements from the Hive group by operator?
It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.
Imagine that I have the following table with two columns:
1 2 3 4 5 6 7 C1 C2 AAA AA1 AAA AA2 AAA AA2 BBB BB1 BBB BB2 CCC CC1
and I would like to get the following result:
1 2 3 AAA [AA1, AA2, AA2] BBB [BB1, BB2] CCC [CC1]
In Hive, we can achieve that result when we group by the C1 column and use the the
collect_list function to combine the data from the C2 column:
1 2 3 SELECT C1, collect_list(C2) FROM some_table GROUP BY C1
You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the
collect_list function with the
1 2 3 SELECT C1, collect_set(C2) FROM some_table GROUP BY C1
In this case, the output will look like this:
1 2 3 AAA [AA1, AA2] BBB [BB1, BB2] CCC [CC1]