How to write Hive queries with column position number in the GROUP BY or ORDER BY clauses

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (79/100)

By default, Hive does not support passing the columns to GROUP BY or ORDER BY using their positions instead of names. Therefore, when I try to run this query:

SELECT column_A, count(*) FROM table_name GROUP BY 1

I will see an error message telling me that the column_A is not in the GROUP BY expression.

To get the behavior I want, I have to enable the position alias usage before running my query. I can do that by setting the hive.groupby.orderby.position.alias property to true:

SET hive.groupby.orderby.position.alias=true;

SELECT column_A, count(*) FROM table_name GROUP BY 1

Now, Hive is going to correctly recoginize that I want to group by the first column, lookup its name and use that to execute the query.

Subscribe to the newsletter and join the free email course.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.