Use the ROW_NUMBER() function to get top rows by partition in Hive

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (53/100)

Imagine that I have a SQL query that calculates the number of rows grouped by some column:

1
2
3
SELECT column_A, count(*) as cnt
FROM table_name
GROUP BY column_A

I want to get the top 10 rows in every group. To do this, I have to use the query above as a subquery and use the ROW_NUMBER() function and a window function. The window function will group the rows by the column_A and order them by the number of rows:

1
2
3
4
5
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
    SELECT column_A, count(*) as cnt
    FROM table_name
    GROUP BY column_A
) t

To get only the top 10 rows, I am going to put all of that in another subquery and use WHERE to select only top values:

1
2
3
4
5
6
7
8
SELECT t2.column_A as column_A, t2.cnt as cnt FROM (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
        SELECT column_A, count(*) as cnt
        FROM table_name
        GROUP BY column_A
        ) t
    ) t2
WHERE t2.rank <= 10;

Subscribe to the newsletter and join the free email course.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.

Do you want to work with me at riskmethods?

REMOTE position (available in Poland or Germany)