Use the ROW_NUMBER() function to get top rows by partition in Hive

Imagine that I have a SQL query that calculates the number of rows grouped by some column:

SELECT column_A, count(*) as cnt
FROM table_name
GROUP BY column_A

I want to get the top 10 rows in every group. To do this, I have to use the query above as a subquery and use the ROW_NUMBER() function and a window function. The window function will group the rows by the column_A and order them by the number of rows:

SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
    SELECT column_A, count(*) as cnt
    FROM table_name
    GROUP BY column_A
) t

To get only the top 10 rows, I am going to put all of that in another subquery and use WHERE to select only top values:

SELECT t2.column_A as column_A, t2.cnt as cnt FROM (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
        SELECT column_A, count(*) as cnt
        FROM table_name
        GROUP BY column_A
        ) t
    ) t2
WHERE t2.rank <= 10;
Older post

How to configure both core and spot instances in EMR using Terraform

Use EMR instance group to add spot instances to an EMR cluster

Newer post

Run a command on a remote server using SSH in Airflow

How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.