How to select a random sample of rows using Athena

This article shows you how to use the window function and random sorting to select a random sample of rows grouped by a column.

First, we will use the window function to group the rows by a given column and order them randomly. Let’s assume that I have an Athena table called the_table with a column called column_A. In this case, the window function looks like this:

ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn

I will put that window function in a subquery:

WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)

If I want to get 1000 random samples in every group, I have to select the rows with the rn parameter equal or less than 1000:

WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)
SELECT *
FROM random_order
WHERE rn <= 1000
Older post

Use HttpSensor to pause an Airflow DAG until a website is available

Pause an Airflow DAG until an HTTP endpoint returns 200 OK

Newer post

Copy directories in S3 using s3-dist-cp

How to copy files in S3 and preserve the directory structure