How to select a random sample of rows using Athena

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (59/100)

This article shows you how to use the window function and random sorting to select a random sample of rows grouped by a column.

First, we will use the window function to group the rows by a given column and order them randomly. Let’s assume that I have an Athena table called the_table with a column called column_A. In this case, the window function looks like this:

1
ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn

I will put that window function in a subquery:

1
2
3
4
5
WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)

If I want to get 1000 random samples in every group, I have to select the rows with the rn parameter equal or less than 1000:

1
2
3
4
5
6
7
8
WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)
SELECT *
FROM random_order
WHERE rn <= 1000

Subscribe to the newsletter and join the free email course.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.

Do you want to work with me at riskmethods?

REMOTE position (available in Poland or Germany)