How to select a random sample of rows using Athena
This article shows you how to use the window function and random sorting to select a random sample of rows grouped by a column.
First, we will use the window function to group the rows by a given column and order them randomly. Let’s assume that I have an Athena table called the_table
with a column called column_A
. In this case, the window function looks like this:
1
ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
I will put that window function in a subquery:
1
2
3
4
5
WITH random_order AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
FROM the_table
)
If I want to get 1000 random samples in every group, I have to select the rows with the rn
parameter equal or less than 1000:
1
2
3
4
5
6
7
8
WITH random_order AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
FROM the_table
)
SELECT *
FROM random_order
WHERE rn <= 1000
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- How to Speed Up AWS Athena Queries Using Partition Projection
- How to get a notification when a new file is uploaded to an S3 bucket
- How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML
- How to add an EMR step from AWS Lambda
- How to use AWS Batch to run a Python script

Bartosz Mikulski
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz