How to select a random sample of rows using Athena
This article shows you how to use the window function and random sorting to select a random sample of rows grouped by a column.
First, we will use the window function to group the rows by a given column and order them randomly. Let’s assume that I have an Athena table called
the_table with a column called
column_A. In this case, the window function looks like this:
1 ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
I will put that window function in a subquery:
1 2 3 4 5 WITH random_order AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn FROM the_table )
If I want to get 1000 random samples in every group, I have to select the rows with the
rn parameter equal or less than 1000:
1 2 3 4 5 6 7 8 WITH random_order AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn FROM the_table ) SELECT * FROM random_order WHERE rn <= 1000
You may also like
- What do you need to know about storing passwords in AWS?
- How to temporarily disable an AWS Lambda function using AWS CLI without removing the function
- How to check when an Athena table was updated
- What is s3:TestEvent, and why does it break my event processing?
- How to use AWS Batch to run a Python script