Best practices about partitioning data in S3 by date
In one of my projects, we have partitioned data in S3 in a way that makes everything complicated. Seriously. Writing any query that is more complicated than retrieving everything from one partition is a nightmare.
Don’t partition by year, month, and day
It seems to be a good idea to have files partitioned by three values: year, month, and day. This allows you to easily retrieve all rows assigned to a particular year or a month in a year. The object keys look so tidy. For example, you can have data stored in
s3://some_bucket/some_key/year=2020/month=12/day=01. Which, seemingly, makes sense because that is the natural way of partitioning things by date. It is so perfect, except it isn’t. At all.
Creating a hierarchy of partitions and partitioning files separately by year, month, and the day is a terrible idea if you ever want to make date range queries. What if you don’t want all rows added in December 2020? What if I want rows between 2020.11.20 and 2020.12.20? Good luck with that!
What will you do? List every single date in a huge WHERE statement like this:
1 2 3 4 5 6 7 WHERE (year = '2020' and month = '11' and day = '20') OR (year = '2020' and month = '11' and day = '21') OR (year = '2020' and month = '11' and day = '22') OR ... (year = '2020' and month = '12' and day = '18') OR (year = '2020' and month = '12' and day = '19') OR (year = '2020' and month = '12' and day = '20')
Do you know that Athena queries have a length limit? No? You will find out that soon ;)
You may try grouping predicates by the common parts of the query:
1 WHERE (year = '2020' and month = '11' and day in ('20', '21', '22', ...) OR (year = '2020' and month = '12' and day in ('01', '02', ...)
For sure, your query will be shorter. For sure, the code that generates that SQL statement will have tons of bugs. Don’t you love leap years? ;)
The correct way to partition by date
Is there a better way? Yeah! Forget about splitting the partition into chunks. You can put the whole date (or maybe the date with time) as the partition name and have object keys that look like this:
If you write your dates in the ISO8601 format, you will end up with dates in alphabetical order, so all of your range queries will work correctly!
1 WHERE dt >= '2020-11-20' AND dt <= '2020-12-20'
Isn’t it beautiful when dates work like dates?
You may say that now it is more difficult to get every value from 2020.
I agree that
WHERE dt >= '2020-01-01' AND dt <= '2020-12-31' is a little longer than
WHERE year = '2020', but I think that shortening the notation in a few special cases is not worth the additional effort on a daily basis.
You may also like
- How to decode base64 to text in AWS Athena
- What do you need to know about storing passwords in AWS?
- How to add dependencies to AWS lambda
- How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML
- How to get an alert if an AWS lambda does not get invoked during the last 24 hours