Best practices about partitioning data in S3 by date
In one of my projects, we have partitioned data in S3 in a way that makes everything complicated. Seriously. Writing any query that is more complicated than retrieving everything from one partition is a nightmare.
Don’t partition by year, month, and day
It seems to be a good idea to have files partitioned by three values: year, month, and day. This allows you to easily retrieve all rows assigned to a particular year or a month in a year. The object keys look so tidy. For example, you can have data stored in
s3://some_bucket/some_key/year=2020/month=12/day=01. Which, seemingly, makes sense because that is the natural way of partitioning things by date. It is so perfect, except it isn’t. At all.
Creating a hierarchy of partitions and partitioning files separately by year, month, and the day is a terrible idea if you ever want to make date range queries. What if you don’t want all rows added in December 2020? What if I want rows between 2020.11.20 and 2020.12.20? Good luck with that!
What will you do? List every single date in a huge WHERE statement like this:
1 2 3 4 5 6 7 WHERE (year = '2020' and month = '11' and day = '20') OR (year = '2020' and month = '11' and day = '21') OR (year = '2020' and month = '11' and day = '22') OR ... (year = '2020' and month = '12' and day = '18') OR (year = '2020' and month = '12' and day = '19') OR (year = '2020' and month = '12' and day = '20')
Do you know that Athena queries have a length limit? No? You will find out that soon ;)
You may try grouping predicates by the common parts of the query:
1 WHERE (year = '2020' and month = '11' and day in ('20', '21', '22', ...) OR (year = '2020' and month = '12' and day in ('01', '02', ...)
For sure, your query will be shorter. For sure, the code that generates that SQL statement will have tons of bugs. Don’t you love leap years? ;)
The correct way to partition by date
Is there a better way? Yeah! Forget about splitting the partition into chunks. You can put the whole date (or maybe the date with time) as the partition name and have object keys that look like this:
If you write your dates in the ISO8601 format, you will end up with dates in alphabetical order, so all of your range queries will work correctly!
1 WHERE dt >= '2020-11-20' AND dt <= '2020-12-20'
Isn’t it beautiful when dates work like dates?
You may say that now it is more difficult to get every value from 2020.
I agree that
WHERE dt >= '2020-01-01' AND dt <= '2020-12-31' is a little longer than
WHERE year = '2020', but I think that shortening the notation in a few special cases is not worth the additional effort on a daily basis.
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz