Best practices about partitioning data in S3 by date

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (91/100)

In one of my projects, we have partitioned data in S3 in a way that makes everything complicated. Seriously. Writing any query that is more complicated than retrieving everything from one partition is a nightmare.

Don’t partition by year, month, and day

It seems to be a good idea to have files partitioned by three values: year, month, and day. This allows you to easily retrieve all rows assigned to a particular year or a month in a year. The object keys look so tidy. For example, you can have data stored in s3://some_bucket/some_key/year=2020/month=12/day=01. Which, seemingly, makes sense because that is the natural way of partitioning things by date. It is so perfect, except it isn’t. At all.

Creating a hierarchy of partitions and partitioning files separately by year, month, and the day is a terrible idea if you ever want to make date range queries. What if you don’t want all rows added in December 2020? What if I want rows between 2020.11.20 and 2020.12.20? Good luck with that!

What will you do? List every single date in a huge WHERE statement like this:

WHERE (year = '2020' and month = '11' and day = '20') OR 
(year = '2020' and month = '11' and day = '21') OR
(year = '2020' and month = '11' and day = '22') OR 
(year = '2020' and month = '12' and day = '18') OR 
(year = '2020' and month = '12' and day = '19') OR 
(year = '2020' and month = '12' and day = '20')

Do you know that Athena queries have a length limit? No? You will find out that soon ;)

You may try grouping predicates by the common parts of the query:

WHERE (year = '2020' and month = '11' and day in ('20', '21', '22', ...) OR (year = '2020' and month = '12' and day in ('01', '02', ...)

For sure, your query will be shorter. For sure, the code that generates that SQL statement will have tons of bugs. Don’t you love leap years? ;)

Subscribe to the newsletter and join the free email course.

The correct way to partition by date

Is there a better way? Yeah! Forget about splitting the partition into chunks. You can put the whole date (or maybe the date with time) as the partition name and have object keys that look like this: s3://some_bucket/some_key/dt=2020-12-01

If you write your dates in the ISO8601 format, you will end up with dates in alphabetical order, so all of your range queries will work correctly!

WHERE dt >= '2020-11-20' AND dt <= '2020-12-20'

Isn’t it beautiful when dates work like dates?

You may say that now it is more difficult to get every value from 2020. I agree that WHERE dt >= '2020-01-01' AND dt <= '2020-12-31' is a little longer than WHERE year = '2020', but I think that shortening the notation in a few special cases is not worth the additional effort on a daily basis.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.