Best practices about partitioning data in S3 by date

In one of my projects, we partitioned data in S3 in a way that makes everything complicated. Seriously. Writing any query more complicated than retrieving everything from a single partition is a nightmare.

Don’t partition by year, month, and day

It seems like a good idea to have files partitioned by three values: year, month, and day. This allows you to easily retrieve all rows assigned to a particular year or a month in a year. The object keys look so tidy. For example, you can have data stored in s3://some_bucket/some_key/year=2020/month=12/day=01. It seems to make sense because that is the natural way of partitioning things by date. It is so perfect, except it isn’t. At all.

Creating a hierarchy of partitions and partitioning files separately by year, month, and day is a terrible idea if you ever want to make date range queries. What if you don’t want all rows added in December 2020, but rows between 2020-11-20 and 2020-12-20? Good luck with that!

What will you do? List every single date in a huge WHERE statement like this:

WHERE (year = '2020' AND month = '11' AND day = '20') OR
(year = '2020' AND month = '11' AND day = '21') OR
(year = '2020' AND month = '11' AND day = '22') OR
...
(year = '2020' AND month = '12' AND day = '18') OR
(year = '2020' AND month = '12' AND day = '19') OR
(year = '2020' AND month = '12' AND day = '20')

Do you know that Athena queries have a length limit (262,144 bytes)? No? You will find that out soon ;)
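
Just to show how much ceremony this layout forces on you, here is a minimal Python sketch of the generator you would need to build that predicate (the function name and the sample range are mine, not from any real project):

from datetime import date, timedelta

def day_predicates(start: date, end: date) -> str:
    # Emit one (year, month, day) clause per day in the range.
    parts = []
    current = start
    while current <= end:
        parts.append(
            f"(year = '{current:%Y}' AND month = '{current:%m}' AND day = '{current:%d}')"
        )
        current += timedelta(days=1)
    return " OR\n".join(parts)

print(day_predicates(date(2020, 11, 20), date(2020, 12, 20)))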

You may try grouping predicates by the common parts of the query:

WHERE (year = '2020' AND month = '11' AND day IN ('20', '21', '22', ...)) OR (year = '2020' AND month = '12' AND day IN ('01', '02', ...))

For sure, your query will be shorter. Just as surely, the code that generates that SQL statement will have tons of bugs. Don’t you love leap years? ;)
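
If you are curious what that generator might look like, here is a sketch (again, my own names, not production code) that groups the days by year and month. Note how much machinery a simple date range already requires:

from datetime import date, timedelta
from itertools import groupby

def grouped_predicates(start: date, end: date) -> str:
    # Materialize every day in the range, then group consecutive days by (year, month).
    days = []
    current = start
    while current <= end:
        days.append(current)
        current += timedelta(days=1)
    clauses = []
    for (year, month), group in groupby(days, key=lambda d: (d.year, d.month)):
        day_list = ", ".join(f"'{d:%d}'" for d in group)
        clauses.append(f"(year = '{year}' AND month = '{month:02d}' AND day IN ({day_list}))")
    return " OR ".join(clauses)

And this version still leans on the standard library to handle month lengths and leap years for you.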

The correct way to partition by date

Is there a better way? Yeah! Forget about splitting the date into a hierarchy of partitions. You can put the whole date (or the date with time) into a single partition and have object keys that look like this: s3://some_bucket/some_key/dt=2020-12-01
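
For example, if you write your data with Spark, a minimal sketch could look like the one below (I am assuming a created_at timestamp column and made-up bucket names):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source location and schema.
df = spark.read.parquet("s3://some_bucket/raw/")

# Derive a single ISO 8601 date column instead of separate year/month/day columns.
df = df.withColumn("dt", F.date_format("created_at", "yyyy-MM-dd"))

# Every distinct dt value becomes one s3://some_bucket/some_key/dt=YYYY-MM-DD/ prefix.
df.write.partitionBy("dt").mode("append").parquet("s3://some_bucket/some_key/")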

If you write your dates in the ISO 8601 format, their alphabetical (lexicographic) order matches their chronological order, so all of your range queries will work correctly!

WHERE dt >= '2020-11-20' AND dt <= '2020-12-20'
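
You can convince yourself of that property in any language that compares strings lexicographically; in Python:

# ISO 8601 strings sort exactly like the dates they represent.
assert "2020-11-20" <= "2020-12-01" <= "2020-12-20"
assert sorted(["2020-12-01", "2020-11-20", "2020-02-29"]) == [
    "2020-02-29", "2020-11-20", "2020-12-01"
]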

Isn’t it beautiful when dates work like dates?

You may say that now it is more difficult to get every value from 2020. I agree that WHERE dt >= '2020-01-01' AND dt <= '2020-12-31' is a little longer than WHERE year = '2020', but shortening the notation in a few special cases is not worth the extra effort you pay on a daily basis.
