How to Speed Up AWS Athena Queries Using Partition Projection

Partition Projection in AWS Athena is a recently added feature that speeds up queries by defining the available partitions as a part of table configuration instead of retrieving the metadata from the Glue Data Catalog.

It makes Athena queries faster because there is no need to query the metadata catalog. Also, when partition projection is enabled, it is no longer necessary to reload partitions to rebuild the metadata catalog after storing new files in S3.

How to Create a Table with Partition Projection

The AWS documentation shows how to add Partition Projection to an existing table. In this article, I will define a new table with partition projection using the CREATE TABLE statement.

Let’s assume that I have an S3 bucket full of Parquet files stored in partitions that denote the date when the file was stored. For example, I have an S3 key which looks like this: s3://my_bucket_name/files/year=2020/month=08/day=29/f_001

We see that my files are partitioned by year, month, and day. In this case, the standard CREATE TABLE statement that uses the Glue Data Catalog to store the partition metadata looks like this:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
  -- some columns here
)
PARTITIONED BY (
  year STRING,
  month STRING,
  day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/';

Enabling Partition Projection

When I want to enable the Partition Projection feature for my_table_name table, I must TBLPROPERTIES that describes my partitioning method. I have three text columns that denote parts of a date.

In my case, the year may have values between 2015 and 2020. Therefore, its properties look like this:

projection.year.type = 'integer',
projection.year.range= '2015,2020'

The month column is a little bit more tricky because I use the leading zeros. The numeric values of months between January and September are preceded by zero. Because of that, in addition to specifying the value range, I have to configure the number of digits:

projection.month.type = 'integer',
projection.month.range= '1,12'
projection.month.digits = 2

Finally, I have to do the same with the day column. There are no tricks that allow me to have fewer day partitions in months shorter than 31 days. I must use the range of 1-31. Athena will ignore empty partitions while retrieving the data.

projection.day.type = 'integer',
projection.month.range= '1,31'
projection.month.digits = 2

The new CREATE TABLE statement that defines projected partitions looks like this:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
  -- some columns here
)
PARTITIONED BY (
  year STRING,
  month STRING,
  day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/'
TBLPROPERTIES (
    projection.enabled = true,
	projection.year.type = integer,
    projection.year.range = 2015,2020,
    projection.month.type = integer,
    projection.month.range = 1,12
    projection.month.digits = 2,
    projection.day.type = integer,
    projection.month.range = 1,31
    projection.month.digits = 2
)

Now, when I query the table in Athena, it will generate the S3 key prefix that it needs to scan without using the Glue Data Catalog.

When to Avoid Using Partition Projection

The Partition Projection feature is available only in AWS Athena. Even if a table definition contains the partition projection configuration, other tools will not use those values. For example, Apache Spark, Hive, Presto read partition metadata directly from Glue Data Catalog and do not support partition projection. When we have a table used by a service running on EMR, we MUST continue using Glue Data Catalog.

EMR and Redshift Spectrum look for the partitions in the metadata. If we add new files to the S3 location and a new partition should be created, we MUST reload the partitions. Otherwise, only AWS Athena will find the data (because of partition projection).

Adding new files and creating new partitions causes another issue. The partition projection configuration is static. If we want to change it, we must recreate the table.

If we add a new partition value outside of the range defined as a partition projection, Athena will not find those files. When we have partition projection enabled, Athena does not retrieve the metadata from Glue. Therefore, reloading the partition metadata in Glue Data Catalog will not help solve that issue. When such a problem occurs, we can use the Dynamic ID Partitioning to avoid recreating the table every time we want to add a new partition value.

Older post

How to send a customized Slack notification when an Airflow task fails

How to customize a Slack notification before sending it to the Slack incoming webhook.

Newer post

Is counting rows all we can do?

How to detect problems in data pipelines before they turn into hard to debug bugs? I wish I knew.