How to find the Hive partition closest to a given date
Airflow has a built-in function that finds the Hive partition closest to a given date. However, it works only with partition identifiers in the YYYY-mm-dd format, so it will not help if you use a different partitioning scheme.
To find the closest Hive partition, use the closest_ds_partition function:
from airflow.macros.hive import closest_ds_partition

closest_ds_partition(
    hive_table_name,
    the_date,
    before=True,
    schema='hive_schema',
    metastore_conn_id='metastore_connection_id'
)
Be careful with the before parameter because its behavior is unusual. As you may expect, True returns the closest partition before the given date, and False returns the closest partition after it. When before is set to None, however, the function returns the closest partition regardless of whether it is before or after the given date.
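The three cases are easier to see on plain dates. The snippet below is only an illustrative sketch, not Airflow's implementation: closest_date is a hypothetical helper that works on a list of date objects instead of querying the metastore, and it assumes that a partition equal to the given date counts as a match in either direction.

```python
from datetime import date


def closest_date(target, candidates, before=True):
    """Hypothetical stand-in that mimics the three modes of `before`."""
    if before is None:
        # Nearest date in either direction.
        return min(candidates, key=lambda d: abs(d - target))
    if before:
        # Latest date that is not after the target.
        eligible = [d for d in candidates if d <= target]
        return max(eligible) if eligible else None
    # Earliest date that is not before the target.
    eligible = [d for d in candidates if d >= target]
    return min(eligible) if eligible else None


partitions = [date(2020, 1, 1), date(2020, 1, 10), date(2020, 1, 12)]
target = date(2020, 1, 3)

print(closest_date(target, partitions, before=True))   # 2020-01-01
print(closest_date(target, partitions, before=False))  # 2020-01-10
print(closest_date(target, partitions, before=None))   # 2020-01-01
```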
Please don’t follow this coding practice. Three-valued “boolean” logic is a terrible, terrible idea. It is much better to use an enum with descriptive names.