How to find the Hive partition closest to a given date
Airflow has a built-in function that finds the Hive partition closest to a given date. However, it works only with partition identifiers in the YYYY-MM-DD format, so if you partition your tables differently, this function will not help you.
To find the closest Hive partition, we should use the `closest_ds_partition` function:
```python
from airflow.macros.hive import closest_ds_partition

closest_ds_partition(
    hive_table_name,
    the_date,
    before=True,
    schema='hive_schema',
    metastore_conn_id='metastore_connection_id'
)
```
Be careful with the `before` parameter. Its behavior is surprising. As you may expect, `True` returns the closest partition before the given date and `False` returns the closest partition after it, but when `before` is set to `None`, the function returns the closest partition regardless of whether it falls before or after the given date.
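To make the three modes concrete, here is a minimal pure-Python sketch of the selection rule described above. It is not Airflow's implementation: the `closest_date` helper and the sample dates are hypothetical, and returning `None` when no partition lies on the requested side is a choice made for this sketch only.

```python
from datetime import date


def closest_date(target, candidates, before=True):
    """Illustrative three-mode lookup; not the Airflow source code."""
    if before is None:
        # Either side is acceptable: pick the smallest absolute distance.
        eligible = list(candidates)
    elif before:
        # Only dates on or before the target qualify.
        eligible = [d for d in candidates if d <= target]
    else:
        # Only dates on or after the target qualify.
        eligible = [d for d in candidates if d >= target]
    # Assumption of this sketch: return None when nothing qualifies.
    return min(eligible, key=lambda d: abs(d - target), default=None)


partitions = [date(2020, 1, 1), date(2020, 1, 7), date(2020, 1, 12)]
target = date(2020, 1, 8)

print(closest_date(target, partitions, before=True))
print(closest_date(target, partitions, before=False))
print(closest_date(target, partitions, before=None))
```

With these sample dates, `before=True` and `before=None` both select 2020-01-07, while `before=False` selects 2020-01-12.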
Please don’t follow this coding practice. Three-valued “boolean” logic is a terrible, terrible idea. It is much better to use an enum with descriptive names.
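As a rough illustration of the enum alternative, the sketch below defines descriptive names and translates them into the value that the `before` parameter expects. `PartitionSide` and `to_before_flag` are hypothetical names invented for this example; they are not part of Airflow.

```python
from enum import Enum


class PartitionSide(Enum):
    """Hypothetical replacement for the True/False/None flag."""
    BEFORE = "before"
    AFTER = "after"
    NEAREST = "nearest"


# Mapping from the descriptive names to the values of the `before` parameter.
_BEFORE_FLAG = {
    PartitionSide.BEFORE: True,
    PartitionSide.AFTER: False,
    PartitionSide.NEAREST: None,
}


def to_before_flag(side):
    """Translate a PartitionSide member into the three-valued flag."""
    if not isinstance(side, PartitionSide):
        raise TypeError(f"expected PartitionSide, got {type(side).__name__}")
    return _BEFORE_FLAG[side]
```

A call site could then pass `before=to_before_flag(PartitionSide.NEAREST)` to `closest_ds_partition`, which states the intent explicitly instead of relying on a bare `None`.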