How to run Airflow DAGs for a specified date in the past?

Have you created a new Airflow DAG, but now you have to run it using every data snapshot created during the last six months? Don’t worry. You don’t need to experiment with the start_date parameter.

Backfill

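The operation of running a DAG for a specified date in the past is called “backfilling.” The Airflow command-line interface provides a convenient command to run such backfills. The commands in this article use the Airflow 1.x syntax; in Airflow 2.x, the command is spelled airflow dags backfill, and its flags use dashes instead of underscores (for example, --reset-dagruns).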

First, I have to log in to the server that is running the Airflow scheduler. If Airflow is running inside a Docker container, I have to access the command line of the container, for example like this:

docker exec -it container_id /bin/sh

To run the backfill command, I need three things: the identifier of the DAG, the start date, and the end date (note that Airflow stops one day before the end date, so the end date is not inclusive).

When I have the required information, I can run the command to start the backfill. In this case, I am running the test_dag DAG three times, with the execution date set to 2019-01-01, 2019-01-02, and 2019-01-03.

airflow backfill -s 2019-01-01 -e 2019-01-04 test_dag
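
Because the end date is not inclusive, it is easy to be off by one day. The sketch below is one possible way to derive the -e value from the last execution date I actually want to run. It assumes GNU date is available on the server, and the variable names last_run_date and end_date are made up for this example. It only prints the resulting command, so I can check it before running it.

```shell
# Last execution date that should actually run (inclusive).
last_run_date="2019-01-03"

# Following the note above, -e gets the day after the last wanted run.
# Assumes GNU date; -u keeps the calculation in UTC.
end_date="$(date -u -d "${last_run_date} + 1 day" +%F)"

# Print the command instead of running it, so it can be reviewed first.
echo "airflow backfill -s 2019-01-01 -e ${end_date} test_dag"
```

If the printed command looks right, I can copy it to the terminal and run it.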


How about re-running completed DAG runs?

The backfill command does not re-run completed DAG runs within the given period unless we explicitly instruct it to do so. Therefore, if there was already a DAG run for 2019-01-02 and I would like to repeat it, I have to add --reset_dagruns to the airflow backfill command.

airflow backfill -s 2019-01-01 -e 2019-01-04 --reset_dagruns test_dag

Important note about SSH

Airflow backfill does not schedule all DAG runs at once! It starts the first one, waits until it finishes, and then schedules the next one.

Because of that, I need to keep an active SSH connection to the Airflow server until the backfill schedules the last DAG run. If I got disconnected before Airflow had scheduled the last DAG run, it would finish the currently running one and never schedule the remaining ones.

To avoid problems in case of a lost internet connection, I suggest using the screen application to start a durable terminal session on the Airflow server and run the backfill command inside the screen session.
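
A minimal sketch of that setup could look like the snippet below. It assumes that screen is installed on the Airflow server. The function name start_backfill_in_screen and the DRY_RUN switch are made up for this example; they are not part of Airflow or screen. The function starts the backfill in a detached screen session, so a dropped SSH connection does not stop it.

```shell
# Hypothetical helper: run "airflow backfill" inside a detached screen session.
# First argument: the screen session name; the rest: arguments for airflow backfill.
# With DRY_RUN=1 the function only prints the command it would run.
start_backfill_in_screen() {
    session="$1"
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "screen -dmS $session airflow backfill $*"
    else
        # -d -m starts the session detached, -S gives it a name.
        screen -dmS "$session" airflow backfill "$@"
    fi
}

# Usage (not executed here):
#   start_backfill_in_screen backfill -s 2019-01-01 -e 2019-01-04 test_dag
#   screen -ls            # list the running sessions
#   screen -r backfill    # reattach; press Ctrl-A, then D to detach again
```

A session started this way closes when the backfill command exits, so the output disappears with it. If I want to read the final messages, I can start an interactive session with screen -S backfill instead and type the backfill command inside it.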




Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group