How the Airflow scheduler works

Do you think that the Airflow scheduler is unintuitive? Have you tried to use intervals and found that the only method that worked was “trial and error”? Does it look like the only way to set it up correctly is to tweak the settings until you somehow get it right? If yes, this is the blog post for you ;)

I am going to explain the Airflow scheduler and maybe, just maybe, help you understand how it works.

Example

Imagine that Airflow is a person with a timer. Usually, we use Airflow to move data around, and recently I published summaries of the Data Janitor talks, so let’s assume that Airflow is a janitor.

We set the start_date to 9 am today and the interval to “@hourly”. It means that we have told Airflow to start working at 9 am and move some boxes at the beginning of every hour.

Our janitor comes to work at 9 am as expected. Do you think that he/she starts working right away? Obviously not. First, the coffee has to be made. After that, there is time for watching or reading the news. No work is done between 9 am and 10 am.

What happens at 10 am? The “hourly” interval has passed, and now it is time to move some data around. The same happens at 11 am and at every full hour after that. At this point, my example breaks down because the real janitor goes home at 5 pm, but Airflow is going to continue working and repeat the same job every hour. Continuously, even at night.
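The janitor's timetable can be sketched in a few lines of plain Python. This is only a simplified model of the behavior described above, not Airflow's actual scheduler code: the function name trigger_times and the example dates are made up for illustration, and the real scheduler has more moving parts than this.

```python
from datetime import datetime, timedelta


def trigger_times(start_date, interval, until):
    """Return the moments at which a run is triggered, up to and including `until`.

    Hypothetical helper for illustration only; it is not part of Airflow.
    """
    times = []
    # Nothing happens at start_date itself: the first run fires
    # only after one full interval has passed.
    next_trigger = start_date + interval
    while next_trigger <= until:
        times.append(next_trigger)
        next_trigger += interval
    return times


start = datetime(2024, 1, 1, 9, 0)      # "9 am today" (arbitrary example date)
midnight = datetime(2024, 1, 2, 0, 0)   # end of the same day
runs = trigger_times(start, timedelta(hours=1), midnight)

print(runs[0])    # 2024-01-01 10:00:00
print(len(runs))  # 15
```

The first entry is 10 am, not 9 am, and the list keeps going past 5 pm because nothing tells the timer to stop.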

Start date

The problem is that the “start_date” parameter is counterintuitive. The job does not start at this time. So what starts?

In Airflow, start_date is the moment when Airflow starts the timer. It just starts measuring the time, and the first run happens only after one full interval has passed.
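One way to picture the timer is to count how many full intervals have elapsed since start_date. The sketch below uses only the standard library; completed_intervals is a hypothetical helper written for this post, not an Airflow function, and it assumes a fixed-length interval such as one hour.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, interval, now):
    """Count the full intervals that have elapsed between start_date and now.

    Illustrative helper, not an Airflow API. A run becomes due each time
    this counter goes up by one.
    """
    if now < start_date:
        return 0  # the timer has not been started yet
    # Dividing one timedelta by another with // gives a whole number.
    return (now - start_date) // interval


start = datetime(2024, 1, 1, 9, 0)  # arbitrary example date
hour = timedelta(hours=1)

print(completed_intervals(start, hour, datetime(2024, 1, 1, 9, 0)))    # 0
print(completed_intervals(start, hour, datetime(2024, 1, 1, 9, 59)))   # 0
print(completed_intervals(start, hour, datetime(2024, 1, 1, 10, 0)))   # 1
print(completed_intervals(start, hour, datetime(2024, 1, 1, 11, 30)))  # 2
```

At 9 am the counter is still zero, which is why no work is done at start_date itself.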
