How Airflow scheduler works
Do you think the Airflow scheduler is unintuitive? Have you tried to use intervals, only to find that the sole method that worked was trial and error? Does it seem like the only way to set it up correctly is to tweak the settings until, somehow, you get it right? If so, this is a blog post for you ;)
I am going to explain the Airflow scheduler and maybe, just maybe, help you understand how it works.
Imagine that Airflow is a person with a timer. Usually, we use Airflow to move data around, and recently I published summaries of the Data Janitor talks, so let’s assume that Airflow is a janitor.
We set the start_date to 9 am today and the schedule interval to “@hourly”. This means we have told Airflow to start working at 9 am and move some boxes at the beginning of every hour.
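In code, that setup looks roughly like the DAG below. This is a minimal sketch, assuming Airflow 2.x; the DAG id, the task, and the concrete dates are made up for illustration (on Airflow versions before 2.3, `EmptyOperator` would be `DummyOperator`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical DAG mirroring the janitor example:
# the janitor "comes to work" at 9 am and moves boxes every hour.
with DAG(
    dag_id="move_boxes",                    # made-up id
    start_date=datetime(2023, 1, 2, 9, 0),  # "come to work at 9 am"
    schedule_interval="@hourly",            # "move boxes every hour"
    catchup=False,
) as dag:
    move_boxes = EmptyOperator(task_id="move_boxes")
```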
Our janitor comes to work at 9 am as expected. Do you think that he/she starts working right away? Obviously not. First, the coffee has to be made. After that, there is time for watching or reading the news. No work is done between 9 am and 10 am.
What happens at 10 am? The “hourly” interval has passed, and now it is time to move some data around. The same happens at 11 am and later, every hour. At this point, my example breaks down, because a real janitor goes home at 5 pm, but Airflow is going to keep repeating the same job every hour. Continuously, even at night.
The problem is that the “start_date” parameter is counterintuitive. The job does not start at this time. So what does start?
In Airflow, start_date is the time when Airflow starts the timer. It just starts measuring time. The first run is triggered only after one full interval has elapsed, so an hourly job with a 9 am start_date runs for the first time at 10 am, and the execution date of that run is 9 am, the beginning of the interval it covers.
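The timing rule can be verified with plain datetime arithmetic. The values below are hypothetical, mirroring the janitor example (start_date at 9 am, hourly interval), but the rule itself is how Airflow schedules interval-based DAGs:

```python
from datetime import datetime, timedelta

# Hypothetical values from the example: start_date 9 am, hourly interval.
start_date = datetime(2023, 1, 2, 9, 0)
interval = timedelta(hours=1)

# The scheduler fires a run only after a full interval has elapsed,
# so the first run happens at start_date + interval (10 am)...
first_run_fires_at = start_date + interval

# ...but the execution date of that run is the START of the interval
# it covers (9 am), not the moment it fires.
first_run_execution_date = start_date

print(first_run_fires_at)        # 2023-01-02 10:00:00
print(first_run_execution_date)  # 2023-01-02 09:00:00
```

This is why the first run of an "@hourly" DAG with a 9 am start_date shows up only at 10 am: the 9 am to 10 am hour has to pass before there is an interval to process.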