How to retrieve the statuses of the recent DAG executions from Airflow database

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (23/100)

If you want to create a nice dashboard that displays the statuses of the most recent Airflow DAG runs, you will need to retrieve them from the Airflow database:

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT
    dag_run.dag_id,
    dag_run.state
FROM dag_run
INNER JOIN (SELECT
                dag_id,
                MAX(execution_date) AS date
            FROM dag_run
            GROUP BY dag_id) mx
    ON
    dag_run.dag_id = mx.dag_id
    AND dag_run.execution_date = mx.date
JOIN dag ON dag.dag_id = dag_run.dag_id AND is_active = 1 AND is_paused = 0

This query retrieves the most recent DAG run and returns its id and state. Additionally, it filters out the paused DAGs, so we don’t pollute the results with things that don’t run in production right now.

Of course, creating a dashboard of Airflow DAG runs takes some time, so you may prefer to use the project we have open-sourced: https://github.com/Wikia/discreETLy. In addition to DAG execution times and statuses, it also displays Athena tables’ last update time and their descriptions.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group