In the past, a team I worked with suddenly realized that it had a massive problem with its data pipeline.

It was not throwing any exceptions, but the ETL was not processing any data. Moreover, people were not using that data at that time, so the team learned about the problem almost two weeks too late.

It is a stressful time at work when you have to fix the pipeline, backfill some data, and reprocess the reports generated from the invalid values.

Observability with AWS CloudWatch

I would love to avoid repeating that. That’s why I am glad that, in my current team, we invest some time in making our code easy to monitor. Because of that, we add tracking to every ETL we write and configure more and more alerts.

Doing so is easy when you are using only AWS services, but we would love to have the same level of observability when we are running our custom code.

Such a need arises not only in Airflow DAGs but also when running maintenance scripts on our laptops. After all, if a script somehow impacts the production data, it should be monitored.

Writing to CloudWatch

To write metrics to CloudWatch from Python code, we first have to create an instance of the CloudWatch client. For that, we must import the boto library and write the following code. Note that the first example is for boto 2.49.0, and the second example runs with boto3.

# boto2
import boto

client = boto.connect_cloudwatch()

# boto3
import boto3

client = boto3.client('cloudwatch')

Now, it is time to write the metrics. Fortunately, it is quite straightforward. All we need to do is call the put_metric_data function.

# boto2
client.put_metric_data(
    namespace='Name of the Airflow DAG',
    name='number_of_downloaded_rows',
    value=number_of_rows,
    unit='Count',
    dimensions={'source_table': mysql_table_name, 'output_bucket': output_s3_bucket}
)

With boto3, the code looks like this:

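# boto3
client.put_metric_data(
    Namespace='Name of the Airflow DAG',
    MetricData=[
        {
            'MetricName': 'number_of_downloaded_rows',
            # boto3 expects each dimension as a Name/Value pair
            'Dimensions': [
                {'Name': 'source_table', 'Value': mysql_table_name},
                {'Name': 'output_bucket', 'Value': output_s3_bucket}
            ],
            'Value': number_of_rows,
            'Unit': 'Count'
        }
    ]
)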

How to name the metrics

As you have probably noticed, my metrics have quite a detailed description. I specify not only the namespace and the metric name but also the dimensions.

I am doing it because I want to build a naming hierarchy, which starts with the name of the business process as the namespace. After that, I use the metric name to record the event that produced the value.

There are no real rules regarding the metric name. For example, I use the name “invalid_not_a_number” to record the number of records that failed the validation because they are not numbers, “number_of_input_items” to store the number of items available at the beginning of the pipeline, “number_of_output_items” to save the result count, etc.

When that is not enough, additional labels may be added using dimensions. This gives me the ability to store the output S3 bucket name, which is quite useful when there are multiple outputs.
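The hierarchy described above (namespace, then metric name, then optional dimensions) can be sketched as a small helper that builds the arguments for boto3's put_metric_data. The helper name build_metric_payload, the bucket name, and the row count are hypothetical and serve only as an illustration; the payload keys follow the boto3 CloudWatch API.

```python
def build_metric_payload(namespace, metric_name, value, dimensions=None, unit='Count'):
    """Build keyword arguments for boto3's CloudWatch put_metric_data call."""
    metric = {
        'MetricName': metric_name,
        'Value': value,
        'Unit': unit,
    }
    if dimensions:
        # CloudWatch expects every dimension as a {'Name': ..., 'Value': ...} pair.
        metric['Dimensions'] = [
            {'Name': name, 'Value': str(dim_value)}
            for name, dim_value in sorted(dimensions.items())
        ]
    return {'Namespace': namespace, 'MetricData': [metric]}


# Hypothetical values, used only to show the shape of the payload.
payload = build_metric_payload(
    namespace='Name of the Airflow DAG',
    metric_name='number_of_output_items',
    value=1250,
    dimensions={'output_bucket': 'example-output-bucket'},
)

# With AWS credentials configured, the payload could be sent like this:
# import boto3
# boto3.client('cloudwatch').put_metric_data(**payload)
```

Keeping the payload construction separate from the network call makes the naming convention easy to check in a unit test without touching AWS.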

What to do with the metrics?

There is no benefit in collecting a vast number of metrics if nobody is looking at them. The only worse situation is collecting many metrics when you care about only one of them.

In general, we should store only actionable metrics and metrics that help with debugging.

A metric is actionable when we can configure an alert that sends a Slack notification when the parameter is out of the expected range (it does not matter if we use a constant as a threshold or build a machine learning model to detect anomalies).

Of course, this works only when there are no false positives, and we don’t receive the alerts too often. “When you are on red alert every day, the red alert means nothing.”
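As an illustration of a constant-threshold alert, here is a minimal sketch that uses boto3's put_metric_alarm. It assumes that the Slack notification is delivered through an SNS topic that something else forwards to Slack; the function name, the alarm name, and the topic ARN are hypothetical. The one-day period and the decision to treat missing data as a breach are my assumptions for a daily ETL, not a recommendation for every pipeline.

```python
def create_missing_output_alarm(client, alarm_name, namespace, metric_name,
                                dimensions, sns_topic_arn, threshold=0):
    """Create an alarm that fires when the daily sum of a metric is at or below the threshold."""
    alarm = {
        'AlarmName': alarm_name,
        'Namespace': namespace,
        'MetricName': metric_name,
        # The dimensions must match the ones used when the metric was written,
        # because CloudWatch treats every dimension combination as a separate metric.
        'Dimensions': [
            {'Name': name, 'Value': value}
            for name, value in sorted(dimensions.items())
        ],
        'Statistic': 'Sum',
        'Period': 86400,  # one day, in seconds
        'EvaluationPeriods': 1,
        'Threshold': threshold,
        'ComparisonOperator': 'LessThanOrEqualToThreshold',
        # No data at all also means that the ETL produced nothing.
        'TreatMissingData': 'breaching',
        'AlarmActions': [sns_topic_arn],
    }
    client.put_metric_alarm(**alarm)
    return alarm


# Usage with a real client (requires AWS credentials; all values are hypothetical):
# import boto3
# create_missing_output_alarm(
#     boto3.client('cloudwatch'),
#     alarm_name='etl-no-output-items',
#     namespace='Name of the Airflow DAG',
#     metric_name='number_of_output_items',
#     dimensions={'output_bucket': 'example-output-bucket'},
#     sns_topic_arn='arn:aws:sns:eu-west-1:123456789012:example-alerts',
# )
```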

In addition to actionable metrics, I also have metrics that help with debugging. When I get an alert telling me that there were no output results of some ETL, I am going to look at the metrics which track the input data and the state of intermediate processing steps.

Such debugging metrics are a tremendous help when we are looking for the part of the ETL that caused the problem.
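To show what such debugging metrics might look like in code, here is a minimal sketch that counts items at every step of a toy pipeline and prepares a single put_metric_data payload. The StepMetrics class, the is_number check, and the sample rows are hypothetical; the metric names are the ones mentioned earlier in this post.

```python
class StepMetrics:
    """Collect item counts per pipeline step and publish them in one call."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = {}

    def record(self, metric_name, value):
        # Add to the running total, so a step may report in several batches.
        self.counts[metric_name] = self.counts.get(metric_name, 0) + value

    def to_payload(self):
        # Keyword arguments for boto3's CloudWatch put_metric_data.
        return {
            'Namespace': self.namespace,
            'MetricData': [
                {'MetricName': name, 'Value': value, 'Unit': 'Count'}
                for name, value in self.counts.items()
            ],
        }


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Toy input data standing in for the rows downloaded by the ETL.
raw_rows = ['1', '2', 'abc', '4.5']

metrics = StepMetrics('Name of the Airflow DAG')
metrics.record('number_of_input_items', len(raw_rows))

valid_rows = [row for row in raw_rows if is_number(row)]
metrics.record('invalid_not_a_number', len(raw_rows) - len(valid_rows))
metrics.record('number_of_output_items', len(valid_rows))

payload = metrics.to_payload()
# With AWS credentials configured:
# import boto3
# boto3.client('cloudwatch').put_metric_data(**payload)
```

When the alert on number_of_output_items fires, comparing it with number_of_input_items and invalid_not_a_number shows whether the input was empty or the validation step rejected everything.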
