Is counting rows all we can do?

Over the weekend, I was reading the “Building LinkedIn’s Real-Time Activity Data Pipeline” paper. In it, the authors describe how they switched from batch-oriented processing to streams. Disappointed with the existing message queues, they built the first version of Kafka. The paper gives an overview of Kafka’s components and explains how LinkedIn built a data engineering infrastructure that feeds data from Kafka into Hadoop.

For me, the most interesting part of the paper was the “Operational Considerations” section, which starts with the sentence: “One of the most important considerations in bringing a system to production is the need to fully monitor its correctness.” Yeah! Exactly!

Now, imagine my disappointment when I read that the only thing their monitoring system does is… counting messages. The team at LinkedIn built an audit trail implemented as a separate Kafka topic. Every application in the pipeline sends a report to the audit trail every ten minutes (the intervals use message timestamps, not the system clock). The report contains the number of messages processed for each topic. A consumer of the audit trail then looks for missing messages (output count smaller than input) and duplicates (output count larger than input). The authors claim they achieved 99.9999% correctness; most of their problems were caused by producers that crashed and dropped unsent messages.
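The paper does not include any code, but the mechanism is simple enough to sketch. Below is a minimal, assumed version of the count comparison in Python. The report format and the tier names are my guesses, not LinkedIn’s actual implementation:

```python
from collections import defaultdict

# A sketch of the audit idea, not LinkedIn's actual code. Each audit report is
# assumed to look like:
# {"tier": "producer", "topic": "page-views", "window": "2021-07-12T10:00", "count": 4213}

def find_discrepancies(audit_reports):
    """Aggregate per-tier counts for every (topic, window) and flag mismatches."""
    counts = defaultdict(lambda: defaultdict(int))
    for report in audit_reports:
        key = (report["topic"], report["window"])
        counts[key][report["tier"]] += report["count"]

    for (topic, window), per_tier in counts.items():
        produced = per_tier.get("producer", 0)
        consumed = per_tier.get("consumer", 0)
        if consumed < produced:
            yield topic, window, "missing messages", produced - consumed
        elif consumed > produced:
            yield topic, window, "duplicate messages", consumed - produced
```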

I’m not so lucky. Yesterday, I was fixing a bug in the code that was supposed to cast data between formats. We ended up with one value repeated in every row. A simple check comparing the histogram of input values with the histogram of output values could have detected the bug. Unfortunately, such a check was never written, and we will probably never write it because there are more important things to do.
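The check itself would be trivial to write. Here is a minimal sketch, assuming the input and output columns are available as pandas Series; the tolerance is an arbitrary choice:

```python
import pandas as pd

def distributions_differ(input_col: pd.Series, output_col: pd.Series,
                         tolerance: float = 0.05) -> bool:
    """Compare the shapes of the input and output value histograms.

    Labels may legitimately change when casting between formats, so we compare
    the sorted relative frequencies instead of the raw values. A bug that
    repeats one value in every row collapses the output histogram to a single
    bar, which this check catches immediately.
    """
    input_freq = input_col.value_counts(normalize=True).sort_values(ascending=False)
    output_freq = output_col.value_counts(normalize=True).sort_values(ascending=False)
    if len(input_freq) != len(output_freq):
        return True
    return bool((abs(input_freq.to_numpy() - output_freq.to_numpy()) > tolerance).any())
```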

Today, I spotted that something sends browser user agents instead of users’ locations. The only thing I can do now is drop that value and pretend it never existed. This time, we lacked input validation at the ingestion stage of the pipeline.
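A validation rule that would have caught this is not complicated either. A minimal sketch, assuming JSON-like records with a “location” field; the pattern and the rejection handling are made up for illustration:

```python
import re
from typing import Optional

# A browser user agent is easy to tell apart from a location value.
# The pattern is an assumption; tune it to whatever your producers actually send.
USER_AGENT_PATTERN = re.compile(r"Mozilla|AppleWebKit|Chrome|Safari|Gecko")

def validate_location(record: dict, rejected: list) -> Optional[dict]:
    """Drop records whose 'location' field looks like a browser user agent."""
    location = str(record.get("location", ""))
    if USER_AGENT_PATTERN.search(location):
        rejected.append(record)
        return None
    return record
```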

The worst part is that those small, insignificant problems waste huge amounts of time. Fixing them is always trivial, but debugging and finding the cause takes hours. We could spend those hours doing something else. Is it time to look at contract testing and apply that technique to data engineering? The problem is that, from the perspective of data producers, we are the consumers just because we copy data from their databases or eavesdrop on their message queues. Will we write consumer-driven contracts for every service that generates data we may want to copy?
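I don’t know whether we will, but such a contract would not need much: the consumer declares the fields and types it depends on, and the check runs wherever we copy the data. A toy sketch, with a schema invented purely for illustration:

```python
# A toy data contract: the fields and types this consumer relies on.
# The field names are invented for illustration.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event_time": str,
    "location": str,
}

def contract_violations(record: dict) -> list:
    """Return the list of contract violations found in a single record."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"{field} has type {type(record[field]).__name__}, "
                              f"expected {expected_type.__name__}")
    return violations
```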

In the past, I have already written about why excessive data validation and standardization would turn data engineers into a bottleneck that prevents the whole organization from making any progress. I explained why the reactive approach is the only acceptable one. However, none of that means we cannot separate useful data from noise. We should do that! We should also create dashboards that show how much data was rejected by our checks. We may even propagate that information to the teams responsible for producing the data.
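The dashboard part can start as nothing more than a counter of rejected records per source, exported to whatever metrics system we already have. A sketch, with the metric layout assumed:

```python
from collections import Counter

# Rejected record counts per data source; in practice this would be exported
# to the metrics system behind the dashboard.
rejected_per_source = Counter()

def keep_valid(records, source: str, is_valid) -> list:
    """Keep valid records and count the rejected ones per source."""
    kept = []
    for record in records:
        if is_valid(record):
            kept.append(record)
        else:
            rejected_per_source[source] += 1
    return kept
```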

It is a considerable time investment. It is so expensive that I know it is never going to happen. Ensuring data quality is never going to be a quarterly team goal, no matter how many bugs we find or how many times someone asks why two dashboards show different numbers. That’s why it should be an ongoing effort, just like code quality. We should make small fixes and improvements every day. Occasionally, we may be able to justify starting a larger project. It should be easy if we don’t try it too often. After all, data quality is a business concern.
