The problem with software testing in data engineering

What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of time spent backfilling the data?

What if we could detect the problem using a simple test that compares the actual output with expected values? It would be a shame if we had never written even such a trivial test.

I am annoyed and disappointed with myself every time we find such a bug. I am frustrated because, as a proponent of automation and TDD, I shouldn’t have accepted the situation in which we depend on code reviews and manual QA to prevent bugs. Yet, in data engineering, manual QA seems to be the most popular testing method.

In this article, I am going to describe why it is so difficult to test a data pipeline and what we can do about it. After all, data engineering is software engineering, so we should follow all of the software development practices.

Why does data engineering seem to be different?

Seemingly, we are doing backend engineering. We get some input from other applications, process it, and write it somewhere else. There is nothing special about it. The problem starts when we try to write a test case for our code.

Can we test the code that transforms the data from one representation into another? That part is easy. We can prepare dozens of test cases, push them through our transformations, and compare the results with the expected values.
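
Such a test could look like the minimal sketch below, which uses PySpark and pytest. The `add_revenue` function and its columns are made up for illustration; the point is only that a pure transformation can be tested with a handful of hand-crafted rows.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A local Spark session is enough for small, hand-crafted test cases.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()


def add_revenue(df):
    # Hypothetical transformation under test.
    return df.withColumn("revenue", df.price * df.quantity)


def test_add_revenue(spark):
    # A few rows with known expected values.
    input_df = spark.createDataFrame(
        [("A", 10.0, 2), ("B", 5.0, 0)],
        ["product", "price", "quantity"],
    )

    actual = {row["product"]: row["revenue"] for row in add_revenue(input_df).collect()}

    assert actual == {"A": 20.0, "B": 0.0}
```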

The size of the data causes the first problem. We have a few MBs of test cases, but in production, the same code is going to transform a few TBs. The same software that passes the test may not handle the massive production dataset.

Should we use TBs of production data for testing? It seems impractical. First, we will never know whether the expected values are correct. Nobody is going to manually produce a list of expected results for TBs of input data. Also, would you trust that the person did not make a mistake while preparing the data?

What if the test fails? Will the fact that the few TBs of results are different from the expected data help me fix the bug?

Maybe we need two tests:

  • the first one, which uses a small dataset to check whether the transformation works correctly;

  • the second one to verify whether the code can handle the production load (see the sketch below).
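
A sketch of how such a split could look using pytest markers, reusing the `spark` fixture from the earlier example. The `transform` function, the dataset path, and the marker names are assumptions, not a prescription:

```python
import pytest


@pytest.mark.small
def test_transformation_is_correct(spark):
    # A few hand-crafted rows with known expected values; fast enough to run on every commit.
    input_df = spark.createDataFrame([("A", 10.0, 2)], ["product", "price", "quantity"])
    result = transform(input_df).collect()
    assert result[0]["revenue"] == 20.0


@pytest.mark.load
def test_pipeline_survives_production_volume(spark):
    # Reads a production-sized snapshot and checks only coarse properties,
    # because nobody can hand-verify the exact values of a few TBs of output.
    result = transform(spark.read.parquet("s3://our-bucket/production-snapshot/"))
    assert result.count() > 0
```

We could run `pytest -m small` locally on every change and schedule `pytest -m load` against a real cluster (after registering the markers in `pytest.ini`).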

Now, we must forget about the fast feedback loop. I can’t run a test case with a few TBs of data in a couple of seconds, and that code will never run on my laptop. I need a surrogate for the production Spark/Hive cluster to run the tests.

Such clusters are not cheap, so I can’t have a long-running test cluster for every developer. To save money, we have to start them before the test, run the code, and tear down the cluster every time. That is going to make the tests even slower!
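
One way to automate that start/teardown cycle is sketched below with boto3 and a pytest session fixture that creates an EMR cluster before the test session and terminates it afterwards. The release label, instance types, and IAM role names are placeholders, and the approach itself is just one option, not a recommendation.

```python
import boto3
import pytest


@pytest.fixture(scope="session")
def emr_cluster_id():
    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name="pipeline-integration-tests",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    cluster_id = response["JobFlowId"]

    # Wait until the cluster is ready before running any test.
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)

    yield cluster_id

    # Tear the cluster down after the test session so we do not pay for idle nodes.
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```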

It frustrates me that we must do so much additional work, and the outcome is not even close to being as effective as test-driven development in backend engineering. But things get even worse than that!

The data always changes

What if we get input that we did not expect? Can we run validation and reject it? Not really. As a data engineering team, we are often eavesdropping on the communication between the software written by other teams. We listen to messages on their message queues. We copy data from their databases.

We can’t expect that every time they want to make a change, they will come to us and ask whether our data pipeline is ready for it. Such an approach would turn the data team into an enormous bottleneck preventing everyone else from making any progress.

Because of that, we are always chasing a moving target. The input data constantly changes. Other developers make mistakes, and their software stops sending events. Somebody wants to run a quick A/B test, and we get a new property that we ignore because we don’t expect the message to contain an additional field.

The only thing we can do is to measure the quality of input data and the results we produce to check if we see any problems.
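
In practice, that can start as a few assertions about the output before we publish it. A minimal sketch, assuming a PySpark DataFrame with hypothetical `user_id` and `amount` columns and made-up thresholds:

```python
from pyspark.sql import functions as F


def check_output_quality(result):
    total = result.count()
    null_user_ids = result.filter(F.col("user_id").isNull()).count()
    negative_amounts = result.filter(F.col("amount") < 0).count()

    # We cannot prevent bad input, but we can refuse to publish suspicious output
    # and alert the team instead.
    assert total > 0, "The pipeline produced an empty dataset"
    assert null_user_ids / total < 0.01, "Too many rows without a user_id"
    assert negative_amounts == 0, "Negative amounts should never happen"
```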

There are two challenges to overcome. First, such property-based testing or anomaly detection is far from perfect. We may get many false positives or never notice some issues. Second, we must know what problems we are looking for before we even see them. Otherwise, we will not create a test to detect them.

Anyway, this does not prevent any errors. The problems we detect have already happened. All we can do is fix the code, run a backfill, or deal with the fact that we have lost some data.

Is there a solution?

I don’t know. On the one hand, we need more automated testing in data engineering. On the other hand, every testing method seems inadequate.

One way of dealing with the issue is to switch from “pull” to “push” data ingestion. In this approach, we can assume that pushing updates to a shared data warehouse is just another feature of every service.

It may solve one problem, but it also creates tons of trouble for the data analysts. If we disperse the responsibility, they will have to request the data they need separately from multiple teams.

What can we do?

This is why data engineering teams exist. We are here to deal with data problems, so other people don’t need to think about them.

We have to accept the fact that we will have missing data, incorrect values, or garbage pushed by bots. We must have a method of detecting the issue even if we can’t prevent it, and we must know how to fix it.
