Testing legacy data pipelines

Have you ever seen a team afraid of a data pipeline? The code constantly fails in production. They have to spend hours restarting the tasks and copying the data between locations to make it work. However, they never attempt to fix the underlying problem.

“You don’t touch that code,” they say, “every time we want to fix it, everything becomes even worse.”

Can we do anything about such data pipelines?

How to understand and fix old code

Let’s be honest here. You will need to refactor the code. You will need to change its structure to reveal the intent and make it easier to understand and modify.

Working with messy code resembles cooking at home without having a proper kitchen. If your cooker is in one room, the only sink in the house in another one, you keep the food in the fridge standing in the basement, and your cooking utensils are all over the place, you will burn all the food (and maybe the house too) while running around to find what you need. Clean up the mess first.

Unfortunately, refactoring the code isn’t as easy as collecting all of the pieces together and putting them in one place. It wouldn’t work. We need to ensure we don’t change the code’s behavior while refactoring.

How do we do it? We need automated tests. That’s the only way.

Adding tests to the legacy code

How do we add tests to existing code? It shouldn’t be a problem because many developers do it all the time. First, they write the code. Later, they write the tests for the code they have just written.

However, this time we have one minor complication. Someone else wrote the code months or years ago. What do we do now?

Start testing in the worst part of the code

To see the benefits of our effort fast and get the largest return on investment possible, we must begin in the worst part of the code.

What is the worst? I assume you know which part breaks most often. That’s the place where we must start.

However, our task may differ slightly depending on the part of the code causing the problems. Is this the data loading code? In this case, we will test the data validation code (assuming it exists) because, in the long run, we need to improve the mechanism filtering out invalid data. Is the processing pipeline breaking in the middle of the transformation because of nulls? Some earlier actions caused the problem. Look for the part producing those nulls and start testing there.

Locate the troublemaking code and write tests for its current behavior. It’s crucial to test the current behavior! We don’t change it yet. It may be wrong, but we must preserve it. At least for a while.
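
A test for the current behavior could look like the sketch below. The function `normalize_amounts` and its rule of turning unparsable amounts into 0.0 are made up for illustration. They stand in for whatever your troublemaking code does today.

```python
def normalize_amounts(rows):
    """Hypothetical legacy step: parses the 'amount' field of every row.

    Questionable behavior we do NOT fix yet: bad amounts silently become 0.0.
    """
    result = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            amount = 0.0
        result.append({**row, "amount": amount})
    return result


def test_current_behavior_bad_amount_becomes_zero():
    # Documents what the code does today, not what we think it should do.
    rows = [
        {"id": 1, "amount": "12.5"},
        {"id": 2, "amount": "n/a"},
        {"id": 3, "amount": None},
    ]
    assert normalize_amounts(rows) == [
        {"id": 1, "amount": 12.5},
        {"id": 2, "amount": 0.0},
        {"id": 3, "amount": 0.0},
    ]
```

The test name says "current behavior" on purpose. Anyone reading it later knows the assertion describes the code as it is, not the code as it should be.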

Cleaning the mess

I recommend extracting the code you want to change to a separate class or module. After extracting the code and adding the tests, you have a decently tested module you can refactor.

It’s not the time to fix the bugs yet! Our goal now is to make the code easy to understand. We can rename variables or extract functions and classes. Do whatever you have to do to make it easy to work with.
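
Here is a minimal sketch of such a behavior-preserving cleanup. Both functions are hypothetical. The first one stands for the messy original, the second one for the refactored version with the parsing rule extracted and named. The bug (bad amounts become 0.0) stays exactly where it was.

```python
def normalize_amounts_legacy(rows):
    """Hypothetical original: parsing rule buried inside the loop."""
    result = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            amount = 0.0
        result.append({**row, "amount": amount})
    return result


def parse_amount(raw):
    """Extracted helper with the same rule: unparsable values become 0.0."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0


def normalize_amounts(rows):
    """Refactored version: same output, but the intent is visible."""
    return [{**row, "amount": parse_amount(row.get("amount"))} for row in rows]


def test_refactoring_preserves_behavior():
    rows = [
        {"id": 1, "amount": "12.5"},
        {"id": 2, "amount": "n/a"},
        {"id": 3},  # missing field
    ]
    assert normalize_amounts(rows) == normalize_amounts_legacy(rows)
```

In a real project you would not keep both versions. You would run the existing tests against the refactored code instead. The comparison above only shows what "no change in behavior" means.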

Fixing the problems

You have probably noticed a bunch of errors in the code already. Some problems became apparent when you removed the mess hiding the issue. Great! Now, we can start fixing it.

How do we do it? First, we make sure we have a test for the current behavior of the code. The test must pass.

After that, we write a new test for the expected behavior. At this point, you can mark the old test as ignored or remove it right away. You may want to look at the old test code, but we won’t run it anymore. Whatever you do, remember to remove the old test when you finish.

With a new failing test, we can start the red-green-refactor cycle just like with any new code. At this stage, we are in the familiar, friendly environment of TDD again.

Should you care about test coverage?

Test coverage is a useless metric at every stage of every project.

The code coverage doesn’t tell you whether the programmer has written the tests correctly or whether the author overmocked all dependencies, wrote a test for every method, and repeated the same logic in the tests and the code.

We want to test every branch of the behavior, not every code branch. Of course, testing all of the behavior will test all of the code. However, when we write those tests, we want to look at a higher level of abstraction.

Testing every function separately is a waste of time. The behavior (and all the bugs) is hidden in the interactions between the parts of the code. We should test that.

There is no metric telling us whether we tested all code behaviors. Because of that, we shouldn’t worry too much about test coverage.

Protect the data!

All your efforts are futile if you don’t ensure the correctness of the output data.

In data engineering, that’s our only goal. Without it, everything else makes no sense.

The best time investment you can make is testing the incoming data and ensuring data quality before running any data transformations. Because of that, I suggest starting your testing and refactoring efforts with the data validation code. After all, the garbage-in garbage-out rule applies to all data pipelines.
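
A minimal sketch of such an input check in plain Python follows. The field names (`id`, `amount`) and the rules are hypothetical. The point is to separate the valid records from the invalid ones before any transformation runs, and to keep the reason for every rejection.

```python
def validate_row(row):
    """Return a list of problems found in one incoming record (empty if valid)."""
    problems = []
    if row.get("id") is None:
        problems.append("missing id")
    amount = row.get("amount")
    # bool is a subclass of int in Python, so we exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not a number")
    elif amount < 0:
        problems.append("amount is negative")
    return problems


def split_valid_invalid(rows):
    """Split records into valid ones and (record, problems) pairs."""
    valid, invalid = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            invalid.append((row, problems))
        else:
            valid.append(row)
    return valid, invalid


def test_invalid_rows_are_filtered_out_with_a_reason():
    rows = [
        {"id": 1, "amount": 5},
        {"id": None, "amount": -1},
        {"id": 3, "amount": "x"},
    ]
    valid, invalid = split_valid_invalid(rows)
    assert valid == [{"id": 1, "amount": 5}]
    assert [problems for _, problems in invalid] == [
        ["missing id", "amount is negative"],
        ["amount is not a number"],
    ]
```

Whether you drop the invalid records, send them to a separate location, or stop the pipeline is a decision for your team. The sketch only makes the decision possible.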

When you finish with data validation, I suggest switching to testing the output of the entire pipeline. In short, you validate whether your results are correct. We don’t want to propagate the mess downstream, do we? We don’t need untrustworthy data in the data lake.
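
An output check can be as simple as a function listing the violated expectations. The three expectations below (same row count as the input, unique ids, no null amounts) are only examples I made up. Replace them with the rules your downstream consumers rely on.

```python
def check_output(input_rows, output_rows):
    """Return a list of violated expectations about the pipeline output."""
    violations = []
    if len(output_rows) != len(input_rows):
        violations.append("row count changed")
    ids = [row.get("id") for row in output_rows]
    if len(ids) != len(set(ids)):
        violations.append("duplicate ids")
    if any(row.get("amount") is None for row in output_rows):
        violations.append("null amount")
    return violations


def test_output_check_reports_all_violations():
    input_rows = [{"id": 1}, {"id": 2}]
    good = [{"id": 1, "amount": 1.0}, {"id": 2, "amount": 2.0}]
    bad = [{"id": 1, "amount": 1.0}, {"id": 1, "amount": None}]
    assert check_output(input_rows, good) == []
    assert check_output(input_rows, bad) == ["duplicate ids", "null amount"]
```

You can call such a function at the end of the pipeline run and refuse to publish the result when the list is not empty.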

In the end, we can deal with the data transformation logic. There is a catch, however. What if the output is wrong because of an error at an earlier stage? In such a case, we write the tests for the output first, then we test the problematic part of the code and start fixing the root cause of the issue.

We make small steps towards the root cause at every stage, and we add tests along the way. That’s the only rule for working with untested data pipelines.

What kind of testing?

When discussing software testing, we usually must clarify what kind of tests we expect. Do we write unit tests? What is a unit anyway?

We don’t need to name the tests. Programmers want to work with unified structures, and we tend to put things into distinct categories, making up arbitrary rules about the differences between those categories. However, data engineers should feel good working with unstructured data. We may have unstructured tests too.

The goal is to test the business logic and get the high-quality data to the data warehouse or the data lake as fast as possible. If we succeed, the kind of tests we write won’t matter.

Nevertheless, I have a suggestion about testing data pipelines. What seems to work well is a testing method based on behavior-driven development. It works because you must describe the business logic in a human-readable language, so you may spot inconsistencies or get the logic validated by a subject matter expert.
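
You don’t need a dedicated BDD library to try it. The sketch below writes the scenario in the docstring and follows the given/when/then structure in a plain Python test. The business rule (cancelled orders don’t count towards the daily revenue) and the function `daily_revenue` are invented for the example.

```python
def daily_revenue(orders):
    """Hypothetical transformation: sum of non-cancelled order amounts per day."""
    totals = {}
    for order in orders:
        if order["status"] == "cancelled":
            continue
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount"]
    return totals


def test_cancelled_orders_do_not_count_towards_revenue():
    """
    Scenario: cancelled orders are excluded from the daily revenue
      Given two completed orders and one cancelled order on the same day
      When the daily revenue is calculated
      Then the revenue contains only the completed orders
    """
    # Given
    orders = [
        {"day": "2024-01-01", "status": "completed", "amount": 10},
        {"day": "2024-01-01", "status": "completed", "amount": 20},
        {"day": "2024-01-01", "status": "cancelled", "amount": 99},
    ]

    # When
    revenue = daily_revenue(orders)

    # Then
    assert revenue == {"2024-01-01": 30}
```

The docstring is the part you show to the subject matter expert. If they disagree with the scenario, you have found a problem before touching the code.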




Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
