How to add tests to existing code in data transformation pipelines

It’s easier to write tests before writing the implementation code. But what should you do when the code already exists? Is it too late to write automated tests? Do you need to rewrite everything from scratch?

Legacy code! Any code without tests is legacy code. Whether you wrote it two years ago or thirty minutes ago doesn’t matter. Let me show you how to write tests for existing legacy code.

What kind of tests can you add to existing code?

Don’t set your expectations too high.

Integration tests and smoke tests are the best we can do. However, even high-level tests are better than no tests at all. We need to refactor and restructure the code if we want more detailed tests.

When high-level tests fail, they don’t help us figure out what doesn’t work. But, at least, we will know something fails. When we have those tests, we can start refactoring the code to make individual parts testable too. Eventually, the tests will get detailed enough to tell us why something fails.

How to write tests for legacy data transformations

I assume you work on a data transformation script. You have several directories with input data, and the script produces a single output file. How do you know the code works?

What have you done to test the program manually?

I bet you ran the program and observed the behavior. You opened the output file and looked at the values.

Perhaps you started with counting the results. It’s the most straightforward test we can write.

I will tell you a secret you won’t like: tests counting the number of rows are the lamest tests possible. But at least they ensure we get the same number of results in every run of the script. You would be surprised how often data engineers write code that produces non-reproducible results. Counting rows is a little better than doing nothing.

Let’s automate the manual verification.

If the data transformation takes a lot of time, we should run the data processing only once in our test code. We will produce the output dataset once. Then, the individual tests will verify the pre-calculated output.

In pytest, we can start by defining a fixture with the module scope and autouse set to True. The fixture will run before everything else in the test file:

import pytest

@pytest.fixture(scope="module", autouse=True)
def run_the_data_transformation():
    ...  # run the whole data transformation here, once per test module
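
For example, if the pipeline is a standalone script, the fixture could run it once per test module. This is only a sketch, and the script name transform_script.py is an assumption about your setup:

import subprocess

import pytest

@pytest.fixture(scope="module", autouse=True)
def run_the_data_transformation():
    # run the whole pipeline once; every test in this module reuses its output
    subprocess.run(['python', 'transform_script.py'], check=True)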

Now, the test cases must read the output of the data transformation. We can use yet another fixture to simplify the code. We should take every opportunity to streamline the test code. It’s crucial because when we write tests after the implementation, we won’t get nice, clean test methods unless we put in the extra effort.

In the data reading fixture, we read and return the output dataset produced in the previous fixture:

@pytest.fixture(scope="module")
def output_data():
    dataset = ...  # read the output produced by the data transformation
    return dataset
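
If the script writes its result to a CSV file (an assumption, just like the output/report.csv path), the fixture could load it with pandas:

import pandas as pd
import pytest

OUTPUT_PATH = 'output/report.csv'  # assumption: where the script stores its output

@pytest.fixture(scope="module")
def output_data():
    # load the file produced by the run_the_data_transformation fixture
    return pd.read_csv(OUTPUT_PATH)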

We can inject the dataframe into the test cases by passing the output_data parameter to our tests.

For example, we can write a test counting the number of results like this:

def test_consistent_number_of_results(output_data):
    expected_number_of_results = 12345
    length = len(output_data)
    assert length == expected_number_of_results, \
        f'Expected number of rows: {expected_number_of_results}. Returned number of rows: {length}.'

We have the simplest test. It’s a good start.

I don’t know what’s in your dataset, but now you have to pick one of the business cases and find an example of its output data. You already perform this test manually every time you look at the results of a script run.

What are you looking for, and what do you expect to see?

If we were calculating employee salaries in our script, we could look for a full-time employee and verify whether we calculated everything correctly.

We get the row describing a single employee and check if we get the expected results:

def test_fulltime_salary_calculation(output_data):
    employee_id = 'abc'
    expected_salary = 5678
    employee_data = output_data[output_data['id'] == employee_id]

    salary = employee_data['salary'].values[0]
    assert salary == expected_salary, f'Expected salary: {expected_salary}. Returned salary: {salary}'

Now, we repeat the same for every other business case we need: part-time employees, contractors, etc.
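
For instance, the part-time case could look like this, where the identifier and the expected salary are placeholders you replace with values from your own data:

def test_parttime_salary_calculation(output_data):
    employee_id = 'xyz'  # placeholder identifier of a part-time employee
    expected_salary = 2345  # placeholder value taken from the manual verification
    employee_data = output_data[output_data['id'] == employee_id]

    salary = employee_data['salary'].values[0]
    assert salary == expected_salary, f'Expected salary: {expected_salary}. Returned salary: {salary}'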

How to keep the tests clean

In our tests, we dig into the dataframe to retrieve individual values. The test contains lots of code describing how to retrieve the value and not much actual business logic verification.

I would keep it like this if we had only one or two tests. But when we write more, I suggest creating an Assert Object to encapsulate the data access. We will see only the relevant information in our test code and no technical details.

In our example, we can create an Assert Object like this:

FULL_TIME_EMPLOYEE = 'abc'

class SalaryAssertions:
    def __init__(self, output_data, employee_id):
        self.employee_data = output_data[output_data['id'] == employee_id]

    def check_salary(self, expected_salary):
        salary = self.employee_data['salary'].values[0]
        assert salary == expected_salary, \
            f'Expected salary: {expected_salary}. Returned salary: {salary}'

        return self

Note that I have also put the employee identifier into a constant.

When we have the Assert Object, our test code gets much easier to read:

def test_fulltime_salary_calculation(output_data):
    SalaryAssertions(output_data, employee_id=FULL_TIME_EMPLOYEE) \
        .check_salary(expected_salary=5678)

I recommend returning self in every check in our Assert Object. The code looks much better when we can chain the test methods:

def test_fulltime_salary_calculation(output_data):
    SalaryAssertions(output_data, employee_id=FULL_TIME_EMPLOYEE) \
        .check_salary(expected_salary=5678) \
        .check_bonus(expected_bonus=2000)
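
The check_bonus method is not part of the class we defined earlier. Assuming the output dataset also contains a bonus column, we could add it to SalaryAssertions like this:

    def check_bonus(self, expected_bonus):
        bonus = self.employee_data['bonus'].values[0]
        assert bonus == expected_bonus, \
            f'Expected bonus: {expected_bonus}. Returned bonus: {bonus}'

        return self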

If the underlying dataset changes, we have only one place where we have to modify the test code - the SalaryAssertions class. For example, if we decide to rename the id column to employee_id, we have to update only the Assert Object’s constructor. Everything else stays the same.
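
The change would be limited to a single line in the constructor:

class SalaryAssertions:
    def __init__(self, output_data, employee_id):
        self.employee_data = output_data[output_data['employee_id'] == employee_id]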

Refactoring legacy code after adding tests

We haven’t finished yet. If you added enough tests to cover all relevant business cases, you are in a position to refactor the implementation safely.

First, we must extract the data transformation into a separate function.

Let’s move the code around until the first part of the script does all the data loading, the middle transforms the data, and the end of the script writes the output:

dataset_A = load_dataset_A()
dataset_B = load_dataset_B()
dataset_C = load_dataset_C()

joined_data = join_datasets(dataset_A, dataset_B, dataset_C)
output = calculate_output(joined_data)

store_the_output(output)

Now, we can move the transformation to a function:

def transform(A, B, C):
    joined_data = join_datasets(A, B, C)
    output = calculate_output(joined_data)
    return output

dataset_A = load_dataset_A()
dataset_B = load_dataset_B()
dataset_C = load_dataset_C()

output = transform(dataset_A, dataset_B, dataset_C)

store_the_output(output)

Now, we should run our previous tests. They will verify that we haven’t broken the code. While refactoring, you should make small changes and run the tests after every change. There is no safer way to restructure your code.

Because the extracted function has no side effects (it doesn’t read or write any data to files or databases), we can run the transform function without the data access code. Therefore, we can write fast tests. Those tests keep all their data in memory and don’t require extensive data preparation.

Now, we create another test file (we keep the old one!).

In this file, we will prepare the minimal dataset required to test each business case. For example, we will repeat the test of the full-time employee’s salary calculation, but this time we need only the relevant input data:

def test_fulltime_salary_calculation():
    expected_salary = ...
    working_hours = [...]  # working hours of a single example employee
    hourly_rate = [...]  # the hourly rate of that employee
    employee_data = [...]  # a single row describing the employee

    result = transform(working_hours, hourly_rate, employee_data)
    salary = result['salary'].values[0]

    assert salary == expected_salary, f'Expected salary: {expected_salary}. Returned value: {salary}.'

We are doing everything again, but this time without loading the entire input dataset. Of course, when we start repeating assertions, we can create another Assert Object to keep the tests clean.

When we finish, we have the entire business logic covered with fast tests.

At this point, we can remove the slow tests - those running the entire script with large input datasets. Of course, you can keep those tests to run them only in CI/CD as a final check before deployment.
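
If you keep them, you can stop them from slowing down local runs with a pytest marker. For example, we can mark the whole old test file as slow (the marker name is an arbitrary choice):

import pytest

# every test in this module is treated as slow and can be deselected
pytestmark = pytest.mark.slow

Locally, we run pytest -m "not slow" to skip the end-to-end tests, while the CI/CD pipeline runs pytest without the filter, so they still act as the final check before deployment.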

Now, it’s time to extract smaller parts of the transformation and test them independently.

We follow the same process as I described above. When we have a larger part of the code tested, we can safely extract functions and focus on testing them properly. Later, you can decide whether to keep the broad-scope tests or remove them because you have more granular tests.

But this will take lots of time!

It will take lots of time.

Adding tests to existing code takes way more time than following the Test-Driven Development process. Also, you won’t cover all business cases when you write tests for legacy code. You will always leave some cases untested.

Moreover, tests written after the fact may lock in your code design, for example, when you write tests checking whether a particular function was called. That’s usually a symptom of overusing mocks or writing overly detailed tests. I bet you have worked with a codebase where you couldn’t refactor anything without breaking a bunch of tests.

Remember to test the observable behavior instead of verifying whether the implementation calls its dependencies with particular parameters. Always. It doesn’t matter whether you do TDD or add tests to existing, legacy code.
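
To illustrate the difference, here is a sketch in which calculate_salary and round_salary are made-up functions, not part of your codebase. The first test pins down an internal call; the second checks only the observable behavior:

import sys
from unittest import mock

def round_salary(amount):
    return round(amount, 2)

def calculate_salary(hours, rate):
    return round_salary(hours * rate)

# Brittle: this test breaks when we inline or rename the helper,
# even though the calculated salary stays exactly the same.
def test_salary_calls_rounding_helper():
    with mock.patch.object(sys.modules[__name__], 'round_salary') as rounding:
        calculate_salary(hours=160, rate=35)
        rounding.assert_called_once_with(5600)

# Robust: this test verifies only the observable behavior.
def test_full_time_salary():
    assert calculate_salary(hours=160, rate=35) == 5600

The second test survives any refactoring that keeps the salary calculation correct; the first one does not.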
