Software engineering practices in data engineering and data science

Almost everyone can write code that works once on their machine. Software engineering makes code useful in the long term. Creating code that is easy for other people to understand and modify makes the difference between programming and software engineering.

Within Google, we sometimes say, “Software engineering is programming integrated over time.”

Programming is certainly a significant part of software engineering: after all, programming is how you generate new software in the first place. If you accept this distinction, it also becomes clear that we might need to delineate between programming tasks (development) and software engineering tasks (development, modification, maintenance). The addition of time adds an important new dimension to programming.

Software Engineering at Google by Titus Winters, Tom Manshreck, Hyrum Wright

Often, the code produced in data science/data engineering teams violates many software engineering principles. Why does that happen?

Machine learning feels different from other kinds of software. Machine learning engineers deal with complex problems. They may lack the mental energy to care about code quality in addition to training the models and preparing the data. It’s easy to imagine that the rules don’t apply to such code.

Data engineering seems different too. Batch pipelines run once a day (or a few times every day). You may write them in SQL, which differs from other programming languages. Data pipelines consume multiple inputs and may produce numerous outputs during a single run. Data pipelines aren’t similar to backend software.

However, the software engineering principles are crucial when we want to build maintainable data projects, get predictable results in production, and sustain development speed.

(To shorten the text, I often say “data engineers” when I mean all people working with data: data engineers, data scientists, data analysts, machine learning engineers, etc. Please keep that in mind.)

Automated testing for data engineers

We are fooling ourselves by thinking we can manually test data pipelines (or machine learning pipelines).

How would you do it? Would you count the number of produced results? That’s not enough. You could look at a few sample records (most likely, the first ten). But what if only one value is incorrect in a dataset of 10 million records? Would you ever spot the error?

Developers working on data have a massive problem with defining test cases.

Data engineers often don’t know how to test data transformations because they aren’t aware of the business cases within the data domain. How can you tell whether your code handles erroneous input data correctly if you don’t know what values to expect?

How can we define the use cases and understand the business domain? Ask yourself questions such as:

  1. What values are possible, and when do we see them?
  2. What does incorrect data look like?
  3. Is there any case when the “incorrect data” is, in fact, correct?
  4. Why do we have incorrect values in the first place?
  5. What happens only once in 100,000 cases but is still a correct result?

It’s not enough to get a set of inputs and their expected output! Sure, that’s enough to write a test case, but you need a name for the test case too! It isn’t enough to say “this happens.” You must know why it happens! Put that description in the test name or, if it’s too long, in a comment within the test code. Nobody will ever read it if you only link to a page with the description.

Naming a test case seems trivial. It isn’t! A descriptive test name will help you understand the code when you have forgotten the details.

How should you write the tests? I suggest using Behavior-Driven Development because it lets you separate the business aspect of the test (what happens, why, and how we know it happened) from the technical part (how we implement it in code).
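
Here is a minimal sketch of such a test written with pytest. The transformation mask_invalid_emails, its module path, and the business reason in the test name are placeholders for your own pipeline; frameworks such as behave or pytest-bdd additionally let you keep the business-readable scenario in a separate file from the step implementation.

```python
import pandas as pd

# Hypothetical transformation under test; the module path and function
# name are placeholders for your own cleaning step.
from pipeline.cleaning import mask_invalid_emails


def test_emails_without_at_sign_are_masked_because_the_legacy_form_allowed_free_text():
    # Given: a record with a malformed email (the old sign-up form
    # accepted any text in the email field)
    raw = pd.DataFrame({"user_id": [1], "email": ["not-an-email"]})

    # When: the cleaning step runs
    cleaned = mask_invalid_emails(raw)

    # Then: the row survives, but the email is masked so downstream
    # exports don't fail on it
    assert cleaned.loc[0, "user_id"] == 1
    assert pd.isna(cleaned.loc[0, "email"])
```

The test name records both the behavior and the business reason, so it still makes sense after you have forgotten the details.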

How to write tests

You are wrong if you think your job is done after writing several unit tests.

First of all, what is a “unit”? I hope you don’t think that a unit means a function or a method. Please test a unit of behavior: an action we can test independently from others.
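
For example, instead of testing every private helper inside a deduplication step, test the observable behavior of the whole step. The function name and columns below are assumptions made for this sketch:

```python
import pandas as pd

# Hypothetical pipeline step; internally it may call several private
# helpers, but the test exercises only the observable behavior.
from pipeline.events import keep_latest_event_per_user


def test_only_the_most_recent_event_per_user_survives_deduplication():
    events = pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "event_time": pd.to_datetime(
                ["2023-01-01 10:00", "2023-01-01 12:00", "2023-01-01 09:00"]
            ),
        }
    )

    deduplicated = keep_latest_event_per_user(events)

    assert len(deduplicated) == 2
    latest_for_user_1 = deduplicated.loc[deduplicated["user_id"] == 1, "event_time"].iloc[0]
    assert latest_for_user_1 == pd.Timestamp("2023-01-01 12:00")
```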

“Always design a thing by considering it in its next larger context.”

Eliel Saarinen

Also, your code won’t run in isolation. It has to cooperate with other software. Every software system exposes an API or produces data in an expected format. Don’t break other people’s code! You should test whether your code behaves in accordance with the API contract.
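
A sketch of such a contract test, with a made-up schema standing in for whatever your consumers actually rely on:

```python
import pandas as pd

# Hypothetical reporting step whose output another team consumes.
# The columns and dtypes below stand in for your real contract.
from pipeline.reports import build_daily_revenue_report

EXPECTED_SCHEMA = {
    "report_date": "datetime64[ns]",
    "country_code": "object",
    "revenue_eur": "float64",
}


def test_daily_report_keeps_the_schema_promised_to_downstream_consumers():
    orders = pd.DataFrame(
        {
            "order_id": [1],
            "country_code": ["DE"],
            "amount_eur": [19.99],
            "created_at": pd.to_datetime(["2023-01-01 10:00"]),
        }
    )

    report = build_daily_revenue_report(orders)

    assert list(report.columns) == list(EXPECTED_SCHEMA)
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        assert str(report[column].dtype) == expected_dtype, f"{column} changed its type"
```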

At some point, your test code may get complicated. Don’t let it rot. Your test code is as important as the production code!

When you start writing convoluted assertions and copy-paste them between tests, encapsulate them in an Assert Object. It will save you lots of time.
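
Here is a minimal sketch of an Assert Object for pandas DataFrames; the checks are examples, not an existing library API:

```python
import pandas as pd


class DataFrameAssert:
    """Encapsulates the assertions we used to copy-paste between tests."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    def has_columns(self, *columns: str) -> "DataFrameAssert":
        missing = set(columns) - set(self.df.columns)
        assert not missing, f"Missing columns: {missing}"
        return self

    def has_no_nulls_in(self, column: str) -> "DataFrameAssert":
        assert not self.df[column].isna().any(), f"Unexpected nulls in {column}"
        return self

    def has_row_count(self, expected: int) -> "DataFrameAssert":
        assert len(self.df) == expected, f"Expected {expected} rows, got {len(self.df)}"
        return self


# Usage inside a test:
# DataFrameAssert(result).has_columns("user_id", "email").has_no_nulls_in("user_id")
```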

Input validation in data pipelines

Lack of input data validation is often a symptom of a data team that doesn’t understand the business domain.

When you have the tests, you know what data you expect. Perhaps you can filter out the incorrect values before they cause errors. That’s why validation comes after the tests. You need to know what you are building and how it should work before you can tell whether the data you receive will help you achieve the result.

What do we do with incorrect data? Don’t remove it! I suggest putting it in a separate location, for example, “an incorrect data bucket.” You should gather metrics regarding the number of incorrect values rejected. If you typically reject some values, perhaps you can have a threshold-based alert (or use anomaly detection). If such a thing usually doesn’t happen, even a single incorrect value should trigger an alarm.
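
A sketch of that approach, with a placeholder validation rule, storage path, and metric output:

```python
import pandas as pd


def split_valid_and_rejected(orders: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Placeholder rule: an order is valid when it has a non-negative amount.
    is_valid = orders["amount_eur"].notna() & (orders["amount_eur"] >= 0)
    return orders[is_valid], orders[~is_valid]


def run_validation_step(orders: pd.DataFrame) -> pd.DataFrame:
    valid, rejected = split_valid_and_rejected(orders)

    # Keep the rejected rows in a separate location instead of deleting them.
    rejected.to_parquet("rejected/orders/2023-01-01.parquet")  # placeholder path

    # Report how much was rejected so a threshold-based alert can fire.
    rejection_rate = len(rejected) / max(len(orders), 1)
    print(f"rejected_rows={len(rejected)} rejection_rate={rejection_rate:.4f}")

    return valid
```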

Many developers don’t like writing tests, but Christopher Bergh thinks that “tests are the gift that you give to your future self.” Testing is also the most effective way to improve data quality.

Christopher Bergh calls it “defensive data coding.” It is all about not letting the bad data get into the production pipeline.

He advises that we have exception tables where we store the bad rows and report them back to the data providers. This increases the odds that the problem gets fixed upstream, and we won’t keep fixing and patching the same thing over and over again.

An excerpt from my interview with Christopher Bergh

Make it easy to access the rejected data. You should include the reason why you rejected the values! If you can retain the version of the data pipeline running at that time (for example, the hash of the git commit), you will have a debuggable data pipeline — a true unicorn among data pipelines.
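
One way to attach that context to the rejected rows, assuming the pipeline runs from a git checkout (otherwise, bake the commit hash into the build artifact); the column names are placeholders:

```python
import subprocess

import pandas as pd


def current_git_commit() -> str:
    # Records which version of the pipeline rejected the data.
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def annotate_rejected_rows(rejected: pd.DataFrame, reason: str) -> pd.DataFrame:
    annotated = rejected.copy()
    annotated["rejection_reason"] = reason
    annotated["pipeline_version"] = current_git_commit()
    annotated["rejected_at"] = pd.Timestamp.now(tz="UTC")
    return annotated
```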

CI/CD for data science

Now, we have to ensure that someone else can run your code. We know it runs on your machine. Perhaps even the tests pass on your machine. What about other people? Will the code run correctly on their computers too?

CI/CD pipelines create a new runtime environment for every build. If your tests (and the tested code) can run in such an environment, other programmers will likely reproduce the same results on their computers, and, most importantly, the code will work in production too.

It doesn’t matter how you run your CI/CD pipeline. You can use a Jenkins server, AWS CodePipeline, GitHub Actions, etc.

What does matter is that the pipeline must not require any preexisting environment to run. It must be self-contained. If the pipeline needs setup, the setup must be a part of the build script and happen every time you run the build. Otherwise, the build environment will diverge from the development environment over time, and you will never know whether you can trust your CI/CD.

Do I have to add that you must NOT deploy the code to production before the CI/CD pipeline runs all tests AND they all pass? I have to.

I once saw a production data pipeline where the production deployment happened before CI/CD started. The “deploy first, test later” approach detects errors too late. It was common for a programmer to merge changes to the main branch at 16:50, have them deployed to production, and see a test fail at 17:05, when the author wasn’t available anymore. The pipeline didn’t automatically revert anything, so the code worked incorrectly until the next business day. Don’t do that.

Cohesion and modularization of data pipeline code

What would happen if I mixed the topics in this article? Would it be easy to read if I wrote one paragraph about testing, followed by a section about input validation and a short mention of CI/CD before switching back to writing about testing? What if I changed the topic in every paragraph? You would have stopped reading already. So why do programmers write code like this?

We see lots of code where different aspects of the task are mixed together. The programmers start retrieving data from one data source and do some preprocessing. After that, they realize they will need another data source soon, so they load another dataset and then go back to preprocessing the first one.

What about mixing levels of abstraction? We use Domain-Driven Design to write high-level, business-oriented code. Suddenly, we need something from the database, so we open a connection, send a SQL query, and extract the result from the received data structure.

The offense wouldn’t be so big if we encapsulated the data retrieval in a separate function, but we often see technical details inlined in the business-related code.
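
A sketch of the difference, using sqlite3 and invented table and column names: the data-access function owns the technical detail, so the business-level function can stay at one level of abstraction.

```python
import sqlite3

import pandas as pd


def load_active_customers(connection: sqlite3.Connection) -> pd.DataFrame:
    # The technical detail (SQL, result extraction) lives in one place.
    query = "SELECT customer_id, country_code FROM customers WHERE is_active = 1"
    return pd.read_sql_query(query, connection)


def build_marketing_audience(connection: sqlite3.Connection) -> pd.DataFrame:
    # The business-level code reads like the business rule it implements.
    customers = load_active_customers(connection)
    return customers[customers["country_code"].isin(["DE", "PL", "FR"])]
```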

In my article about cohesion, I explained why cohesion is the bedrock of 10x software architecture:

“Highly cohesive software is code that serves only one purpose and minimizes noise.”

10x software architecture: high cohesion

Mixing levels of abstraction is only one symptom of low cohesion. In programming languages where ORMs are common, we often see the ORM entities used throughout the entire codebase: the storage data model becomes the domain data model. Never do that! Otherwise, you will have to compromise on both modeling the business domain and defining the database schema. Most likely, the ORM will force you into a solution ill-suited for both purposes.
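
A sketch of keeping the two models apart, assuming SQLAlchemy for the storage layer and invented column names; the mapping function is the only code that knows about both:

```python
from dataclasses import dataclass

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class OrderRecord(Base):
    """Storage model: shaped by the database schema."""

    __tablename__ = "orders"

    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, nullable=False)
    amount_cents = Column(Integer, nullable=False)
    currency = Column(String(3), nullable=False)


@dataclass(frozen=True)
class Order:
    """Domain model: shaped by the business rules, not by the table layout."""

    customer_id: int
    amount: float
    currency: str


def to_domain(record: OrderRecord) -> Order:
    # The mapping is the only place that knows about both models.
    return Order(
        customer_id=record.customer_id,
        amount=record.amount_cents / 100,
        currency=record.currency,
    )
```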

Code reviews in data teams

How do you ensure the entire team follows software engineering practices? How do you teach new team members? Pair programming would be the best approach. It may look weird initially, but it’s a perfect social activity where programmers work and learn together. You can do it in person or online. The form doesn’t matter as long as we benefit from using collective intelligence to develop software.

However, pair programming isn’t always practical. In one of my freelancing side-projects, the other team’s working hours overlap almost entirely with my full-time job. I can talk with them for only one hour a day. We must do everything asynchronously. In such cases, code reviews are the only way to collaborate on the project.

You should do code reviews even if you practice pair programming. A third person should look at the code, or the pair of authors should reread the code in solitude to ensure they still understand it when the other person isn’t around. Code reviews have no downside. You should always do them.

Of course, being a reviewer is a different skill, and people may need time to learn it. Programmers need to differentiate between a helpful comment and nitpicking. It takes time to understand that “I would write it differently” doesn’t mean “wrong.”

The code review process differs between teams. Some teams want the reviewers to point out everything. They focus on detecting convoluted logic and difficult-to-understand architecture choices as much as on finding typos and missing punctuation. On other teams, the reviewer comments only on things that would be difficult for them if they suddenly had to be the only person maintaining the code. Choose the level of review scrutiny most suitable for your team. Whatever works is good.

If you want to learn more about software engineering practices in data engineering, take a look at my other articles.
