Data Engineering - the first principles

First-principles thinking was popularized in IT by Elon Musk. Well, “popularized” is an overstatement. People talk about using first principles but aren’t willing to actually do it. Let’s give it a try anyway.

In this article, I will attempt to find the first principles of data engineering. Most likely, we’ll end up with something obvious—an obvious thing that gets ignored on a daily basis.

What are the first principles?

First principles are the immutable truths about a particular subject. They are rules with no exceptions.

Let’s take a look at an example. I’m reducing my body weight right now. It isn’t the first time I am doing it. However, for the first time, I’m actually getting results.

In the past, I was usually looking for products I should avoid eating or some arbitrary rules like no eating after 19:00. There are tons of “magic diets.” After all, the authors of books about losing weight must invent at least 20 rules because it cannot be easy. An effective diet must be complicated. Right? Not quite. In fact, we’ll most likely fail because of such invented complexity.

What have I done differently? I decided to find the one thing I have to do every day to succeed at losing weight. No fancy rules. No lists of forbidden foods. Just one single direction. The first principle.

What is generally true about body weight loss? We have to eat fewer calories than we burn. That’s it. This is the only rule. On its own, however, it isn’t a helpful rule. We must make it more actionable.

We can reduce the number of consumed calories or burn more of them. Of course, there is a limit to how few calories we can consume and stay healthy. On the other hand, when it comes to burning calories, the sky is the limit. Well, realistically, an injury is the limit. Regardless of the method we choose, we mustn’t overdo it. So there must be a rule regarding how much body weight we can lose per week and stay healthy. I have not found a definitive one. Generally, the advice is to keep the weight loss at a rate of around 0.5 kg per week.

How do we lose 0.5 kg per week? Reducing one kilogram of body fat requires burning approximately 7700 kcal, so 0.5 kg = 3850 kcal. Spread over seven days, that means we should eat about 550 kcal less per day than we burn on a given day.

That’s the first principle: Reducing body weight by 1 kg in two weeks requires a kcal deficit of 550 kcal per day (as long as the person doing it doesn’t have any medical problems). It is true regardless of the workout plan we choose or the diet we want to follow. It is true even if we allow cheat days when we eat cake all day. As long as the average daily deficit is 550 kcal, we will lose 1 kg per two weeks.
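The arithmetic is easy to check in a few lines of Python. This is only a sketch: the 7700 kcal per kilogram figure is the approximation used above, and the function name and the cheat-day numbers are made up for illustration.

```python
KCAL_PER_KG_FAT = 7700  # approximate value used in this article


def required_daily_deficit(kg_to_lose: float, days: int) -> float:
    """Average daily kcal deficit needed to lose `kg_to_lose` in `days` days."""
    return kg_to_lose * KCAL_PER_KG_FAT / days


# 0.5 kg per week and 1 kg per two weeks describe the same target.
weekly = required_daily_deficit(0.5, 7)
biweekly = required_daily_deficit(1.0, 14)

# Only the average matters. Hypothetical two weeks: one cheat day with
# a 1000 kcal surplus (a negative deficit) and 13 days that make up for it.
deficits = [-1000.0] + [8700.0 / 13] * 13
average_deficit = sum(deficits) / len(deficits)

print(weekly, biweekly, round(average_deficit, 2))
```

The cheat day forces a larger deficit on the remaining days (roughly 669 kcal instead of 550 kcal), but the two-week average stays the same.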

The first principles of data engineering

What is generally true about data engineering? What is true regardless of the data processing method we use? What applies to both batch processing and streaming? What is true when we process 1 TB of data per minute or 20 KB per week? What is true when we use a computing cluster and run a shell script on a single machine?

There is one such rule: Incorrect data is useless. It does not matter how much data we process in one hour if the values are wrong. 300 TB of garbage per day is an expensive mistake, not an impressive achievement.

How can we produce incorrect results? There are two ways. The first one is the infamous “Garbage In, Garbage Out” rule: if the input data is incorrect, we will not get a valid result at the end of the pipeline. Therefore, input data validation is crucial, but it is not enough. The second way is a mistake in our own code, so we should also measure the data quality at various stages of the pipeline to detect such mistakes.
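Measuring data quality at a pipeline stage can be as simple as counting the rows that break a rule and failing loudly when there are too many. Below is a minimal sketch, not a recommendation of any particular tool. The order records, the validity rule, and all function names are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    stage: str
    total_rows: int
    invalid_rows: int

    @property
    def invalid_ratio(self) -> float:
        return self.invalid_rows / self.total_rows if self.total_rows else 0.0


def is_valid_order(row: dict) -> bool:
    """Hypothetical rule: an order needs an id and a non-negative numeric amount."""
    amount = row.get("amount")
    return (
        bool(row.get("order_id"))
        and isinstance(amount, (int, float))
        and not isinstance(amount, bool)
        and amount >= 0
    )


def measure_quality(stage: str, rows: list[dict]) -> QualityReport:
    """Count invalid rows at a given stage without stopping the pipeline."""
    invalid = sum(1 for row in rows if not is_valid_order(row))
    return QualityReport(stage, len(rows), invalid)


def check_stage(stage: str, rows: list[dict], max_invalid_ratio: float = 0.0) -> QualityReport:
    """Measure quality and raise if the stage exceeds the allowed invalid ratio."""
    report = measure_quality(stage, rows)
    if report.invalid_ratio > max_invalid_ratio:
        raise ValueError(
            f"{stage}: {report.invalid_rows} of {report.total_rows} rows are invalid"
        )
    return report


raw = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "", "amount": 5.0},   # missing id
    {"order_id": "A3", "amount": -2},  # negative amount
]

input_report = measure_quality("input", raw)             # measure the input
cleaned = [row for row in raw if is_valid_order(row)]    # a trivial cleaning stage
output_report = check_stage("after_cleaning", cleaned)   # fail if garbage got through
```

The same `check_stage` call can be repeated after every transformation, so a bug in the middle of the pipeline is caught where it happens and not in the final report.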

We have two ways of avoiding mistakes in the code, and they work best when we combine them. First, we need to understand the business process we are automating. No tests will help us if we implement the wrong thing. The only way to spot such a mistake is to know what is supposed to be implemented. The second method is automated testing.

The first principle of data engineering and the actions we should take because of it:

  • Incorrect data is useless.
  • We can detect incorrect data by measuring the data quality at different stages of the pipeline.
  • To ensure no quality loss inside the pipeline, we need automated testing.
  • To implement the tests correctly, we need domain knowledge.

It is too trivial!

In IT, the basic principles usually tell us to avoid doing stupid things. In IT security: don’t reuse passwords, don’t try to remember passwords; use a password manager instead. In distributed systems: avoid needless communication between processes, avoid synchronous communication. In data engineering: don’t produce garbage. In frontend development: don’t confuse the users, don’t make them wait too long.

We can build a complex solution only if we take care of the basic stuff. There is no way to make anything complex when we ignore the first principles.

It is not enough

Is there anything missing? Most likely. In the case of bodyweight reduction, I have another problem. I’m peckish most of the day.

Is there anything wrong with the first principle? Hell no. It works as designed. I am peckish because I am losing weight. The problem exists because I am moving in the right direction. Of course, such a slight discomfort may reduce my chance of succeeding, so I must deal with it too.

Similarly, in data engineering, writing tests may slow us down. I think it happens when we write the tests in the wrong way. We will get slowed down if we misunderstand unit testing and test every function separately by mocking all of its dependencies. Such tests are also quite useless because they break on every refactoring.

We should test units of behavior - execution paths that we cannot divide into simpler business cases. Our tests should check the observable behavior of the application. We can use mocks instead of communicating with external systems to make the tests more stable and faster, not to test every function in isolation. After all, bugs usually occur in the code implementing cooperation between functions. Of course, we’ll also write a few integration tests. Just to be sure.
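Here is a minimal sketch of such a test. The revenue calculation, the `rate_client` object, and its `get_rate` method are hypothetical. The only mocked part is the external exchange-rate service, and the assertion checks the observable result, not which internal functions were called.

```python
from unittest.mock import Mock


def total_revenue_by_country(orders, rate_client):
    """Sum order amounts per country, converted to EUR.

    `rate_client` stands in for a hypothetical external exchange-rate service.
    """
    totals = {}
    for order in orders:
        rate = rate_client.get_rate(order["currency"], "EUR")
        country = order["country"]
        totals[country] = totals.get(country, 0.0) + order["amount"] * rate
    return totals


def test_revenue_is_summed_per_country_in_eur():
    # Mock only the external system; everything else runs for real.
    rate_client = Mock()
    rate_client.get_rate.side_effect = lambda source, target: {"EUR": 1.0, "USD": 0.5}[source]

    orders = [
        {"country": "DE", "currency": "EUR", "amount": 10.0},
        {"country": "DE", "currency": "EUR", "amount": 20.0},
        {"country": "US", "currency": "USD", "amount": 100.0},
    ]

    result = total_revenue_by_country(orders, rate_client)

    # Assert on the observable behavior of the whole unit of behavior.
    assert result == {"DE": 30.0, "US": 50.0}


test_revenue_is_summed_per_country_in_eur()
```

Because the test does not care how `total_revenue_by_country` is split into helper functions, we can refactor the implementation freely and the test keeps passing as long as the behavior stays the same.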

Tests verifying a single method are allowed when it is too cumbersome to test all possible cases going through multiple layers of code. Usually, this happens when we test validation. The validation rules may be quite complex, so we test them directly. We can even mock the validation out in the rest of the code and not worry about it again. However, mocks should be the last option we consider.
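For example, validation rules can be tested directly with a table of cases. The sketch below is an assumption-heavy illustration: the event fields, the rules, and the `validate_event` function are invented for this example.

```python
from datetime import date


def validate_event(event: dict) -> list[str]:
    """Return the names of the fields that break a (hypothetical) validation rule."""
    errors = []

    user_id = event.get("user_id")
    if not isinstance(user_id, int) or isinstance(user_id, bool) or user_id <= 0:
        errors.append("user_id")

    country = event.get("country")
    if not (isinstance(country, str) and len(country) == 2
            and country.isalpha() and country.isupper()):
        errors.append("country")

    try:
        date.fromisoformat(event.get("event_date", ""))
    except (TypeError, ValueError):
        errors.append("event_date")

    return errors


valid = {"user_id": 1, "country": "PL", "event_date": "2024-01-31"}

# Each case: (input event, expected list of invalid fields).
cases = [
    (valid, []),
    ({**valid, "user_id": 0}, ["user_id"]),
    ({**valid, "country": "pl"}, ["country"]),
    ({**valid, "event_date": "2024-02-30"}, ["event_date"]),
    ({}, ["user_id", "country", "event_date"]),
]

for event, expected in cases:
    assert validate_event(event) == expected, (event, expected)
```

Adding a new rule means adding a row to `cases`, which is much cheaper than pushing every invalid event through the whole pipeline.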

You are just trying to convince people to write tests all the time!

Yes. Because writing automated tests is the only thing that works. There is no other option. If we want to write correct software, we need automated tests. Clicking around the interface and testing manually doesn’t work. It takes too much time, and people are not good at noticing minor errors. After all, how many times have you made a typo and caught it only after sending the message?

Even worse (for people who claim that testing doesn’t work), we should write at least some of the tests before writing the code. Why not all of them? Because there will always be some additional checks we want to add after we finish the implementation, especially in data engineering. We’ll spot places where data may arrive malformed, and we will want to check how the code behaves in such cases. It is ok to add tests for edge cases after the implementation. However, I recommend writing at least one test for the “happy path” before implementing the code.
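A minimal sketch of that order of work, assuming a hypothetical parser of `sensor_id,temperature` lines. The happy path test is the one we would write first. The malformed-input cases are the kind of checks we tend to add once the implementation shows us where things can go wrong.

```python
from typing import Optional


def parse_reading(line: str) -> Optional[tuple[str, float]]:
    """Parse 'sensor_id,temperature'; return None for malformed lines."""
    parts = line.strip().split(",")
    if len(parts) != 2 or not parts[0]:
        return None
    try:
        return parts[0], float(parts[1])
    except ValueError:
        return None


def test_happy_path():
    # Written before the implementation: one well-formed line.
    assert parse_reading("s1,21.5\n") == ("s1", 21.5)


def test_malformed_lines_are_rejected():
    # Added after the implementation, once the edge cases became visible.
    for line in ["", "s1", "s1,abc", ",21.5", "s1,21.5,extra"]:
        assert parse_reading(line) is None, line


test_happy_path()
test_malformed_lines_are_rejected()
```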
