Definition of done for data engineers
In every team’s life, there is a day when they finally decide to write a definition of done. Usually, it happens after a significant misunderstanding. For example, the team deployed a subpar feature and did not even notice that something was missing. A manager, the client, or the product owner called a meeting and said, “Isn’t it obvious that…?” Well, it wasn’t obvious.
It starts as an innocent meeting. After all, the first line in the definition of done is usually “The code fulfills the acceptance criteria of the user story.” Fair enough. The code should work as expected, so there is nothing unusual about this sentence.
How do we so often end up with “definitions of done” that tell us the code must be reviewed, merged, and deployed? Is it a result of peer pressure? You have to say something, and the only thing that counts as a contribution is adding a new item to the laundry list of requirements. Apparently, staying quiet and not adding to the mess does not count. In the end, we have a long, useless document. Nobody remembers its content, and we never look at it again.
The code works as expected
In this article, I will derive a reasonable definition of done for a data engineering team starting from a simple yet fundamental requirement that “The code works as expected.”
What does it mean? First of all, we have an underlying assumption that we know what the expected output is. Then, we have to test it somehow and prove to others that the code works correctly.
We will most likely write an automated test to verify the acceptance criteria, but we don’t need to state that in the document. Of course, we will write way more tests to verify the individual chunks of code, but, to be honest, nobody cares about that. The tests are for the developers. We will not put unit testing in the document either.
The code will work as expected in the future
If the requirements have not changed, the behavior of the code should not change either. Again, we need automated tests to ensure that, but in data engineering, testing the code is not enough! What if the input data changes? What if a field arrives in a different format, fields are empty, or the distribution of values shifts?
The requirement that the code works as expected implies that we can monitor the behavior in production and get a notification when something weird happens.
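To make the three questions above concrete, here is a minimal sketch of a batch check in plain Python. Everything in it is an assumption made for illustration: the field names `amount` and `created_at`, the thresholds, and the idea of comparing against a stored baseline mean. A real pipeline would more likely use a data quality library and route failures to an alerting system.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_batch(rows, baseline_mean, max_null_ratio=0.05, max_mean_shift=0.25):
    """Run three illustrative checks on a batch of dict-shaped records.

    The field names and thresholds are hypothetical; adjust them to your data.
    """
    results = []

    # 1. Empty fields: share of rows where `amount` is missing.
    null_count = sum(1 for r in rows if r.get("amount") is None)
    null_ratio = null_count / len(rows) if rows else 1.0
    results.append(CheckResult(
        "null_ratio", null_ratio <= max_null_ratio,
        f"{null_ratio:.0%} of amounts are empty"))

    # 2. Format: every `created_at` must parse as YYYY-MM-DD.
    bad_dates = 0
    for r in rows:
        try:
            datetime.strptime(str(r.get("created_at")), "%Y-%m-%d")
        except ValueError:
            bad_dates += 1
    results.append(CheckResult(
        "date_format", bad_dates == 0,
        f"{bad_dates} rows have an unexpected date format"))

    # 3. Distribution: relative shift of the mean against a stored baseline.
    amounts = [r["amount"] for r in rows if r.get("amount") is not None]
    if amounts and baseline_mean:
        shift = abs(mean(amounts) - baseline_mean) / abs(baseline_mean)
    else:
        shift = float("inf")
    results.append(CheckResult(
        "mean_shift", shift <= max_mean_shift,
        f"mean moved by {shift:.0%} relative to the baseline"))

    return results


def failed_checks(results):
    """Return the names of the checks that should trigger a notification."""
    return [r.name for r in results if not r.passed]
```

Running `failed_checks(check_batch(rows, baseline_mean))` after every load gives us a list of names to send to whatever notification channel the team already uses. A comparison of means is a crude way to detect a distribution change, but it shows the idea.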
What else do we need? We need tests that are good enough to detect an unintended change of behavior when someone modifies another part of the code. In this case, the test is our last line of defense against an error. Maybe we can be a little more proactive here. What if we write extensive documentation that describes the feature and the architecture decisions we have made? Proper documentation may prevent a lot of errors and questions in the future. Therefore, writing and updating the documentation should also be a part of our definition of done.
What do others expect?
What else do we expect? Hopefully, other teams use the code we wrote or the data we produce. They have some expectations too. Hence, we should notify everyone who may be affected by the changes we have made.
We will probably not forget to send messages to other teams when we modify a table structure or a data format. What about changes that affect performance? What if someone needs a report at 9 am, but our change causes a two-hour delay? Because of that, awareness of code performance should also be a part of our definition of done.
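The report example can be turned into a simple check that we run before merging a change. This is only a sketch: the 6 am start time and the runtimes are hypothetical numbers, and `meets_deadline` is a helper invented for this illustration. Only the 9 am deadline and the two-hour delay come from the scenario above.

```python
from datetime import datetime, timedelta


def meets_deadline(start, expected_runtime, deadline):
    """Return True if a run starting at `start` should finish by `deadline`."""
    return start + expected_runtime <= deadline


# Hypothetical schedule: the pipeline starts at 6 am, the report is due at 9 am.
start = datetime(2024, 1, 15, 6, 0)
deadline = datetime(2024, 1, 15, 9, 0)

before_change = meets_deadline(start, timedelta(hours=2), deadline)
after_change = meets_deadline(start, timedelta(hours=4), deadline)

print(before_change)  # True: the run ends at 8 am
print(after_change)   # False: a two-hour delay pushes the end to 10 am
```

The expected runtime has to come from somewhere, for example from the duration of recent production runs or a benchmark on a representative sample. The point is that someone on the team looks at the number before the consumer of the report does.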
Last but not least, the CFO expects that we have the costs under control. If our cloud costs double suddenly, can we say that the code works as expected? In my opinion, we can make a reasonable assumption that the data engineering team should know how their changes affect the amount paid for the infrastructure and storage.
The definition of done for a data engineering team
What do we get when we summarize all of those expectations? Here is my definition of done:
- The code works as expected, which is confirmed by automated tests that verify the acceptance criteria.
- The code will work as expected in the future. Therefore, we monitor vital metrics and have defined alerts that notify us when the expectations are not met. We have also written enough automated tests to detect any unintended change of behavior.
- We will always understand what the expected behavior of the code is. To ensure that, we have written and updated the documentation.
- Others know that the code works as expected. We have notified everyone who may be affected by the code changes, and the solution does not violate the SLA (it is not too slow, and the cost does not exceed the budget).