Data pipeline documentation without wasting your time
Documenting an ETL pipeline is a daunting task. It’s difficult to document a data pipeline because you never know what you may need in the future. It gets even worse when you document an ML inference pipeline.
Let me show you how to document an ML inference pipeline. If you have an ETL pipeline, skip the parts about the ML model. Everything else will be similar.
Where should you store the documentation
I start with the storage topic because many data engineers think we need to install tons of software to get the data pipeline documented. I disagree.
Metadata management software complicates the setup. It would be worth the effort if the documentation were automatically updated when you made a code change, or if the tool could track the entire data flow. Unfortunately, it can’t. You would need lots of custom metadata-updating code to get such a feature.
That’s why I think all you need is a text editor.
I recommend storing the pipeline documentation together with the rest of your Architecture Decision Records. If you keep them in the repository that contains the code, put the pipeline documentation there too. If you have a separate documentation repository, use that. If you use Confluence, a wiki, or any other tool, put the documentation there.
Don’t set up new storage for pipeline documentation. People look for documentation where they are used to finding it, so they won’t look for yours elsewhere. Make it easy to find. Also, post a link in the README file of the repository containing the pipeline code.
Is a text file all you need? Upgrading from a plain text file to a specialized metadata tool doesn’t bring as many benefits as creating the first documentation file. In my experience, the most significant team productivity boost comes from having documentation at all, not from a fancy tool.
The minimal documentation
What information do we need in the text file? Let’s start with the information we DON’T need in the documentation.
You don’t need to explain the code
Please don’t write down the details about the processing logic. Nobody will keep it updated when they change the code. Instead, describe it in 2-3 sentences and link to the relevant part of the code and the automated tests verifying the logic. The truth is in the code. Always.
Of course, if your code is unreadable, you have a bigger problem than a lack of documentation. You should refactor the code first! Documentation won’t make up for poorly written code.
Also, if people struggle to read the code, they won’t update the documentation, because they are afraid they have misunderstood the logic. Why does that happen? I think I know what people think when they can’t understand the code easily: “If I misunderstood the code and updated the documentation with false information, everyone will know I struggled with the code!” Programmers have tons of insecurities while reading other people’s code and working with an unfamiliar codebase. They will find an excuse (like “not enough time to update the documentation”) and hope nobody asks about their code changes.
Once again: Don’t copy information from the code—link to it.
Document the ownership
At some point, someone will ask why you created the data pipeline and who uses it. Maybe you will wonder whether you still have to maintain it or whether it is time to archive the code.
Because of that, I find it helpful to record who requested the pipeline. The person may quit, so you should also write down their function and their team’s name.
Karl Brücker may no longer work at the company in a year, so you want to know whom else you can ask. The Iron Cobras team may not exist anymore, or may have changed its name. That’s why it pays off to write down what they were doing. If you noted that they were the B2B marketers, you could find someone who can answer your questions.
Document the purpose
Let’s stick to the example from the previous section. You have an ETL pipeline created for the B2B marketers. Marketing campaigns come and go. Writing down why you created the code will help you figure out whether you should retire it.
At the bare minimum, note the Jira ticket. Often, however, the ticket isn’t enough. Marketing campaigns have names as random as engineering teams’, so writing the campaign name down won’t make much difference.
Marketing teams often reuse datasets between projects. It’s great to reuse the data, but nobody will tell you about it. You won’t know about the dependencies until you remove the pipeline and somebody complains.
Is there a solution? I don’t think so. A paragraph or two explaining the general purpose of the data you produce should be enough. If you also know who uses the data, they can tell you whether they still need the dataset.
Documenting the processing logic
As I said before, the truth is in the code. You don’t need to copy the same information in the text. Instead, I recommend linking to the relevant files.
If you structured the code well, you will find separate files containing the data ingestion logic, the data preprocessing, the model training code, and the model evaluation code. You will also have a separate codebase for the runtime preprocessing/postprocessing and the model deployment code.
It is enough to link to those files.
If you keep the information in one place and link to it everywhere, you always know where the truth is. If you have two sources of facts, you never do. Well, unless you work with code. In that case, it gets easier. The truth is in the code. Always. But what… what if you don’t know which code version was deployed?
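One way to close that gap is to have the deployment pipeline inject the git commit hash into the runtime environment and log it when the service starts, so every inference log can be matched to the exact code version the documentation links to. This is a minimal sketch, not a prescribed setup; the GIT_COMMIT variable name is illustrative.

```python
import os


def code_version() -> str:
    """Return the deployed code version.

    Assumes the deployment pipeline exports the git commit hash as an
    environment variable, e.g. with GIT_COMMIT=$(git rev-parse HEAD)
    in the deploy script. Falls back to "unknown" if it is missing.
    """
    return os.environ.get("GIT_COMMIT", "unknown")


# Log the version at service startup so runtime logs identify the
# exact code that produced them.
print(f"Starting inference service, code version: {code_version()}")
```

With this in place, the "Inference logs" entry in your documentation always tells you which commit to open when you debug the pipeline.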
Data pipeline documentation template
If you need a starting point for your documentation, I recommend using this template.
Data pipeline name
- Owner: person, team, function in the organization
- Used since: the date when you deployed it for the first time in production
- Purpose: one or two paragraphs explaining what you are doing with the data and why you need it
- Jira tickets: relevant Jira tickets
- Input data: a link to the code retrieving the data from the feature store or a database
- Input preprocessing: a link to the code in the repository
- Model training: also a link
- Model evaluation: another one
- Evaluation results: a link to MLFlow or any other experiment tracking tool
- Deployed model: a link to the model files - the source directory of the deployment pipeline
- Inference code: your inference code repository
- Deployment: the deployment pipeline
- Deployment logs: the logs of the tool running the deployment code
- Inference logs: the location of your runtime logs
- Runtime env: for convenience, you can also link to the runtime dashboard or the deployed service
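To make the template concrete, here is what a filled-in file might look like, reusing the example team from earlier. The pipeline name, Jira ticket IDs, and date are made up for illustration, and the link placeholders stand in for real URLs.

```markdown
# Lead-scoring inference pipeline

- Owner: Karl Brücker, Iron Cobras team, B2B marketing
- Used since: 2021-03-15
- Purpose: Scores inbound B2B leads so the marketing team can
  prioritize outreach. The output feeds the weekly campaign reports.
- Jira tickets: MKTG-123, MKTG-456
- Input data: (link to the feature-store retrieval code)
- Input preprocessing: (link to the preprocessing module)
- Model training: (link to the training code)
- Model evaluation: (link to the evaluation code)
- Evaluation results: (link to the MLFlow experiment)
- Deployed model: (link to the model artifact directory)
- Inference code: (link to the inference repository)
- Deployment: (link to the deployment pipeline)
- Deployment logs: (link to the CI/CD job logs)
- Inference logs: (link to the runtime log location)
- Runtime env: (link to the service dashboard)
```

Note that every entry is a link, not a copy of the information. The truth stays in the code, the experiment tracker, and the logs.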
If you create an ETL service, you will link to Apache Spark logs (or whatever you use), the output location, the pipeline code repository, and Apache Spark metrics.
How to figure out the right amount of documentation
If you aren’t sure whether you should write something down in the documentation, ask yourself this question:
Will this be useful when I debug this data pipeline at night?
If you may need it, put it in the document. If not, skip it or move it to a separate section. You don’t want to read useless stuff when you want to fix something quickly and go back to sleep.
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!