What does your data pipeline need in production?

In July 2021, I asked you when you had last spent four hours debugging a data pipeline. I told you it happened to me one night when I woke up at 2 am for no reason. Unfortunately, I reached for my phone and checked Slack. The alerting channel was full of errors.

I saw that the most crucial data pipeline was failing and that we would waste an entire day fixing the errors. I decided to fix it right away, so the batch jobs running in the morning would have access to valid data and wouldn't fail too.

Have I learned anything from it? (Besides not using the phone at 2 am.)

Never allow skipping the tests

The problem was caused by a code change deployed before all tests were executed. The author wanted to stop working at 5 pm and didn’t wait for the tests. Our CI/CD pipeline allowed merges and production deployments without waiting for the test results. We allowed it for two reasons.

Our tests needed around 30 minutes to run. Sometimes we were testing changes locally and didn't want to wait another 30 minutes to see the same tests pass again. Because of that, we allowed merging while the tests were still running. It wasn't perfect… Ok, it was stupid. However, we had an excuse that looked valid when we decided to allow such merges. The idea was that if you had tested the code locally and were sure everything worked, you shouldn't waste time waiting for the same tests again.

Also, our testing environment used to be quite unstable. If multiple instances of the test suite were running simultaneously (for example, one triggered by CI/CD and one by a developer), they drained EMR resources, and tests started failing. Sometimes we had failures because an earlier test run had failed and left temporary files in the test directories. In short, we built our CI/CD on duct tape and hope.

That shouldn't have been an excuse. We should have invested more time in fixing the test environment. If it had worked correctly every time, we wouldn't have needed to bypass the tests.
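One part of that fix would have been isolating every test run's working files, so a failed run couldn't leave debris that breaks the next one. Here is a minimal sketch of that idea, assuming pytest; the shared root path and the test itself are hypothetical placeholders, not our actual test suite:

```python
import os
import shutil
import uuid
from pathlib import Path

import pytest

# Shared location used by the test jobs; the path is a placeholder.
SHARED_TEST_ROOT = Path(os.environ.get("TEST_DATA_ROOT", "/tmp/pipeline-tests"))


@pytest.fixture(scope="session")
def run_workdir():
    """Give every test run its own directory under the shared test root and
    remove it afterwards, so leftovers from a failed run cannot break the next one."""
    workdir = SHARED_TEST_ROOT / f"run-{uuid.uuid4().hex[:8]}"
    workdir.mkdir(parents=True, exist_ok=True)
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


def test_job_writes_to_its_own_prefix(run_workdir):
    # Stand-in for a real processing job that only touches its own prefix.
    output = run_workdir / "output.parquet"
    output.write_bytes(b"")
    assert output.exists()
```

The same idea applies to shared test buckets or databases: give every run its own prefix and clean it up whether the tests pass or not.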

Isolate runtime environments

We wanted to reduce costs, so we reused the same EMR cluster for multiple pipelines. It was running all the time, and we adjusted the pipeline schedules to use the cluster fully. However, if one of the processing jobs ran slower than usual, the next one could start while it was still running.

Sometimes, the EMR cluster had no resources to run new jobs. It accepted them, but nothing was happening. Airflow didn't care about the lack of resources. It scheduled the jobs and waited until they finished. However, the scripts sometimes timed out, and we had pipeline failures. We had to fix them manually: remove the job from the EMR cluster, reset the task state in Airflow, and sometimes delete temporary files.

Our test environment was similar. We had one shared cluster running the CI/CD tests, and some developers ran their tests on the same cluster. It also wasn't nearly as powerful as the production EMR. After all, the test dataset was only a curated subset of the production data. But the test EMR was getting stuck too.

At some point, we started an effort to make the production pipelines independent. After the changes, every pipeline launched a new EMR cluster, scheduled its processing jobs, and waited for the results. The clusters were set up to terminate automatically when they finished processing all steps (or failed).
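If you use Airflow with the Amazon provider, the pattern looks roughly like the sketch below: create a transient cluster, submit the steps, wait for them, and let the cluster terminate itself once there are no steps left. This is a minimal sketch, not our exact DAG; the cluster configuration, bucket paths, and DAG name are placeholders, and the import paths differ between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder cluster config; KeepJobFlowAliveWhenNoSteps=False makes the
# cluster shut itself down once all steps have finished (or failed).
JOB_FLOW_OVERRIDES = {
    "Name": "my-pipeline-cluster",
    "Instances": {"KeepJobFlowAliveWhenNoSteps": False},
}

SPARK_STEPS = [
    {
        "Name": "process_data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/process_data.py"],
        },
    }
]

with DAG(
    "my_pipeline",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )

    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )

    # Wait for the submitted step; the cluster terminates on its own afterwards.
    wait_for_steps = EmrStepSensor(
        task_id="wait_for_steps",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
    )

    create_cluster >> add_steps >> wait_for_steps
```

A cluster per pipeline costs a bit more per run, but the runs stop interfering with each other, and a stuck cluster takes down only its own pipeline.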

We should have done it earlier.

Documentation!

What is the worst part of debugging production at night? You are alone. You can’t ask questions because nobody will answer them. All you have is the code and the documentation. But only if people have written the documentation and keep it updated.

After this failure, I started writing playbooks for every production pipeline I wrote or modified. Each playbook lists all the failures I observed during development or testing and gives step-by-step instructions for debugging and fixing them.

I believe this is how we should be doing it. If you put something in production, document what to do when it stops working. If you can’t imagine a failure, you are not ready to run it in production.

Is documentation enough? We could do a little bit better.

Fix/Cleanup scripts

We should have scripts for the typical actions we take while fixing a pipeline. If you use Airflow, you have to reset the task state. Of course, doing that in the UI takes a few seconds. But what if you have to reset 20 pipelines because they timed out waiting for the data? It gets annoying quite quickly.

What's even worse, if you have multiple failures at the same time, you can't restart everything at once. There are dependencies between the DAGs! Of course, we should configure them properly using Airflow sensors. Still, it won't hurt to restart everything in the proper order. You don't want to use Airflow and be a human scheduler, do you?
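Here is a minimal sketch of such a script, assuming the Airflow 2 CLI; the exact flags depend on your Airflow version, and the DAG ids and dates are placeholders:

```python
import subprocess

# DAGs listed in the order they should be re-run, upstream first.
DAGS_IN_DEPENDENCY_ORDER = [
    "ingest_raw_events",
    "build_fact_tables",
    "refresh_reports",
]


def clear_failed_tasks(dag_id: str, start_date: str, end_date: str) -> None:
    """Clear the task state for one DAG so Airflow re-runs it."""
    subprocess.run(
        [
            "airflow", "tasks", "clear",
            "--start-date", start_date,
            "--end-date", end_date,
            "--yes",  # skip the interactive confirmation
            dag_id,
        ],
        check=True,
    )


if __name__ == "__main__":
    for dag_id in DAGS_IN_DEPENDENCY_ORDER:
        clear_failed_tasks(dag_id, "2021-07-01", "2021-07-01")
```

Listing the DAGs explicitly keeps the re-run order under your control instead of reverse-engineering it from sensors in the middle of an incident.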

However, I don't recommend automating the recovery completely. The actions should be automated, but don't automate the decision to run the script. Automated recovery often causes even more problems. You don't want to debug both the original failure and the damage caused by a faulty recovery script.
