Is it overengineered?

Overengineering - the term we often use to describe the code written by… other people.

Last weekend, I wondered whether a solution I was pretty satisfied with was overengineered. Have you ever thought about it? Have you ever looked at your code and asked whether the idea behind it still made sense? Was it old code? Mine wasn't: I wrote the code in question less than two months ago.

Infinite log

We have a backend application that stores its data in a PostgreSQL database. Quite normal. However, its database isn't a permanent record. Every day, we remove data older than a few days. The application doesn't need it anymore.
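
A minimal sketch of that retention job, assuming a hypothetical `events` table with a `created_at` column, a seven-day window, and psycopg2 as the driver (the real table name and window are different):

```python
import psycopg2  # assumption: the backend talks to PostgreSQL via psycopg2

# Illustrative table name and retention window; the real ones differ.
RETENTION_SQL = "DELETE FROM events WHERE created_at < now() - interval '7 days'"


def purge_old_rows(dsn: str) -> None:
    """The daily cleanup that keeps the operational table small."""
    connection = psycopg2.connect(dsn)
    try:
        with connection, connection.cursor() as cursor:  # commits on success
            cursor.execute(RETENTION_SQL)
    finally:
        connection.close()
```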

The application doesn’t need the data, but the machine learning engineers do. They need it to evaluate model performance in production. How can I give them the data?

Keeping it in the backend

The most trivial option is to keep the data in the production database but alter the application logic to avoid retrieving the old values during regular operation.
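
A sketch of what that would look like, assuming the same hypothetical `events` table and a cutoff baked into the query layer (the names are illustrative, not the real schema):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention rule the application would have to apply in its
# queries, even though "hide rows older than N days" is not business logic.
ACTIVE_WINDOW = timedelta(days=3)

ACTIVE_ROWS_SQL = """
    SELECT id, payload, created_at
    FROM events                      -- illustrative table name
    WHERE created_at >= %(cutoff)s   -- the extra filter every query now needs
"""


def load_active_rows(cursor):
    """Fetch only the rows the application still cares about."""
    cutoff = datetime.now(timezone.utc) - ACTIVE_WINDOW
    cursor.execute(ACTIVE_ROWS_SQL, {"cutoff": cutoff})
    return cursor.fetchall()
```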

What's wrong with such an idea? First, I'd need to modify the application code and introduce rules that have nothing to do with business logic. On the other hand, the application code is relatively simple, so adding such rules wouldn't be the end of the world. The program has a decent suite of automated tests. I could make it work.

However, the damage induced by such a change would hit me later. If I kept redundant (from the application perspective) data in the database, queries would get slower and slower. Sure, I could add indexes to counteract the degrading performance. Unfortunately, every database index is a tradeoff between the query and insert speeds. Adding (and using) an index speeds up queries, but we must update the index during every insert. Of course, the slowdown is negligible. Until it isn’t. At some point, the index update time will get noticeable.

Also, machine learning engineers would need to query the production database to run any analysis. NOT doing such things is Data Engineering 101. We're here to prevent such disastrous implementations. Obviously, I couldn't allow it to happen.

Change data capture

Change data capture looks like quite an obvious solution. Of course, I'd have to ignore deletions. Unfortunately, not every CDC tool supports ignoring deletions. For example, a few years ago, it wasn't possible in the AWS Database Migration Service. I don't know whether that has changed because I no longer need AWS DMS.
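
Ignoring deletions boils down to dropping change events whose operation type marks a delete before they reach the target. A generic sketch, assuming Debezium-style events with an `op` field; AWS DMS encodes the operation differently, and, as mentioned above, didn't let me filter it at all back then:

```python
from typing import Iterable, Iterator


def drop_deletions(change_events: Iterable[dict]) -> Iterator[dict]:
    """Forward inserts and updates, skip deletes.

    Assumes Debezium-style events where event["op"] is 'c' (create),
    'u' (update), or 'd' (delete).
    """
    for event in change_events:
        if event.get("op") == "d":
            continue  # keep the row in the analytical copy even though the app deleted it
        yield event
```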

What's even worse, the data I was supposed to write started changing. I heard, "Since you're already working on it, could you skip those columns? And add a few others instead?" It wasn't 1-to-1 database replication anymore.

Storing rows in S3

The best solution is to store the data in AWS S3. Of course, we have to rule out the most trivial and ridiculous implementations, such as creating a file directly from the backend application. It wouldn't work. First, it'd make a vast number of tiny files containing a single data point—an ETL nightmare. Second, multiple instances of the service run simultaneously in production. Sooner or later, two instances would try to write different data points to the same S3 key, and we'd lose some data or get a runtime error. Again, not a good idea.

Using a sledgehammer to crack a nut

Considering all of that, I decided to use Kinesis Firehose. I use Direct PUT in the backend applications to pass the data to Kinesis, and it automatically stores batches of events in S3. It works perfectly. Nevertheless, I’m not entirely happy with the outcome.
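
A minimal sketch of that Direct PUT call, assuming boto3 and an illustrative delivery stream name. The field names are made up, but they show where the "skip those columns, add a few others" requirement lives:

```python
import json

import boto3  # assumption: the backend runs on AWS and can use boto3

firehose = boto3.client("firehose")


def send_event(row: dict) -> None:
    """Push one record to Kinesis Firehose, which batches records into S3 objects."""
    record = {
        # keep only the columns the ML engineers asked for...
        "prediction_id": row["id"],
        "features": row["features"],
        # ...and add the extra fields that aren't in the source table
        "model_version": row["model_version"],
    }
    firehose.put_record(
        DeliveryStreamName="model-events",  # illustrative stream name
        # newline-delimited JSON keeps the batched S3 objects easy to parse
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```

In a higher-volume setup, put_record_batch would amortize the API calls, but a single put_record per event keeps the sketch short.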

Using Kinesis Firehose is the right solution. However, the volume of data currently processed by the application still makes me think I overengineered the data ingestion. It feels like launching a SpaceX Falcon rocket to deliver a pizza to a house down the street. It'd work, but it'd be better, cheaper, and more environmentally friendly to ride a bike or walk.

Kinesis is too powerful. Most likely, it’ll be too powerful for a few years. Right now, it isn’t easy to imagine we could outgrow such an implementation. On the other hand, the entire code, including gathering the additional data, sending the events, testing for those two behaviors, and even setting up the Kinesis Firehose using Terraform, fits in around 100 lines of code. Every data engineer with streaming experience instantly knows what is going on.

So, was it overengineered?
