What should you learn as a data engineer?

Learning isn’t easy when you are a data engineer.

Every day someone releases a new tool or a new open-source library. Could you learn all of them? We instinctively know we should ignore most of those new things. But what about industry-standard tools or “de facto” standards? Should we spend time learning them in detail?

We will never have enough time to learn everything. How should we prioritize? What matters most?

“It is about the witch/wizard, not the wand”

Tools come and go. Deep knowledge of one tool won’t be useful forever. In fact, deep knowledge of any single tool is rarely required, so acquiring it may be a waste of your time.

Even Airflow, Apache Spark, and AWS S3 will be irrelevant one day. It is hard to imagine, but even an industry standard may become obsolete. I own a useless piece of paper reminding me of that. I started my IT career as a network administrator, so I attended a course about Novell NetWare. Have you ever heard of it? NetWare was an important network operating system that existed for over 25 years. Until it didn’t.

Of course, some of that NetWare knowledge was still useful when I later worked with Linux or even Windows Server. Every class of tools shares some concepts and transferable knowledge. Some skills transfer between technologies no matter what you are doing. You can use them in data engineering, frontend, or even mobile development.

In this article, I will focus on those skills: the ones applicable across multiple technologies.

I assume you aren’t looking for your first job

Before we start, I must warn everybody who is looking for their first data engineering job. This article won’t help you. Here, I write about what you should learn when you are already working as a data engineer.

If you are looking for your first job, you will need to know the basics of a few standard tools. You should read the tutorials and practice using a workflow scheduler (probably Airflow), a batch data processing engine (Apache Spark, Trino (a.k.a. PrestoSQL), Hive, etc.), a data warehouse (Redshift, Snowflake), or a stream processing engine (also Apache Spark, Flink, Kafka, etc.). Pick one and focus all your efforts on learning it. Nobody will expect more from a junior data engineer (at least, reasonable people won’t expect more). In addition to that, you should get familiar with a cloud storage service (S3 or Google Cloud Storage).

In general, it is good to know what tools are used by the company where you want to work. You can learn that from job postings (although companies often exaggerate their expectations) or by asking current (or former!) employees. If you don’t want to ask, read the company blog, blogs written by programmers working there, or their Stack Overflow questions.

It will be enough for a start. You may reread this article after your first year as a data engineer.

Transferable skills

What are the transferable skills in data engineering? I made a list of my suggestions, ordered from the most data-specific to the most general skills useful to any programmer.

SQL

SQL has always been the language of data.

Even when we started using Hadoop or the first versions of Spark, we couldn’t replace SQL with anything. We built abstractions on top of those tools to support SQL (Hadoop -> Hive) or developed them until we could write data processing instructions in SQL (Spark). Now we process data streams using a SQL-like language (ksqlDB). Even QLDB, AWS’s ledger database, uses a SQL-like language for queries.

SQL is here to stay. Forever. We keep extending tools that have nothing to do with relational databases until they support relational algebra and gain a query language. SQL won’t go away. That’s a good thing because all we need to learn is one top-level language. We don’t need to worry about the implementation running our SQL queries, do we?

That’s not entirely true. Sometimes SQL hides too many details. We can use the same SQL query to retrieve data from an RDBMS and AWS Athena. Still, there is a considerable difference between indexes in a relational database and partitions in AWS S3.
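
Here is a minimal PySpark sketch of that difference (the bucket path, table, and column names are made up): the query text stays the same, but on S3 its performance depends on whether the filter matches the partitioning scheme, while a relational database would lean on an index instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Register a Parquet dataset stored on S3, partitioned by event_date.
spark.read.parquet("s3://example-bucket/events/").createOrReplaceTempView("events")

# On a relational database, an index on event_date could make this query fast.
# On S3, the same WHERE clause is only as good as the partitioning scheme:
# filtering on the partition column lets the engine skip whole directories.
daily_sales = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM events
    WHERE event_date = '2023-01-01'
    GROUP BY customer_id
""")
daily_sales.show()
```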

Understanding such differences is a crucial part of the next skill.

Software architecture

As data engineers, we will (probably) never need Domain-Driven Design, but we do have a few architecture concepts applicable to data processing.

When we talk about data processing architecture, we should consider a few levels of abstraction, starting from the general top-level idea, through the processing software and database models, down to file formats.

In addition to old-school batch processing, we have invented a few new things over the years, all of them with catchy buzzword names. We must choose between batch, batch plus streaming (the Lambda architecture), and stream-first processing (the Kappa architecture).

What about storage? We have object storage in the cloud, key-value databases, streams, blockchain ledgers, graph databases, relational databases, and, of course, data warehouses. Speaking of data warehouses: we can use them or build a data lake or a collection of data marts instead. Additionally, relational databases have their own architecture patterns applicable to data modeling. Did you know there are ten(!) normal forms of database relations?

When it comes to data processing, we may abstract everything away using SQL, but in the end, we need to know whether the query runs in an RDBMS or as a series of map-reduce operations. The underlying implementation matters when we get surprised by long processing times. We should know whether we need an index (or data denormalization) or whether we are dealing with skewed partitions.
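
If you suspect skew, you don’t have to guess. A rough sketch of the kind of checks I mean, assuming a PySpark setup with made-up paths and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/orders/")

# How many rows does each Spark partition hold? A single huge partition is a
# hint that one worker will end up doing most of the work.
(orders
 .withColumn("partition_id", F.spark_partition_id())
 .groupBy("partition_id")
 .count()
 .orderBy(F.desc("count"))
 .show(10))

# The physical plan tells you what the SQL abstraction hides: full scans,
# partition pruning, and the chosen join strategy.
orders.filter(F.col("country") == "US").explain()
```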

All of that will happen only if we manage to get the data in the first place. This may involve database replication, event-driven processing, change data capture, or simply downloading a database dump over FTP.

Unfortunately, all of those concepts matter when you have hundreds of terabytes of data. Even a trivial mistake, like choosing the wrong file format, may cost you a few thousand additional dollars every month when you work with huge amounts of data.
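
To make the file-format point concrete, here is a hedged sketch (the paths and the partition column are assumptions) of the kind of one-off job that converts a raw CSV dump into partitioned, columnar Parquet so downstream queries scan far less data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Raw CSV dump: row-oriented, uncompressed, and scanned in full by every query.
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/events/")

# Columnar, compressed Parquet partitioned by date: queries read only the
# columns and partitions they need, which is where the savings come from.
(raw
 .write
 .partitionBy("event_date")
 .mode("overwrite")
 .parquet("s3://example-bucket/curated/events/"))
```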

Functional programming

In data engineering, things get hectic quite quickly.

Having a simple computation model helps a lot, even if such a model exists only in your head.

In general, data engineering tools have nothing in common with functional programming. However, our mental model may picture data pipelines as a series of functions. If, in addition, we assume the input data stays immutable and we pass all required information as parameters (including the time), we have pure functions. At least at a high level of abstraction, we can treat the pipeline steps as pure functions.
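
A toy illustration of that mental model (plain Python with made-up names, not any particular framework): every step is a pure function, the input stays immutable, and the processing date is an explicit parameter.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Order:
    order_id: str
    amount: float
    created_on: date

def filter_for_day(orders: list[Order], day: date) -> list[Order]:
    # The processing date is an explicit parameter, never a hidden call to "now".
    return [o for o in orders if o.created_on == day]

def total_revenue(orders: list[Order]) -> float:
    return sum(o.amount for o in orders)

def daily_revenue(orders: list[Order], day: date) -> float:
    # The whole pipeline is just function composition over immutable inputs,
    # so rerunning it for the same day always produces the same result.
    return total_revenue(filter_for_day(orders, day))
```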

Functional programming teaches you a different way of thinking about code and data. I think every data engineer should learn a functional programming language. Even if we don’t use one on a daily basis, the concepts remain valuable years after you forget how to define type bounds in Scala.

Of course, even functional programming won’t help you if you are unsure whether your code works correctly.

Automated testing

Frontend and backend developers can get away with testing the application manually. Good luck doing that in data engineering.

Please don’t be a data engineer who only checks whether any output was produced and looks at the first 2-3 results. Of course, nobody expects you to read terabytes of data either. That’s why automated testing is the only way to go for data engineers.

You can do TDD in data projects too. After all, you need nothing more than input data and the expected output. It is sufficient to have a handful of examples for every business use case occurring in production. In addition, you need a few examples of the weird values you want to filter out or automatically fix in your pipeline.
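
For example, a pytest-style test of a trivial, made-up pipeline step needs nothing more than a hand-written input and an explicit expected output:

```python
def deduplicate_events(events: list[dict]) -> list[dict]:
    # Stand-in pipeline step: keep the first occurrence of every event id.
    seen, result = set(), []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            result.append(event)
    return result

def test_deduplicate_events_keeps_first_occurrence():
    events = [
        {"id": 1, "amount": 10},
        {"id": 1, "amount": 99},  # a duplicate we expect to drop
        {"id": 2, "amount": 5},
    ]
    expected = [
        {"id": 1, "amount": 10},
        {"id": 2, "amount": 5},
    ]
    assert deduplicate_events(events) == expected
```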

In data engineering, TDD isn’t fast. You won’t see the test results in two seconds if you need two minutes to start a Spark cluster, unless you keep one cluster running for the entire day. Still, even a two-minute wait is a tremendous improvement over doing the same checks manually.

Last but not least, you will need performance testing. Skewed partitions are the curse of data engineering. You make one small change in the code, and suddenly the pipeline runs three times longer. The size of the cluster doesn’t matter when one worker gets 80% of the data. You don’t want to see that in production, do you?
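
Some of those checks can be automated too. A hedged sketch of a skew guard (the fixture path, column name, and 20% threshold are arbitrary assumptions) that fails the build when a single key value would dominate a join:

```python
from pyspark.sql import SparkSession, functions as F

def largest_key_share(df, key_column: str) -> float:
    # Fraction of all rows that the single most common key value accounts for.
    total = df.count()
    biggest = df.groupBy(key_column).count().agg(F.max("count")).first()[0]
    return biggest / total if total else 0.0

def test_customer_id_is_not_badly_skewed():
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    sample = spark.read.parquet("tests/fixtures/orders_sample/")  # small fixture
    # Fail fast when one customer would receive most of the rows.
    assert largest_key_share(sample, "customer_id") < 0.2
```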

Writing

So many decisions to make. How will you document them?

Do you think people will remember why you have one pipeline writing Avro files instead of Parquet like every other pipeline? Will you remember why some pipelines start their own EMR cluster while others use a shared one? What about missing data? What if 8 hours of events are missing on 2018-07-26? Were there no sales? Was there a bug in the code? Did it happen because of a fire alarm in the warehouse? You may wonder whether you should care about it. If you are using that data to train a machine learning model, it may still matter.

You will need to communicate with other programmers, and there is no better way than writing. Sure, you can have a meeting with everyone. But tell me… how will you organize a meeting with people who will join the company two years after you quit? You won’t, but they can still read the documents you have written.

Of course, you can record a video instead of writing. However, speaking is the slowest communication method, and videos need more storage space than the same information in a text file. Also, videos (and podcasts) are much better when the author writes down a script before they record.

If all they have are bullet points, the authors must think as they speak. In such a case, they use lots of filler words. What’s even worse, they suddenly recall an essential detail while talking about something unrelated. Nobody likes it when helpful information is scattered chaotically throughout a two-hour-long video.

Write things down even if you want to record a video. Write things down because it will show you what you don’t understand well enough. Write things down to organize your knowledge and make better decisions. Write things down because you can’t remember everything.

Write things down.

Knowledge about the business

Do you have an excellent idea for code improvements or a new feature? How would you know whether it makes sense?

Imagine you have a batch job running once a week. It gathers the sales data from the entire week and generates a rollup table that gets rendered as a PDF file later. You could rewrite it as a data stream and create a nice web dashboard. Should you do it? You may think you should. After all, the business will get the updated data every two minutes instead of waiting the entire week. They will have a dashboard! Everyone loves dashboards, don’t they?

What if the report is used in one meeting that happens once a week? What if people spend 20 seconds glancing over the numbers and talk about them only when they see something unexpected? Would it still make sense to invest time in building the dashboard and implementing stream processing?

Often the best improvements from the business perspective are things we, programmers, consider boring or unimportant. Once, I changed the plans of an entire B2B department by presenting their data on a map. They had made assumptions about where our top business partners were located, but nobody had bothered to check the facts. The map showed the truth.

It pays off to know at least the basics about the business. It makes you qualified to ask the most important questions, like “Does this idea even make sense?”

About being an expert

What do you think? Do you feel like screaming, “If I listen to you, I will never become an expert! You need to specialize!”?

First of all, data engineering is a specialization already. Most likely, you know the services of one cloud provider better than the others. That is a specialization too. You have more experience with (or a preference for) batch over streaming, or the other way around. That’s yet another specialization.

Also, we have a skewed perception of specialization. In IT, people want to be experts in using a particular tool. It does pay off to know everything about Apache Spark. However, such knowledge isn’t nearly as valuable as knowing everything about solving performance issues in Apache Spark jobs that process data stored in an S3-based data lake.

Outside of IT, experts specialize in solving a particular class of problems, not using a specific tool. Let’s focus on the problem domain, not the solution domain. Otherwise, we risk working hard for months to solve a problem that doesn’t exist.
