Are you here because you typed “how to become a data engineer” in Google? What did you see? When I tried googling it, I saw four paid ads and six organic results linking to pages selling online courses or degrees.
Before we start, I have to tell you one thing. Don’t buy those courses yet. Perhaps you don’t need them.
I assume you are an aspiring data engineer looking for your first job in the industry. Most likely, you are searching for a junior data engineer position and reading everything that may help you get the job. Good, but don’t believe everything you read.
Let me explain.
Why is most advice for aspiring data engineers wrong?
When you search for hints for aspiring data engineers, you will see “helpful” articles listing the requirements and all the skills you need to get the job. What do some of the websites say?
They tell you to get a degree, get certified, and build a portfolio doing data engineering side-projects. According to them, a junior data engineer needs to develop data engineering skills such as coding, relational and non-relational databases, ETL, data storage, automation and scripting, machine learning, big data tools, cloud computing, and data security. They claim you need all of that to… get an entry-level position.
You must be kidding me.
Others suggest that you should obtain a degree in a relevant discipline related to Big Data, complete a few relevant certifications for various big data and cloud computing tools, and learn by working on real-world problems.
A person with that much knowledge and experience, and with those certifications, is probably looking for a senior position. You don’t need more than 10% of that to get an entry-level position.
Don’t trust long lists of requirements. Their authors want to induce a sense of helplessness, so that buying courses from them looks like your only hope. Courses are a great way to learn when you have a specific goal, not when you buy everything because you are afraid and overwhelmed.
Do you need to work on a portfolio project to get data engineering experience?
Doing portfolio projects will help you learn the syntax or API of the tools you use for data processing, but you won’t know anything about the real-life usage of such tools. Even worse, you will learn bad practices because you can get away with many terrible data engineering ideas when you have enough RAM to load the entire dataset on a single machine.
I don’t know about you, but I don’t privately own a hundred TBs of data that I could use to solve real-world data engineering problems at home. Do you? The advice of working on projects to build a portfolio is well intended but misleading. You won’t learn proper data engineering practices working with 10 GB of data (or even 100 GB).
Do you need a degree to get the first data engineer job?
In general, nothing is stopping you from being a data engineer without a degree. It’s not a regulated profession.
Of course, cultural expectations differ between countries, so it depends on where you want to work. If you are in a country where the government pays (with your taxes) for higher education, there is no excuse for not having at least a bachelor’s degree.
According to some comments on the Internet, it seems almost impossible to get an IT job without a technical degree in India. Similarly, programmers without a computer science degree are rare in Poland and are usually perceived as lesser employees until they get a few years of experience.
However, we are talking about a junior data engineer. For a junior, it’s usually sufficient to have started a degree. You can work while you are still studying at the university, and you don’t need to finish the degree either. It’s sufficient to be a student while looking for the first job. A diploma matters when you are looking for your first job; nobody will ask you about it once you have more than two years of professional experience.
On the other hand, a degree isn’t required if you live in a country where you have to pay for your education, and you want to work in a city with a high demand for data engineers.
What do you need to know to become a data engineer?
Every company is different, but I have a simple process for figuring out what you need to learn to get a job.
First, decide where you want to work. If you don’t know, or it doesn’t matter to you, pick a few companies randomly. After that, search for current and past job offers posted by those companies.
Of course, they most likely listed many optional requirements without telling the candidates what is mandatory and what’s nice to have. That’s why you should look for job offers from other (similar) companies and focus on the skills and tools that are common requirements across many of them.
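If you want to make the comparison less of a gut feeling, you can count how often each requirement appears. Below is a minimal sketch, assuming you copied the requirements from a few job offers by hand. The company names, the skill lists, and the `common_skills` helper are made up for illustration; they are not real job offers.

```python
from collections import Counter

# Hypothetical requirements copied by hand from job offers (not real data).
job_offers = {
    "Company A": ["SQL", "Python", "AWS", "Spark", "Kafka"],
    "Company B": ["SQL", "Python", "GCP", "Airflow"],
    "Company C": ["SQL", "Python", "AWS", "dbt"],
    "Company D": ["SQL", "Scala", "AWS", "Spark"],
}


def common_skills(offers, min_share=0.5):
    """Return (skill, count) pairs for skills listed in at least
    `min_share` of the offers, most common first."""
    # set() makes sure a skill repeated inside one offer is counted once.
    counts = Counter(skill for skills in offers.values() for skill in set(skills))
    threshold = min_share * len(offers)
    # Sort by count (descending), then by name, so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(skill, n) for skill, n in ranked if n >= threshold]


for skill, n in common_skills(job_offers):
    print(f"{skill}: required in {n} of {len(job_offers)} offers")
```

With the sample data above, SQL, AWS, Python, and Spark pass the 50% threshold, while the tools mentioned only once drop out. Those one-off tools are most likely the "nice to have" part of the offers.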
I suggest going to a meetup and asking around to get even better information. Perhaps one of the current employees may give you a hint (or you may meet the recruiter!).
You can also reach out to the data engineers at those companies and ask them. Talking directly to engineers gives you an additional benefit. If some of them ignore you or send you a rude response, you will know which companies you should avoid.
What does every junior data engineer need to know?
If you need a short list of things you definitely need to know as a junior data engineer, I think these four should be enough:
- SQL - learn enough to handle data retrieval from multiple tables
- Python - you should know how to read a file (CSV, JSON, Excel, etc.) using Pandas and filter out rows and columns, convert data formats, and upload the results to cloud storage or a database
In addition, you should pick one of the cloud providers (AWS or GCP; choose the one used by the company where you want to work). You don’t need to learn all of their products. Aspiring junior data engineers need only two:
- the cloud data storage service (Amazon S3 or Google Cloud Storage)
- their serverless data query engine (Amazon Athena or Google BigQuery)
If you want to prepare a little bit better, read about the database, streaming, and pipeline orchestration services offered by the cloud of your choice. You don’t need details. Knowing which tool you need to use when your team talks about a task is sufficient.
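To give you a sense of the SQL and Pandas level I mean, here is a minimal sketch. It assumes a tiny made-up orders dataset, and it uses an in-memory SQLite database as a stand-in for a real database, so you can run it without a cloud account. The table names, column names, and helper functions are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Made-up sample data standing in for a real CSV file.
ORDERS_CSV = """order_id,customer_id,amount,status
1,10,25.50,paid
2,11,13.00,cancelled
3,10,8.25,paid
4,12,99.90,paid
"""


def load_paid_orders(csv_source) -> pd.DataFrame:
    """Read a CSV, keep only the paid orders, and drop the status column."""
    orders = pd.read_csv(csv_source)
    paid = orders[orders["status"] == "paid"]             # filter rows
    return paid[["order_id", "customer_id", "amount"]]    # filter columns


def total_per_customer(conn: sqlite3.Connection) -> list:
    """Retrieve data from two tables: total paid amount per customer name."""
    query = """
        SELECT c.name, ROUND(SUM(o.amount), 2) AS total
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


paid_orders = load_paid_orders(io.StringIO(ORDERS_CSV))

# Convert the filtered rows to another format (a JSON array of records).
paid_json = paid_orders.to_json(orient="records")

# Upload the results to a database (SQLite here, only as a stand-in).
conn = sqlite3.connect(":memory:")
paid_orders.to_sql("orders", conn, index=False)
customers = pd.DataFrame(
    {"customer_id": [10, 11, 12], "name": ["Ann", "Bob", "Cleo"]}
)
customers.to_sql("customers", conn, index=False)

totals = total_per_customer(conn)
print(totals)
```

If you can write and explain something like this without looking everything up, you are at the level I have in mind. Uploading a file to cloud storage instead of SQLite is one more call in the SDK of the provider you picked.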
Is this enough?
It should be enough if the company is looking for a junior data engineer.
Unfortunately, people at some companies have unreasonably high expectations because they think that a junior is an underpaid mid-level programmer. Avoid such places if you can.
Should you attend a bootcamp?
In my opinion, buying an online course or attending a bootcamp makes sense in only two cases:
You don’t know what to learn, and you need someone who will guide you
Before you buy a course, try copying the course agenda and learning the topics on your own. If you still get stuck, you may benefit from signing up for the course.
You need to block time for learning
You may be a person who needs to protect your schedule from disruptions. Maybe you cannot focus for long on your own, or your family nags you while you learn. Perhaps you tend to postpone tasks until you have an immovable deadline.
If you can’t sit in front of a computer for two hours and learn, signing up for a course or workshop may be a good option, especially if the content isn’t recorded and you must attend live teaching sessions.
I don’t think you need anything else to get hired as a junior data engineer. Most of you don’t need to buy an online course or get a degree.
If you feel you need more, I suggest you read more articles on this blog ;) I have published almost 400 texts on data engineering, machine learning, and software craft.