ETL is the process of extracting, transforming, and loading data into a data warehouse or data mart. ELT is the process of loading all data into a data lake, where it is then transformed and processed.
What’s the big deal? The difference between ETL and ELT seems small. Yet, choosing the wrong data loading pattern makes data engineering a nightmare.
Why do we need ETL or ELT?
Both are popular methods of preparing data for analytics workflow. Our goal is to create a separate copy of production data in both cases. We need a copy for several reasons.
First, we don’t want to hammer the production database with analytics queries. Increased load on the production database can affect our users, and they may complain about slow responses or failing service. Nobody wants to be the person/team responsible for user complaints.
Second, when we have a snapshot of the production data, we can create a report for any available point in time. The production database always contains the current state. “Going back in time” requires polluting the production database with previous versions of data. Optionally, we could store an events log to calculate the state from the past. Data snapshots simplify a lot. Last but not least, getting a copy of all production databases in one place allows us to combine data from multiple sources. Of course, we could achieve the same result without copying the data. We could use Trino or Spark to combine values from various production sources. However, this would affect the production services even more than running separate analytics workflows in a production database.
Do you need to choose one data processing pattern?
Now, we know why we want to have a separate data warehouse or a data lake.
Which data processing pattern should we choose? Do we need to choose? What would happen if we had a data warehouse and a data lake simultaneously?
The most obvious consequence is a considerable increase in cost.
S3 storage is cheap, but the size of stored data increases quickly in a data lake. After all, we keep adding new data, store more snapshots, and nobody wants to remove anything.
Running a data warehouse isn’t cheap either. Like a data lake, the amount of data stored in a data warehouse keeps increasing forever.
Having both solutions at once makes us burn money twice as fast, but does it give us any benefits?
What about computing derived data? Can we easily combine data from a data warehouse and a data lake? As always, we can use compute engines like Trino or Spark. In fact, that’s our only option unless we want to load fresh data from S3 into the data warehouse before every query.
I haven’t mentioned the maintenance cost yet.
Is the same team who has experience in working with Apache Spark capable of dealing with a Snowflake data warehouse? Sometimes yes. Usually not. Especially when the data engineer’s egos prevent them from admitting they have no clue how to design the schema for a data warehouse.
Now, imagine we have two pipelines. One pipeline for the data lake and one for the data warehouse. Those two pipelines copy data from the same data source. Does it mean we could get an exact copy of the data in S3 and the data warehouse? Unlikely.
We all know that having code doing the same thing in two places, in two different technologies, won’t reliably produce the same results. At some point, we will make a change in one place and forget about the other. The SQL dialect in the data warehouse and the compute engine will likely have subtle differences, making them produce slightly different results.
It’s best to stick to one solution and avoid those problems. But which one?
The pros and cons of ETL
The most common data processing pattern is extract-transform-load or ETL for short.
In ETL, we first extract data from multiple sources. Then, we apply transformations to the data. For example, we calculate aggregates or join values from multiple tables. Lastly, we load the data into a target system, typically a data warehouse.
There are many benefits to using ETL.
First of all, it’s a well-established pattern. There are many tools available to help us build ETL pipelines, such as Stitch and DBT.
Second, ETL requires substantial initial effort. And that’s a good thing!
Before we even start loading the data, we need to know how we want to use it. Otherwise, we won’t create correct transformations and data schema. Because of the required effort, we tend to keep only necessary data in data warehouses. It’s unusual to see a data warehouse with useless data or data nobody needs anymore.
However, there are also some drawbacks to using ETL. It’s a complex pattern. Building ETL pipelines requires a lot of expertise and experience. If we choose the wrong data schema or create incorrect indexes, the data access will be painfully slow.
Also, sometimes it’s impossible to fix mistakes in a data warehouse design. What if we have decided we don’t need some data and later it turns out it’s necessary? It may be no longer possible to backfill the missing values from production.
The pros and cons of ELT
Extract-load-transform or ELT is a slightly different pattern. In ELT, we first extract data from multiple sources and load it into the target system. Then, we apply transformations to the data in the target system. The most common target system for ELT is a data lake.
The most significant advantage of using ELT is that it’s much simpler than ETL. We don’t need to think about the data schema or transformations in advance. We can load all the data into the data lake and apply the transformations when necessary.
However, there are also some drawbacks to using ELT. The biggest drawback is that it requires a lot of computing power. If we want to apply transformations to a large dataset, we need to have a lot of servers or use cloud services like AWS EMR, AWS Athena, Trino, etc.
Another drawback is the useless data we store in the data lake. Many data sources we copy from production never get used. Yet, we still copy them because “you must have everything in the data lake.”
Skyrocketing storage cost isn’t the only problem. Think about the morale loss when the data team keeps fixing data ingestion code that copies data nobody needs.
How to decide which one is right for your business
So, which one should you choose?
The answer is: it depends. If you have a lot of data and you’re not sure how you want to use it, ELT is a good choice. It’s simpler and cheaper to get started with ELT. Also, if your data lake contains snapshots of production data, you can later move all of the data into a data warehouse. It won’t be a big problem.
However, switching to a data warehouse will be a significant money and time investment. Such an investment is difficult to justify when you already have a functioning data lake. Therefore, data engineering teams who started with ELT tend to stick to it. Although, they would probably say they “got stuck with ELT” instead of “we decided to stick to ELT.”
If we consider only the technical issues, ELT is better when you don’t know whether you need a data warehouse. It gives you more flexibility. Of course, organizational constraints may still make it a poor choice.
If you want to learn more about data engineering and building trustworthy data pipelines, sign up for my newsletter!