What is the difference between a data lake, a data warehouse, and a data mart?
We are building some data infrastructure, but what is it? What is the proper name of that thing? In IT, we usually have big problems with naming concepts, so it is no surprise that we ended up with three very similar names describing ideas that are fundamentally different.
We can easily distinguish between data lakes, data marts, and data warehouses by focusing on three qualities: data structure (schema), data quality, and ownership.
Typically, a data warehouse is a centralized database used for analytics. When we design a data warehouse, we usually have some knowledge about the queries we need to execute in that database, so we prepare the schema specifically for that kind of usage. Because of that, we usually choose a denormalized, read-optimized schema, for example, the star schema or its partially normalized variant, the snowflake schema.
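As a rough illustration of the star schema idea, here is a minimal sketch in plain Python (all table and column names are made up): a central fact table holds measures plus foreign keys into denormalized dimension tables, and an analytical query becomes a join between them.

```python
# Dimension tables: denormalized, one row per entity
# (hypothetical example data).
dim_date = {1: {"date": "2023-05-01", "month": "2023-05", "weekday": "Mon"}}
dim_product = {10: {"name": "Notebook", "category": "Stationery"}}

# Fact table: one row per event, containing only measures
# and foreign keys into the dimensions.
fact_sales = [
    {"date_id": 1, "product_id": 10, "quantity": 3, "revenue": 12.0},
]

def revenue_by_category(facts, products):
    """Join the fact table with a dimension and aggregate a measure."""
    totals = {}
    for row in facts:
        category = products[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["revenue"]
    return totals

print(revenue_by_category(fact_sales, dim_product))
# {'Stationery': 12.0}
```

In a real warehouse, the same shape is expressed as SQL tables and joins; the point is that the schema is prepared for the queries we know we will run.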
We expect that a data warehouse contains data that is “ready to use,” so the data loaded into the data warehouse is already validated and preprocessed. Often, we assume that the data warehouse is the “single source of truth” and always contains correct values.
In the case of a data lake, we have a centralized storage of raw data. Typically, multiple transformations process the same raw data, so we end up with many data structures derived from the same source. The primary purpose of a data lake is to have everything in one place to make it easy to join multiple data sources.
It is common to validate the data lake content at the time when the data is used, not when it is loaded into the data lake. On the one hand, that causes data quality problems because there may be multiple validation pipelines that are not compatible with each other. But on the other hand, it allows defining what “valid” means in every data transformation (after all, it does not need to mean the same for everyone).
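That "validate at read time" idea can be sketched in Python (the record fields and validation rules below are invented for illustration): each consumer reads the same raw events from the lake and applies its own definition of "valid".

```python
# Raw events as they might land in a data lake: untyped and
# inconsistent, validated only when read (hypothetical fields).
raw_events = [
    {"user_id": "42", "amount": "19.99"},
    {"user_id": None, "amount": "oops"},
]

def read_for_billing(events):
    """Billing needs a parseable amount and drops everything else."""
    valid = []
    for event in events:
        try:
            valid.append(float(event["amount"]))
        except (TypeError, ValueError):
            continue  # invalid for billing, but maybe fine elsewhere
    return valid

def read_for_marketing(events):
    """Marketing only needs a user_id; a broken amount is irrelevant."""
    return [e["user_id"] for e in events if e["user_id"] is not None]

print(read_for_billing(raw_events))    # [19.99]
print(read_for_marketing(raw_events))  # ['42']
```

The second event is rejected by one pipeline and accepted by the other, which is exactly the trade-off described above: flexible definitions of validity at the cost of consistency between pipelines.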
A data mart is easy to define. It is just a decentralized version of the data warehouse. Instead of building one large system for the whole company, we create smaller analytics tools for every department.
That kind of design allows us to apply domain-driven design concepts to data storage. We can isolate domains, define concept names based on the domain-specific ubiquitous language, and not worry about multiple definitions of the same name.
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz