What is the difference between a data lake, a data warehouse, and a data mart?

We are building some data infrastructure, but what is it? What is the proper name of that thing? In IT, we usually have big problems with naming concepts, so it is no surprise that we ended up with three very similar names to describe ideas that are fundamentally different.

We can distinguish between data lakes, data marts, and data warehouses by focusing on three qualities: data structure (schema), data quality, and ownership.

Data warehouse

Typically, a data warehouse is a centralized database used for analytics. When we design a data warehouse, we usually have some knowledge about the queries we need to execute in that database, so we prepare the schema specifically for that kind of usage. Because of that, we usually decide to use a dimensional model, for example, the star schema (with denormalized dimension tables) or the snowflake schema (where the dimensions are split into further normalized tables).
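To make the idea of a query-driven schema more concrete, here is a minimal sketch of a star schema built with Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all made up for illustration. A real warehouse would run on a dedicated analytical database, but the shape of the schema and of the query would look similar.

```python
import sqlite3


def build_star_schema(conn):
    # One fact table in the middle, surrounded by dimension tables.
    # All names and rows below are hypothetical.
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            quantity INTEGER,
            revenue REAL);
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Keyboard", "Hardware"),
         (2, "Mouse", "Hardware"),
         (3, "IDE license", "Software")])
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, "2024-01-15", "2024-01"), (2, "2024-02-03", "2024-02")])
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 2, 100.0), (2, 1, 1, 20.0), (3, 2, 1, 80.0), (1, 2, 1, 50.0)])


def revenue_by_category_and_month(conn):
    # The kind of question we knew about when designing the schema:
    # every dimension is one join away from the fact table.
    return conn.execute("""
        SELECT p.category, d.month, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
report = revenue_by_category_and_month(conn)
```

In a snowflake variant of the same sketch, category would probably move out of dim_product into its own table, which adds one more join to the query.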

We expect that a data warehouse contains data that is “ready to use,” so the data loaded into the data warehouse is already validated and preprocessed. Often, we assume that the data warehouse is the “single source of truth” and always contains correct values.

Data lake

A data lake is a centralized store of raw data. Typically, multiple transformations process the same raw data, so we end up with many data structures derived from the same source. The primary purpose of a data lake is to have everything in one place to make it easy to join multiple data sources.

It is common to validate the data lake content when the data is used, not when it is loaded into the data lake. On the one hand, that causes data quality problems because there may be multiple validation pipelines that are not compatible with each other. But on the other hand, it allows every data transformation to define what “valid” means (after all, it does not need to mean the same for everyone).
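A small sketch may help to show what validating at read time looks like. The raw records, the field names, and the two consumers (a revenue report and a users-per-country report) are hypothetical. The point is only that both transformations read the same raw data and each applies its own definition of “valid.”

```python
# Raw records as they could land in the lake: nothing was checked on load.
# The fields and values are made up for illustration.
RAW_EVENTS = [
    {"user_id": "42", "amount": "19.99", "country": "PL"},
    {"user_id": "", "amount": "5.00", "country": "DE"},
    {"user_id": "7", "amount": "not-a-number", "country": None},
    {"user_id": "8", "amount": "3.50", "country": None},
]


def is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def revenue_view(raw):
    # For this consumer, a record is valid when the amount can be parsed.
    # A missing user or country does not matter here.
    return [float(r["amount"]) for r in raw if is_number(r["amount"])]


def users_by_country_view(raw):
    # For this consumer, a record is valid when it has a user and a country.
    # The amount is ignored, even when it is garbage.
    result = {}
    for r in raw:
        if r["user_id"] and r["country"]:
            result.setdefault(r["country"], []).append(r["user_id"])
    return result


revenue = revenue_view(RAW_EVENTS)
users = users_by_country_view(RAW_EVENTS)
```

The second record counts as revenue but has no user, and the fourth one has a user but no country, so the two views keep different subsets of the same data. That is the flexibility, and also the source of the incompatibility problem.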

Data mart

A data mart is easy to define. It is just a decentralized version of the data warehouse. Instead of building one large system for the whole company, we create smaller analytics databases for every department.

That kind of design allows us to apply the domain-driven design concepts to data storage. We can isolate domains, define concept names based on the domain-specific ubiquitous language, and not worry about multiple definitions of the same name.
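As a rough illustration of one name having two definitions, the sketch below models two data marts as plain Python classes. The departments, the class names, and both definitions of an “active customer” are assumptions made up for this example. In practice, each mart would be a separate database or schema owned by its department.

```python
from datetime import date


class SalesMart:
    """Hypothetical data mart owned by the sales department."""

    def __init__(self, last_purchase_by_customer):
        self._last_purchase = last_purchase_by_customer

    def active_customers(self, today):
        # In sales language, "active" means: bought something
        # within the last 90 days.
        return sorted(
            customer
            for customer, purchased in self._last_purchase.items()
            if (today - purchased).days <= 90
        )


class SupportMart:
    """Hypothetical data mart owned by the support department."""

    def __init__(self, open_tickets_by_customer):
        self._open_tickets = open_tickets_by_customer

    def active_customers(self):
        # In support language, "active" means: has at least one open ticket.
        return sorted(
            customer
            for customer, tickets in self._open_tickets.items()
            if tickets > 0
        )


sales = SalesMart({"alice": date(2024, 5, 1), "bob": date(2023, 11, 20)})
support = SupportMart({"alice": 0, "bob": 2})

active_for_sales = sales.active_customers(date(2024, 6, 1))
active_for_support = support.active_customers()
```

Both answers are correct inside their own domain, and neither team has to agree on a company-wide definition first.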
