Data versioning with LakeFS

I have to admit that this blog post looks like an advertisement, but LakeFS is a free, open-source project created by Treeverse. I recently talked with Einat Orr - the CEO of Treeverse - who explained to me what problems they are solving and how LakeFS works.

What is LakeFS? LakeFS is the non-stick frying pan of data engineering. Seriously. Technically, you don’t need that tool, but the improvement in quality of life when you start using it is tremendous.

There are many problematic aspects of building a data lake. However, what bothers me the most is creating a test environment and stopping consumers from reading incomplete data sources.

What do we do when we need to test a new data pipeline? There are two options. We can use the production data as the test source and write the output to a separate test directory. That works, but the production data changes all the time. It would be great to have a test set that remains constant for a few days. Perhaps we should copy the data to a test bucket in S3? Sure, that works too. I bet that you have not updated your test bucket in months. It is just too much of a hassle. Also, you probably have multiple test buckets. Which one can be safely removed? Are you sure that nobody uses a test bucket in production? Such a manual setup quickly becomes a mess.

What about producers and consumers? How do we inform other services or parts of the data pipeline that the data is ready to be consumed? What do you do? Create a vast dependency graph in Airflow to prevent the consumers from running too early? Store marker files to indicate whether the directory is ready to use? Store the statuses in databases?

LakeFS is a version control system for data lakes that solves all of those problems.

We can create a test branch from the production data, use it to test the pipeline, and remove it when we no longer need it. Not impressed? LakeFS creates a copy of the data in seconds! How do they do that? First of all, they don’t make a copy. It is all metadata. LakeFS provides a wrapper around the S3 API that intercepts the calls and points your software to the correct file version based on the path and branch. LakeFS implements copy-on-write, so no copy is created until you write data to a file. You can create as many branches as you need without increasing the storage cost.

Because we have branches in the data lake, we can implement a data equivalent of the CI/CD pipeline. We can ingest data in a separate branch, run the validation code and preprocessing in that branch, and merge it into the main branch when everything is ready to use. No more marker files, status databases, or errors caused by reading incomplete data.
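
To make that concrete, here is a minimal sketch of such a flow in PySpark, assuming Spark is already configured to talk to LakeFS (as shown in the next section). The repository, branch, and path names are made up for illustration, and the merge step itself is performed through the LakeFS UI, the lakectl CLI, or the API rather than through Spark:

# Ingest into a dedicated branch instead of the main branch.
raw = spark.read.json("s3a://example-repo/daily-ingest/events/2021-08-01/")

# Validate and preprocess the data within the same branch.
valid = raw.filter("user_id IS NOT NULL AND event_time IS NOT NULL")
valid.write.mode("overwrite").parquet("s3a://example-repo/daily-ingest/events-clean/2021-08-01/")

# When the checks pass, merge the daily-ingest branch into main.
# Consumers reading s3a://example-repo/main/... never see partial data.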

Do you need to reproduce an error or revert a failed batch job? Have you ever attempted doing it in an S3 bucket with versioning? It is possible but very time-consuming. With LakeFS, you can revert to a previous commit using just one command - like in Git.

How does it work in Apache Spark?

We need two modifications to the code. First, we have to redefine the S3 API endpoint. That is where the “magic” happens. LakeFS implements the S3 API and overrides some data access operations. That’s why we can access files in a branch, make “copies” by creating multiple aliases to the same version of a file, and perform atomic merges on terabytes of data almost without changing the application code. It is all hidden by the API wrapper.

# Point the S3A filesystem at the LakeFS S3 gateway instead of AWS S3:
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.lakefs.example.com")
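
The endpoint is not the only setting. Spark also has to authenticate against LakeFS with an access key and secret key generated in LakeFS, not your AWS keys. A minimal sketch - the values are placeholders, and the exact set of options may vary with your Hadoop and LakeFS setup:

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# Credentials generated in LakeFS (placeholders, not real keys):
hadoop_conf.set("fs.s3a.access.key", "<lakefs-access-key-id>")
hadoop_conf.set("fs.s3a.secret.key", "<lakefs-secret-access-key>")
# Use path-style addressing so the repository name is part of the path,
# not a DNS subdomain:
hadoop_conf.set("fs.s3a.path.style.access", "true")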

The second modification is the file paths. After we point Spark at the LakeFS API, we can access files in the repository. The naming convention is quite simple: the repository id becomes the bucket name, and the first “directory” in the object key denotes the branch:

repo = "repository-id"
branch = "branch-name"
dataPath = f"s3a://{repo}/{branch}/path-to-the-file/file.parquet"
spark.read.parquet(dataPath)
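
The same object key can be read from any branch - LakeFS resolves it to the version of the file that belongs to that branch. For example, assuming the repository has a main branch and a test-pipeline branch (both names are just an illustration):

productionPath = f"s3a://{repo}/main/path-to-the-file/file.parquet"
testPath = f"s3a://{repo}/test-pipeline/path-to-the-file/file.parquet"

# Both reads go through LakeFS; each branch sees its own version of the file.
productionData = spark.read.parquet(productionPath)
testData = spark.read.parquet(testPath)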

Is LakeFS worth the effort?

The setup is not complicated. However, installing additional tools always looks like overkill when you start building a new data lake. It makes no sense to install it when you have one Spark job. Over time it gets even harder to justify the effort. After all, if you could handle 40 Airflow DAGs and a few dozen S3 buckets without LakeFS, you can probably handle 41 DAGs too.

We keep making excuses until that one day - the day when something breaks, and we immediately regret that we did not install LakeFS three months earlier. Do yourself a favor and don’t wait until it happens.
