Data engineers are data librarians, or how to upgrade your data lake to 2500 BCE technology
How do you imagine a library before we started using computers everywhere?
You’re reading a blog for programmers, so I have to clarify something. A library is a public institution that lends you books for free ;)
What did a library look like without computers? How would you find a book in the building?
You could walk between the shelves and read the titles. Right? Yes, you could do a full scan. That would work, at least in small buildings. In some libraries, they would kick you out at the end of the working day before you found what you were looking for.
So there must be a better way.
They sort the books, right? They sort them by genre and author. You could find the part of the building where you are most likely to find the book. Then you could go there and start checking what they have on the shelves. It would take some time, but you would find the book. Eventually.
What if you didn’t know what you were looking for? Did they have a search engine? What could you do in a pre-computer library besides asking a friendly librarian?
You could use the card catalog!
A card catalog keeps a record of what is in the library. It’s organized by book title, author, and subject. You can go to the author catalog to find all of Stephen King’s books. You can also go to the title catalog to look for all books titled “Cujo.” You would find the same book in both places. Of course, you could also find it in the subject catalog.
Do you know how long librarians have been using a card catalog? Sumerian librarians used clay tablets as a card catalog system in 2500 BCE! (https://en.wikipedia.org/wiki/Cataloging_(library_science)#History)
What’s the point? What does it have to do with data engineering?
Do you have a metadata catalog? If I ask you for some data, can you tell me the file location without opening file after file to check whether you found the right one?
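To make the analogy concrete, here is a minimal sketch of a card catalog for a data lake, written in Python. Everything in it is made up for illustration: the DatasetEntry and DataCatalog names, the dataset names, and the s3://example-lake/ paths are assumptions, not the API of any real catalog tool. A real data catalog does much more, but the idea is the same: keep one record per dataset and index it in several ways, so finding a file location is a lookup and not a full scan.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog card: metadata about a dataset plus where its files live."""
    name: str
    owner: str
    subject: str
    location: str  # hypothetical path, used only for illustration


class DataCatalog:
    """A toy card catalog with one index per way of looking a dataset up."""

    def __init__(self):
        self._by_name = {}                      # like the title catalog
        self._by_owner = defaultdict(list)      # like the author catalog
        self._by_subject = defaultdict(list)    # like the subject catalog

    def register(self, entry):
        # File the same card in every index, as a librarian would.
        self._by_name[entry.name] = entry
        self._by_owner[entry.owner].append(entry)
        self._by_subject[entry.subject].append(entry)

    def locate(self, name):
        # A single dictionary lookup; no files are opened.
        return self._by_name[name].location

    def by_owner(self, owner):
        return list(self._by_owner.get(owner, []))

    def by_subject(self, subject):
        return list(self._by_subject.get(subject, []))


catalog = DataCatalog()
catalog.register(DatasetEntry("orders", "sales-team", "sales", "s3://example-lake/sales/orders/"))
catalog.register(DatasetEntry("refunds", "finance-team", "sales", "s3://example-lake/sales/refunds/"))
catalog.register(DatasetEntry("page_views", "web-team", "traffic", "s3://example-lake/web/page_views/"))

print(catalog.locate("orders"))
print([entry.name for entry in catalog.by_subject("sales")])
```

The "orders" dataset shows up under its name, its owner, and its subject, just like “Cujo” shows up in the title, author, and subject catalogs.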
If you don’t have a data catalog, your data lake uses pre-2500 BCE technology!
Seriously! People have known how to catalog information for roughly 4,500 years.
In 2022, it’s finally time to upgrade your data lake to 2500 BCE technology!
- MLOps engineer by day
- AI and data engineering consultant by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @firstname.lastname@example.org