What are the 4 V's of big data, and which one is the most important?
One of the first models to describe big data was the four Vs model. It divides big data problems into four categories (sometimes called dimensions): volume, velocity, variety, and veracity.
Volume is the big data dimension that everyone seems to understand intuitively. If it is big, it means we have a lot of it.
Volume is all about how many terabytes of data you have. It causes problems not only because of the storage required but also because of the time needed to query the data.
We try to deal with issues caused by high volumes of data by partitioning the files in a way that speeds up queries, building rollups that summarize multiple data points into one, and sampling.
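Here is a minimal sketch of those three techniques in plain Python. The function names, the `year=/month=/day=` key layout, and the `(timestamp, value)` event shape are assumptions made for illustration; in practice the storage engine or query framework usually does this work for you.

```python
import random
from collections import defaultdict
from datetime import datetime

def partition_key(ts: datetime) -> str:
    """Build a directory-style key so a query for one day reads only that day's files."""
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def hourly_rollup(events):
    """Summarize (timestamp, value) pairs into one (count, total) row per hour."""
    buckets = defaultdict(lambda: [0, 0.0])
    for ts, value in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour][0] += 1
        buckets[hour][1] += value
    return {hour: (count, total) for hour, (count, total) in buckets.items()}

def sample_events(events, fraction, seed=0):
    """Return a reproducible random subset holding roughly `fraction` of the events."""
    size = max(1, int(len(events) * fraction))
    return random.Random(seed).sample(events, size)

# Hypothetical events: three measurements spread over two hours.
events = [
    (datetime(2024, 1, 1, 10, 5), 2.0),
    (datetime(2024, 1, 1, 10, 40), 3.0),
    (datetime(2024, 1, 1, 11, 15), 5.0),
]
rollup = hourly_rollup(events)
```

The rollup trades detail for speed: a query that needs only hourly totals reads two rows instead of three events, and the gap grows with the volume.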
Velocity describes how fast we get new information. Is it 1 GB per day, or is it 20 GB per second? Even people who have never processed large amounts of data quickly grasp the idea that we cannot use the same code to handle 1 GB/day and 20 GB/s.
At some point, daily batch processing no longer makes sense, even if the business does not require more updates than once a day. When the velocity gets too high, we start using micro-batching or data streaming.
In the case of real-time data processing, streaming is the only sensible solution. Of course, technically, you can try micro-batching every second. If you have done it and it worked correctly, let me know. I want to hear that story.
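To make the difference concrete, here is a rough sketch of micro-batching: instead of waiting for the whole day of data, we process small groups of records as they arrive. The `micro_batches` helper is hypothetical, and it cuts batches by record count to keep the example deterministic. Real micro-batching systems usually cut them by time interval.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield lists of up to `batch_size` records taken from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch

# Five hypothetical records processed two at a time.
batches = list(micro_batches(range(5), batch_size=2))
```

With a batch size of one, this degenerates into record-by-record processing, which is roughly what a streaming system does without the per-batch overhead.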
Variety is about where the data comes from. What is the data source? We get data from relational databases, document databases, graph databases, XML documents, CSV files, logs, and other event streams. There are tons of data sources, and every one of them is different. Note that I listed only data sources that have at least some coarse structure. What about unstructured data?
Our data pipelines must correctly read data from all of those sources. That requirement causes problems not only because we spend a lot of time writing the code, but also because the code needs extensive testing and constant monitoring.
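As a small illustration, assume the same purchase data arrives as CSV from one system and as JSON lines from another, with different field names. Everything in this sketch (the `user_id`/`amount` target shape, the `userId`/`total` source fields, and the reader functions) is made up for the example.

```python
import csv
import io
import json

def read_csv(text):
    """Parse CSV text with `user_id` and `amount` columns into common records."""
    return [
        {"user_id": int(row["user_id"]), "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(text))
    ]

def read_json_lines(text):
    """Parse JSON lines that use `userId` and `total` into the same record shape."""
    records = []
    for line in text.splitlines():
        if line.strip():
            doc = json.loads(line)
            records.append({"user_id": int(doc["userId"]), "amount": float(doc["total"])})
    return records

# One reader per source type; every new source means another entry here.
READERS = {"csv": read_csv, "jsonl": read_json_lines}

def load(source_type, text):
    """Dispatch to the reader registered for the given source type."""
    return READERS[source_type](text)

from_csv = load("csv", "user_id,amount\n1,9.5\n")
from_json = load("jsonl", '{"userId": 1, "total": 9.5}\n')
```

Both calls produce the same record, but each reader is a separate piece of code. A renamed field in one source breaks only its own branch, which is why every reader needs its own tests and its own monitoring.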
Speaking of monitoring. How do we know that the data we gathered is useful? Do we have big data or big garbage? Veracity is all about validating the data and checking whether our pipelines produce correct and useful outputs.
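A validation step does not have to be complicated to be useful. The sketch below assumes records are dictionaries with an `amount` field and checks two hypothetical rules: the batch is not suspiciously small, and every amount is present and non-negative. The rules, the field name, and the `validate` function are illustrative only.

```python
def validate(records, min_rows=1):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    if len(records) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(records)}")
    for index, record in enumerate(records):
        amount = record.get("amount")
        if amount is None:
            problems.append(f"row {index}: amount is missing")
        elif amount < 0:
            problems.append(f"row {index}: amount is negative ({amount})")
    return problems

# A hypothetical batch with one negative amount and one missing amount.
problems = validate([{"amount": -2}, {}])
```

Running a check like this before publishing the output means the pipeline finds the suspicious number before anyone else does.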
In my opinion, this is the most critical dimension of big data, and I can easily justify that opinion.
What are the six words that will destroy your credibility and jeopardize your career? “This number does not look right.”
When you hear it, it is already too late. People don’t trust you anymore. Your co-workers are suspicious about everything you have done so far. Suddenly, you become a person who makes mistakes — the person who was probably feeding us false data for years.
At this point, your supervisor starts to wonder: “Why do we even keep him/her here?” After all, being the manager of such a person is not comfortable either. You keep hearing questions like: “How could you let that happen?” “What are you going to do about it?” or “Can we trust your team at all?”
If the company makes money and everything is fine, they will probably ask you to fix the problems and let that slide. What if the management made a colossal mistake recently, and the company is on the brink of bankruptcy?
Congratulations, you have just been promoted to the scapegoat position. Everyone in the company will know your name. It may not be your fault, but you will be the one to blame.
The only good side of the situation is the fact that your suffering will not last for long. Even if the company can’t officially prove that you are the one responsible for problems and fire you, they will make your life miserable and manage you out within the next few months.
That’s why I think that it does not matter how big your big data is or how fast it grows. It does not matter how many data sources you process. The only important thing is: “Can we trust the data?”
Are there more than four Vs?
Of course, there are. Dozens of people wanted their share of the fame and invented additional dimensions whose names start with V.
Some people talk about data “Value.” Those people want you to make sure that the organization is getting value from the data.
Others point out that data constantly changes. Those people describe the changes as the “Variability” dimension. They remind us that we must be ready to constantly improve the validation rules and data source handlers because the structure and the range of correct values continuously change.
For me, both of those dimensions are extensions of the Veracity dimension. It is all about correctness and keeping the data useful.
Of course, there are others. I even saw an article titled “42 V’s of big data.” It is getting ridiculous, really fast.
- MLOps engineer by day
- AI and data engineering consultant by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @email@example.com