The silly mistakes in exploratory data analysis
I am not sure if I should call those situations “mistakes.” It mostly happens in the first stage of analysis when we discover something obvious, something well known to everybody. Well, to be precise, it is usually something well known to everybody else. The situation becomes a mistake when we announce the results and people say “That’s true. So what?”
I remember a situation in which I was analyzing who was buying a certain kind of service, and I did not know a lot about the business yet. Picture the conversation: “What is the result?” “Well, the people who … (here comes a description of a group) buy the most.” “Yeah… great. That’s our largest segment of customers. Who besides them?” “(another description)” “So… the second largest segment? Did you spend half a day on that?”
At least it helped me notice that I don’t have enough information to differentiate between customers in any meaningful way and I should spend more time adding new features to the dataset.
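In hindsight, the trap in that conversation can be shown with a few lines of code. This is only a sketch on toy data I made up for illustration (the segment names `A` and `B` and all the numbers are assumptions, not the real dataset): raw buyer counts mostly mirror segment size, while the purchase rate within each segment says something about who is more likely to buy.

```python
from collections import Counter

# Toy data, invented for illustration: one (segment, bought) pair per customer.
customers = (
    [("A", True)] * 300 + [("A", False)] * 700
    + [("B", True)] * 90 + [("B", False)] * 110
)

buyers = Counter(segment for segment, bought in customers if bought)
sizes = Counter(segment for segment, _ in customers)

# The raw count is dominated by the size of the segment...
top_by_count = max(buyers, key=buyers.get)

# ...while the rate shows which segment is more likely to buy.
rates = {segment: buyers[segment] / sizes[segment] for segment in sizes}
top_by_rate = max(rates, key=rates.get)

print(top_by_count, top_by_rate, rates)
```

With these made-up numbers, the largest segment wins on the raw count, but the smaller one has the higher purchase rate. Whether the rate is the right metric depends on the business question, so treat it as one possible sanity check.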
It is funny when people who analyze data “discover” facts that are well known to everybody else. I don’t feel bad about that, because every such situation reminds me of the stories told by my data mining professor at the university.
Before I tell the story, I must clarify something: a unit of blood is the volume of blood in a single container used for transfusions (approximately 500 ml).
Back to the story. The professor’s team was analyzing medical data related to some surgery (I don’t remember the details). At some point, they spotted a few disturbing correlations. They discovered that patients who lost more than two units of blood always had some complications after the surgery.
The team of IT guys who didn’t know a lot about medicine were proud of themselves. The professor said that they even envisioned getting a Nobel Prize in medicine.
When they announced the result to the doctors who gave them the data, the doctors started laughing. After that, the doctors showed them the first chapter of a textbook for medical students. What the team thought to be “groundbreaking results” was literally at the beginning of a “Surgery 101” book.
The second story told by the professor was about the same dataset. The team noticed that if a particular doctor (let’s call him Doctor X) performs surgery on Monday afternoon, the patient always ends up with some problems. In this case, they were informed that the time and the day of the week do not matter because Doctor X is just a terrible surgeon.
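The Doctor X story is a case of a confounded correlation, and a quick check for it can be sketched in code. The records below are toy data I invented for illustration (the doctors `Y` and `Z`, the time slots, and the outcomes are all assumptions, and `complication_rate` is a hypothetical helper): the idea is to compute the complication rate per time slot, per doctor, and then per time slot again with Doctor X excluded.

```python
from collections import defaultdict

# Toy records, invented for illustration: (doctor, slot, had_complication).
surgeries = [
    ("X", "Mon PM", True), ("X", "Mon PM", True), ("X", "Wed AM", True),
    ("Y", "Mon PM", False), ("Y", "Wed AM", False), ("Y", "Fri AM", False),
    ("Z", "Mon PM", False), ("Z", "Wed AM", False), ("Z", "Fri AM", False),
]


def complication_rate(records, key_index):
    """Share of surgeries with complications, grouped by one column."""
    totals, bad = defaultdict(int), defaultdict(int)
    for record in records:
        key = record[key_index]
        totals[key] += 1
        bad[key] += record[2]  # True counts as 1, False as 0
    return {key: bad[key] / totals[key] for key in totals}


by_slot = complication_rate(surgeries, 1)
by_doctor = complication_rate(surgeries, 0)

# Same grouping by slot, but without Doctor X:
# does the "Monday afternoon effect" survive?
without_x = [record for record in surgeries if record[0] != "X"]
by_slot_without_x = complication_rate(without_x, 1)

print(by_slot)
print(by_doctor)
print(by_slot_without_x)
```

In this toy dataset, Monday afternoon looks dangerous until Doctor X is removed, and then the effect disappears. On real data the picture is rarely that clean, so this is a first check, not a proof.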
So maybe we should not worry about rediscovering well-known information? Instead, we should be concerned every time the first, rudimentary analysis does not confirm the expected facts, because maybe we have just discovered that we are living a lie.
- MLOps engineer by day
- AI and data engineering consultant by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @firstname.lastname@example.org