The silly mistakes in exploratory data analysis
I am not sure whether I should call these situations “mistakes.” They mostly happen in the first stage of analysis, when we discover something obvious, something well known to everybody. Well, to be precise, it is usually something well known to everybody else. The situation becomes a mistake when we announce the results and people say, “That’s true. So what?”
I remember a situation in which I was analyzing who was buying a certain kind of service, and I did not know a lot about the business yet. Picture the conversation: “What is the result?” “Well, the people who … (here comes a description of a group) buy the most.” “Yeah… great. That’s our largest segment of customers. Who besides them?” “(another description)” “So… the second largest segment? Did you spend half a day on that?”
At least it helped me notice that I did not have enough information to differentiate between customers in any meaningful way, and that I should spend more time adding new features to the dataset.
It is funny when people who analyze data “discover” facts that are well known to everybody else. I don’t feel bad about that, because every such situation reminds me of the stories told by my data mining professor at the university.
Before I tell the story, I must clarify something: a unit of blood is the volume of blood in a single container used for transfusions (approximately 500 ml).
Back to the story. The professor’s team were analyzing medical data related to some surgery (I don’t remember the details). At some point, they spotted a few disturbing correlations. They discovered that patients who lose more than two units of blood always have some complications after the surgery.
The team of IT guys, who didn’t know a lot about medicine, were proud of themselves. The professor said that they even envisioned getting a Nobel Prize in medicine.
When they announced the result to the doctors who gave them the data, the doctors started laughing. After that, the doctors showed them the first chapter of a textbook for medical students. What the team thought to be “groundbreaking results” was literally at the beginning of a “Surgery 101” book.
The second story told by the professor was about the same dataset. The team noticed that if a particular doctor (let’s call him Doctor X) performs surgery on Monday afternoon, the patient always ends up with some problems. In this case, they were informed that the time and the day of the week do not matter because Doctor X is just a terrible surgeon.
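The Doctor X story can be sketched in a few lines of Python. This is only an illustration: the records below are invented, not the professor's dataset, and the `complication_rate` helper is a hypothetical name. The sketch shows how a pattern that looks like a time-of-day effect can disappear once the data is grouped by surgeon instead.

```python
from collections import defaultdict

# Invented records for illustration: (surgeon, time slot, had complications)
surgeries = [
    ("Doctor X", "Monday afternoon", True),
    ("Doctor X", "Monday afternoon", True),
    ("Doctor X", "Monday afternoon", True),
    ("Doctor X", "Thursday morning", True),
    ("Doctor Y", "Monday afternoon", False),
    ("Doctor Y", "Thursday morning", False),
    ("Doctor Y", "Thursday morning", False),
    ("Doctor Y", "Thursday morning", False),
    ("Doctor Z", "Monday afternoon", False),
    ("Doctor Z", "Thursday morning", False),
]


def complication_rate(records, key):
    """Return the share of surgeries with complications for each group."""
    totals = defaultdict(int)
    complications = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        complications[group] += record[2]  # True counts as 1, False as 0
    return {group: complications[group] / totals[group] for group in totals}


# Grouped by time slot, Monday afternoon looks dangerous.
by_slot = complication_rate(surgeries, key=lambda r: r[1])

# Grouped by surgeon, the problems concentrate on one person.
by_surgeon = complication_rate(surgeries, key=lambda r: r[0])

# Without Doctor X, the time slots no longer differ.
without_x = [r for r in surgeries if r[0] != "Doctor X"]
by_slot_without_x = complication_rate(without_x, key=lambda r: r[1])

print(by_slot)
print(by_surgeon)
print(by_slot_without_x)
```

In this made-up sample, the gap between Monday afternoon and Thursday morning exists only because Doctor X operates mostly on Monday afternoons. Slicing the same data by a second variable is a cheap check before announcing a correlation, although it only helps when the confounding variable is actually in the dataset.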
So maybe we should not worry about rediscovering well-known information? Instead, we should be concerned every time the first, rudimentary analysis does not confirm the expected facts, because maybe we have just discovered that we are living a lie.
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz