I want to share a DataFramed episode that changed my perspective on exploratory data analysis. In this episode, Hugo Bowne-Anderson talks with Cassie Kozyrkov.
Cassie Kozyrkov mentioned a concept which she described as “wasting your data.” According to her, using all our data for exploration means that when we find something interesting and form a hypothesis, we no longer have any data left to test it.
A data point can be used for inspiration or rigor, but not both of them if you are dealing with uncertainty, if you want to go beyond your data.
The critical part is “dealing with uncertainty.” We don’t need two datasets for reporting purposes. If the only thing we are going to do is describe the history and make visualizations, we don’t need the second dataset.
We can just write “In the data from 2018 we have seen this thing.” It does not mean that the same situation is going to happen in the future. It does not say that we can prove any causality. It is just reporting. In this case, it is okay to use only one dataset.
But if we want to make a general observation, something that should be provable or disprovable, we need a validation set.
When we use some data to form an opinion about something, we need to validate that opinion on data we did not use to form it, just as we do when training a machine learning model.
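To make this concrete, here is a minimal sketch of setting part of the data aside before any exploration starts. It is not something discussed in the episode. The function name split_for_exploration, the 25% validation fraction, and the list of integers standing in for real rows are all assumptions made for illustration.

```python
import random


def split_for_exploration(records, validation_fraction=0.25, seed=0):
    """Shuffle the records and split them into (exploration, validation)."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_validation = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]   # locked away until a hypothesis exists
    exploration = shuffled[n_validation:]  # free to plot, slice, and get inspired by
    return exploration, validation


records = list(range(100))  # hypothetical stand-in for real rows
exploration, validation = split_for_exploration(records)
print(len(exploration), "rows for exploration,", len(validation), "rows held back")
```

The important part is the discipline, not the code: the validation rows should stay unopened until the hypothesis is written down.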
If we used all our data to get inspired, to figure out what questions we are going to ask, then we have no data left over to answer them rigorously.
The conclusion we draw may be unrelated to reality. We should get another dataset produced by the same process as the first one and check whether our conclusion still holds up.
We can think of this problem as overfitting the hypothesis to the data used during exploration. That should make it clear why we cannot use the same dataset to validate it.
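A small simulation can illustrate this kind of overfitting. The sketch below is built entirely on made-up data: 50 features and an outcome that are all pure random noise, so no real relationship exists. We "explore" the first half of the rows, pick the feature that looks most correlated with the outcome, and then measure the same correlation on the second half. The sizes, the seed, and the helper names are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
n_rows, n_features = 200, 50

# Everything is independent noise: there is nothing real to discover.
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
outcome = [rng.gauss(0, 1) for _ in range(n_rows)]

half = n_rows // 2


def corr_on(feature, start, stop):
    """Correlation between one feature and the outcome on rows start..stop."""
    return pearson(feature[start:stop], outcome[start:stop])


# "Exploration": look at the first half and keep the most striking feature.
best = max(range(n_features), key=lambda i: abs(corr_on(features[i], 0, half)))
exploration_corr = corr_on(features[best], 0, half)

# "Validation": check the same feature on rows we have not looked at.
validation_corr = corr_on(features[best], half, n_rows)

print(f"feature {best}: exploration r = {exploration_corr:.3f}, "
      f"validation r = {validation_corr:.3f}")
```

Because the winning feature was selected for looking good on the exploration rows, its correlation there is inflated by the selection itself. The validation rows played no part in that choice, so they give an honest estimate.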
It is important to remember that we always need two sets of data: a training or exploration set, and a testing set to verify the model or to prove or disprove the hypothesis. Reusing the same dataset for both of those purposes is just bad science.
To be honest, it should be evident to everyone. After all, in the scientific method, we are supposed to do research before forming the hypothesis. After forming the hypothesis comes the time to run experiments to prove or disprove it. Those are separate steps, so we should not reuse data from previous steps.