“A few useful things to know about machine learning” is the title of a paper written by Pedro Domingos. In this paper, he shares 12 lessons that he describes as the “folk wisdom” of machine learning: knowledge that is hard to come by but crucial for success.
The thing that draws my attention is the dilemma of getting more input data. On the one hand, we all know that the easiest path to a successful machine learning model is to get more data. We know it because we have learned that more data helps us avoid overfitting. However, is it always true?
What is overfitting? Pedro Domingos explains it by decomposing generalization error into bias and variance. He defines bias as a learner’s tendency to consistently learn the same wrong thing, and variance as the tendency to learn random things irrespective of the real signal.
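The two definitions can be made concrete with a small experiment. The sketch below is my own illustration, not something taken from the paper: the sine-shaped signal, the noise level, and the two toy learners (`predict_mean` and `predict_1nn`) are all assumptions chosen to keep the example short. It trains each learner on many freshly sampled training sets and looks at how its prediction at a single point behaves.

```python
import math
import random
import statistics

def true_signal(x):
    # Assumed ground truth for this illustration.
    return math.sin(x)

def sample_training_set(rng, n=30, noise=0.3):
    """Draw n noisy observations of the signal on [0, pi]."""
    return [(x, true_signal(x) + rng.gauss(0, noise))
            for x in (rng.uniform(0, math.pi) for _ in range(n))]

def predict_mean(train, x):
    # Ignores x and always answers with the average label.
    return statistics.fmean([y for _, y in train])

def predict_1nn(train, x):
    # Copies the label of the closest training example, noise included.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bias_and_variance(predict, x0, runs=500, seed=0):
    """Estimate bias and variance of predictions at x0 over many training sets."""
    rng = random.Random(seed)
    predictions = [predict(sample_training_set(rng), x0) for _ in range(runs)]
    bias = statistics.fmean(predictions) - true_signal(x0)
    variance = statistics.pvariance(predictions)
    return bias, variance

x0 = math.pi / 2
mean_bias, mean_variance = bias_and_variance(predict_mean, x0)
nn_bias, nn_variance = bias_and_variance(predict_1nn, x0)
print(f"mean predictor: bias={mean_bias:+.3f}, variance={mean_variance:.3f}")
print(f"1-NN predictor: bias={nn_bias:+.3f}, variance={nn_variance:.3f}")
```

The mean predictor should be wrong in the same way on every run (large bias, small variance), while the nearest-neighbour predictor should be right on average but jump around with the noise (small bias, large variance).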
When we look at this definition, we see that getting more data does not solve all our problems. If we get more data that is similar to the data we already have, we may reinforce the classifier’s bias. It is going to learn that some classes occur more often than others. That is not always bad, especially if this is what we see in reality.
What may decrease the bias is adding more variety to the training set. We may collect more rare examples to avoid overfitting to the most typical case. However, that is also going to increase the variance: by adding rare cases, we also increase the noise in the training data.
Feature engineering vs curse of dimensionality
We can always create more features to make it easier to spot the difference between observations, can’t we? Well, that is another problem. Due to the curse of dimensionality, adding more features makes observations harder to tell apart because, as Pedro Domingos said, in high dimensions all examples look alike.
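A quick way to see what “all examples look alike” means is to measure how different the nearest and the farthest neighbour of a point are as the number of features grows. The sketch below is an illustration under my own assumptions: the features are independent and uniformly random, and `relative_contrast` is a helper name I made up for this example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"{dim:>4} dimensions: relative contrast = {contrast:.2f}")
```

With only two features, the farthest point is many times farther away than the nearest one. With a thousand random features, the gap should shrink to a small fraction of the distance, so a similarity-based model has very little to work with.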
On the one hand, feature engineering is the most crucial factor. Creating a good feature may give us a new variable that makes the difference between classes visible. On the other hand, it may also add noise that prevents the model from finding the optimal solution.
As Domingos puts it: “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
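Here is a toy example of a feature that makes the difference between classes visible. Everything in it is an assumption made for illustration: the data is synthetic (points labelled by whether they fall inside a circle), and `best_threshold_accuracy` is a hypothetical helper that finds the best single-threshold rule on one feature. Neither raw coordinate separates the classes well, but the engineered feature x² + y² separates them perfectly by construction.

```python
import random

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
labels = [x * x + y * y < 0.5 for x, y in points]  # True = inside the circle

def best_threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on one feature (either direction)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_true = sum(labels)
    best = max(total_true, n - total_true) / n  # baseline: always predict one class
    true_below = 0
    for i, (_, label) in enumerate(pairs, start=1):
        true_below += label
        # Rule "True below the threshold, False above it".
        false_above = (n - i) - (total_true - true_below)
        correct = true_below + false_above
        # Also consider the mirrored rule, which gets the remaining examples right.
        best = max(best, correct / n, (n - correct) / n)
    return best

raw_accuracy = best_threshold_accuracy([x for x, _ in points], labels)
engineered_accuracy = best_threshold_accuracy(
    [x * x + y * y for x, y in points], labels
)
print(f"best threshold on raw x:     {raw_accuracy:.2f}")
print(f"best threshold on x^2 + y^2: {engineered_accuracy:.2f}")
```

The same data, the same trivial model, and only the feature changed. Of course, in this toy case we know in advance which feature to build; in a real project finding it is the hard part.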
What can’t be learned
The last observation is that a machine learning model cannot learn everything. Pedro Domingos wrote: “Representable does not imply learnable.” It is not enough to gather data, clean it, create some features, and start training. Sometimes the algorithm fails, and it does not matter what hyperparameters we set.
Occasionally it can be solved by trying a different algorithm, but there may be cases when we should give up because the result is going to be useless anyway.
Those three issues are the reasons why feature engineering and data preprocessing are the most critical parts of creating a model. If we get them wrong, no clever algorithm is going to fix it.