A few useful things to know about machine learning

“A few useful things to know about machine learning” is the title of a paper written by Pedro Domingos. In it, he shares 12 lessons that he describes as the “folk wisdom” of machine learning: knowledge that is hard to come by but crucial for success.

The thing that draws my attention is the dichotomy of getting more input data. On the one hand, we all know that the easiest path to a successful machine learning model is to get more data. We know it because we have learned that more data helps us avoid overfitting. However, is that always true?


What is overfitting? According to Pedro Domingos, one way to understand overfitting is to decompose generalization error into bias and variance. He defines bias as a learner’s tendency to consistently learn the same wrong thing, and variance as the tendency to learn random things irrespective of the real signal.
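To make the two terms concrete, here is a minimal sketch that estimates squared bias and variance by retraining two very simple learners on many random training sets. Everything in it is an assumption made for illustration and does not come from the paper: the sine-shaped signal, the noise level, the sample sizes, and the two learners (a constant predictor and a 1-nearest-neighbour predictor).

```python
import math
import random
import statistics


def true_signal(x):
    # Hypothetical "real signal" the learners try to recover.
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=30, noise=0.3):
    # Draw n noisy observations of the signal at random x positions.
    xs = [rng.random() for _ in range(n)]
    return [(x, true_signal(x) + rng.gauss(0, noise)) for x in xs]


def fit_constant(train):
    # Always predicts the mean target: too rigid to follow the signal.
    mean_y = statistics.fmean(y for _, y in train)
    return lambda x: mean_y


def fit_nearest_neighbour(train):
    # Predicts the target of the closest training point: follows the noise.
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def bias_and_variance(fit, n_runs=200, seed=0):
    # Retrain on many training sets and compare predictions at fixed test points.
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(21)]
    predictions = [[] for _ in test_xs]
    for _ in range(n_runs):
        model = fit(sample_training_set(rng))
        for i, x in enumerate(test_xs):
            predictions[i].append(model(x))
    # Squared bias: how far the average prediction is from the signal.
    bias_sq = statistics.fmean(
        (statistics.fmean(p) - true_signal(x)) ** 2
        for x, p in zip(test_xs, predictions)
    )
    # Variance: how much predictions change from one training set to another.
    variance = statistics.fmean(statistics.pvariance(p) for p in predictions)
    return bias_sq, variance


for name, fit in [("constant", fit_constant), ("1-NN", fit_nearest_neighbour)]:
    bias_sq, variance = bias_and_variance(fit)
    print(f"{name:>8}: bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

Under these assumptions, the constant predictor should show the larger squared bias (it keeps learning the same wrong thing), while the nearest-neighbour predictor should show the larger variance (it learns whatever noise happens to be in its training set).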

When we look at this definition, we see that getting more data does not solve all our problems. If we get more data that is similar to the data we already have, we will increase the classifier’s bias. It is going to learn that some classes occur more often than others. That is not always bad, especially if this is what we see in reality.

What may decrease the bias is creating more variety in the training set. We may get more examples of rare cases to avoid overfitting to the most typical case. However, that comes at a cost: by adding rare cases, we also increase the noise in the training data, which increases the variance.

Feature engineering vs curse of dimensionality

We can always create more features to make it easier to spot the difference between observations, can’t we? Well, that is another problem. Due to the curse of dimensionality, adding more features makes observations harder to tell apart because, as Pedro Domingos said, in high dimensions, all examples look alike.
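The “all examples look alike” effect can be observed with a small experiment. The sketch below is only an illustration under assumed settings (uniformly random points, Euclidean distance, 200 points, a fixed seed). It measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the number of dimensions grows. The function name `relative_contrast` is made up for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.2f}")
```

With these settings, the contrast should shrink as the dimension grows: in two dimensions the farthest point is many times farther away than the nearest one, while in a thousand dimensions all points sit at nearly the same distance from the query.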

On the one hand, feature engineering is the most crucial factor. Creating a good feature may give us a new variable that makes the difference between values of the dependent variable visible. On the other hand, a new feature may also add noise that prevents the model from finding the optimal solution.
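A rough way to see how noisy features can hurt is to train a similarity-based model with and without them. In the sketch below, the label depends only on the first feature, and every additional feature is pure random noise. The dataset, the 1-nearest-neighbour classifier, and all the sizes are assumptions chosen for illustration, not a recipe from the paper.

```python
import math
import random


def make_dataset(rng, n, n_noise):
    """Label depends only on the first feature; the rest are pure noise."""
    rows = []
    for _ in range(n):
        signal = rng.random()
        features = [signal] + [rng.random() for _ in range(n_noise)]
        rows.append((features, int(signal > 0.5)))
    return rows


def one_nn_accuracy(n_noise, n_train=200, n_test=200, seed=0):
    """Accuracy of a 1-nearest-neighbour classifier with n_noise extra features."""
    rng = random.Random(seed)
    train = make_dataset(rng, n_train, n_noise)
    test = make_dataset(rng, n_test, n_noise)
    correct = 0
    for features, label in test:
        # Predict the label of the closest training example.
        _, predicted = min(train, key=lambda row: math.dist(row[0], features))
        correct += predicted == label
    return correct / n_test


for n_noise in (0, 5, 50):
    print(f"noise features={n_noise:>2}  accuracy={one_nn_accuracy(n_noise):.2f}")
```

With no noise features, the classifier should be almost always right. As noise features are added, they dominate the distance calculation, and the accuracy should drift toward a coin flip even though the useful feature is still present in the data.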

As Pedro Domingos wrote: “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

What can’t be learned

The last observation is that a machine learning model cannot learn everything. Pedro Domingos wrote: “Representable does not imply learnable.” It is not enough to gather data, clean it, create some features, and start training. Sometimes the algorithm fails, and it does not matter what hyperparameters we set.

Occasionally, the problem can be solved by trying a different algorithm, but there may be cases when we should give up because the result is going to be useless anyway.

Those three issues are the reasons why feature engineering and data preprocessing are the most critical parts of creating a model. If we get them wrong, no clever algorithm is going to fix it.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group