Bootstrapping vs. bagging

These two terms often appear in the same texts and tutorials. Some people seem to use them as synonyms.

They are not the same thing. They are not even similar. We see them in the same context because they describe two different steps of a single machine learning process.

Bootstrapping

Bootstrapping is a method of sample selection. The formal definition describes it as “random sampling with replacement”. Never mind, let’s forget the definition for a while and build some intuition around the term.

In short, it allows the same observation to be picked more than once while sampling (for example, when selecting observations to be used for training). It may be useful when we have a small dataset, but the algorithm requires a lot of data. Don’t get too excited. It won’t magically let you successfully use deep learning when you have only 10 examples in the training set.
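To make "sampling with replacement" concrete, here is a minimal sketch using only the Python standard library. The list of observations and the variable names are made up for illustration, and the seed is fixed only so that repeated runs give the same draw.

```python
import random

random.seed(42)  # fixed seed, only to make the sketch repeatable

# Hypothetical tiny dataset: five observations.
observations = ["a", "b", "c", "d", "e"]

# Bootstrapping: draw as many items as the dataset has, WITH replacement.
# The same observation may appear several times, and some may be missing.
bootstrap_sample = random.choices(observations, k=len(observations))

# For comparison: sampling WITHOUT replacement only reorders the dataset.
shuffled_sample = random.sample(observations, k=len(observations))

print("bootstrap sample:   ", bootstrap_sample)
print("without replacement:", shuffled_sample)
```

Note that the bootstrap sample has the same size as the original dataset, but it contains no new information. It only reweights the observations we already have.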

Bagging

Now, we can move on to “bagging.” Bagging is a technique of fitting multiple classifiers and combining them into one ensemble model.

Each of the classifiers gets a different training set, and that is why the words “bootstrapping” and “bagging” are often used together. The dataset for every classifier may be generated using bootstrapping.
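The two steps can be sketched by hand. The code below is an illustration under a few assumptions, not how any particular library implements bagging: it uses decision trees as the classifiers, a synthetic dataset from make_classification, and a simple majority vote to combine the predictions. The helper names fit_bagged_trees and predict_majority are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=10, seed=0):
    """Fit one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    models = []
    for _ in range(n_estimators):
        # Bootstrapping step: draw row indices with replacement.
        indices = rng.integers(0, n_samples, size=n_samples)
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[indices], y[indices])
        models.append(model)
    return models


def predict_majority(models, X):
    """Aggregation step: every model votes, the most common label wins."""
    # votes has shape (n_models, n_rows); labels are assumed to be 0, 1, ...
    votes = np.stack([model.predict(X) for model in models]).astype(int)
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic binary classification data, used only for the demonstration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = fit_bagged_trees(X, y)
predictions = predict_majority(models, X)
print("training accuracy:", (predictions == y).mean())
```

Bootstrapping happens only inside the loop, where the row indices are drawn. Everything else (fitting many models and combining their votes) is the bagging part.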

Bootstrapping and bagging

In scikit-learn, the problem is nicely encapsulated (and not so nicely generalized) in the sklearn.ensemble.BaggingClassifier class.

BaggingClassifier in its default configuration uses bootstrapping to choose samples for the training set of every classifier, but it can be configured to choose a subset of features randomly or to use random sampling without replacement.
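The three configurations mentioned above could look roughly like this. It is a sketch that assumes a recent scikit-learn version and a synthetic dataset. The specific values (10 estimators, half of the rows, half of the features) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data: 200 rows, 8 features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Default configuration: bootstrap=True, so rows are drawn with replacement.
default_bagging = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Sampling without replacement: every estimator gets a distinct half of the rows.
no_replacement = BaggingClassifier(
    n_estimators=10, bootstrap=False, max_samples=0.5, random_state=0
).fit(X, y)

# Random feature subsets: every estimator sees only half of the columns.
feature_subsets = BaggingClassifier(
    n_estimators=10, max_features=0.5, random_state=0
).fit(X, y)

# Inspect what the first estimator of each ensemble was trained on.
first_default_rows = default_bagging.estimators_samples_[0]
first_unique_rows = no_replacement.estimators_samples_[0]
first_columns = feature_subsets.estimators_features_[0]

print("distinct rows with bootstrapping:", len(np.unique(first_default_rows)))
print("distinct rows without replacement:", len(np.unique(first_unique_rows)))
print("columns used by the first estimator:", first_columns)
```

With bootstrap=False and the default max_samples=1.0, every estimator would receive the entire dataset, so sampling without replacement usually makes sense only together with a smaller max_samples.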

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz