Forward feature selection in Scikit-Learn

A pull request with an implementation of forward feature selection has been waiting in the Scikit-Learn repository since April 2017 (https://github.com/scikit-learn/scikit-learn/pull/8684), so we should probably get used to workarounds ;)

To get a workable substitute for forward feature selection in Scikit-Learn, we need two things:

  • The SelectFromModel class from the sklearn.feature_selection module.

  • An estimator that exposes either a coef_ or a feature_importances_ attribute after fitting.
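To make the second requirement concrete, here is a minimal sketch that checks which of the two attributes an estimator exposes after fitting. The synthetic dataset and the two estimators (LinearRegression and DecisionTreeRegressor) are my own picks for illustration, not part of the examples in this article.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem (an assumption made for this sketch)
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

attributes = {}
for estimator in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    estimator.fit(X, y)
    name = type(estimator).__name__
    # Record which attribute SelectFromModel could read from this estimator
    attributes[name] = {
        "coef_": hasattr(estimator, "coef_"),
        "feature_importances_": hasattr(estimator, "feature_importances_"),
    }
    print(name, attributes[name])
```

Linear models expose coef_, tree-based models expose feature_importances_, and SelectFromModel works with either one.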

Regression

In the case of regression, we can approximate forward feature selection using Lasso regression. This technique uses L1 regularization, which prevents the model from using too many features by minimizing not only the error but also the sum of the absolute values of the coefficients. It forces the model to set the coefficients of unimportant variables to 0, which means that such columns are not used.
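The effect of the regularization is easy to see on synthetic data. The sketch below is only an illustration: the dataset generated with make_regression and the alpha values are assumptions I chose, not values tuned for the mpg dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=0,
)

# Stronger regularization (larger alpha) leaves fewer non-zero coefficients
nonzero_counts = {}
for alpha in (0.01, 1.0, 10.0, 1000.0):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} of {X.shape[1]} coefficients are non-zero")
```

With a very large alpha the penalty dominates and every coefficient ends up at 0, so the alpha parameter indirectly controls how many features survive the selection.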

import seaborn as sns
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Load the dataset and drop the rows with missing values
mpg = sns.load_dataset('mpg').dropna()

# To simplify the code, I decided to drop the non-numeric columns
mpg = mpg.drop(columns=['origin', 'name'])

# Split the data into the features and the target variable
X = mpg.drop(columns=['mpg'])
y = mpg['mpg']

# Lasso shrinks the coefficients of unimportant features to 0;
# SelectFromModel keeps the columns whose coefficients stay (effectively) non-zero
estimator = Lasso()
featureSelection = SelectFromModel(estimator)
featureSelection.fit(X, y)

# NumPy array that contains only the selected columns
selectedFeatures = featureSelection.transform(X)
selectedFeatures

After fitting the model, we can filter the columns to get a list of variables used by the model.

X.columns[featureSelection.get_support()]
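For comparison with the Lasso-based workaround, newer Scikit-Learn releases (0.24 and later, as far as I know) ship a SequentialFeatureSelector class that performs an actual greedy forward search. The sketch below assumes such a version is installed; the synthetic dataset, the LinearRegression estimator, and the choice of three features are mine, picked only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which influence the target
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3,
    noise=5.0, random_state=0,
)

# Start with no features and greedily add the one that improves
# the cross-validated score the most, until 3 features are selected
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

mask = selector.get_support()       # boolean mask of the selected columns
X_selected = selector.transform(X)  # array with the selected columns only
print(mask)
```

Unlike SelectFromModel, this approach refits the estimator many times, so it is slower, but it does not require a coef_ or feature_importances_ attribute.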

Classification

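In the case of classification, forward feature selection may be replaced by fitting a decision tree-based classifier. ExtraTreesClassifier fits multiple randomized decision trees, and at every split, a tree considers only a random subset of the given variables.

The trees choose their splits using either the Gini impurity or the information gain metric (the criterion parameter). The resulting feature_importances_ tell us how much each feature improved the classification result, and SelectFromModel keeps the features whose importance is at least the mean importance, which is its default threshold for such estimators.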
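To show how the importances turn into a selection, here is a small sketch on synthetic data. The dataset generated with make_classification and the n_estimators value are assumptions made for this illustration, not the Titanic setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, only 3 of which carry information about the class
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# criterion can be "gini" (the default) or "entropy" (information gain)
forest = ExtraTreesClassifier(n_estimators=50, criterion="gini", random_state=0)
selector = SelectFromModel(forest).fit(X, y)

# The fitted estimator is available as estimator_; its importances sum to 1
importances = selector.estimator_.feature_importances_

# Reproduce the selection by hand: keep features at or above the threshold
manual_mask = importances >= selector.threshold_
print(importances.round(3), selector.threshold_)
```

The manually computed mask should match what get_support() returns, which makes it easier to see why a particular column was kept or dropped.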

import pandas as pd
import seaborn as sns
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Load the dataset and drop the rows with missing values
titanic = sns.load_dataset('titanic').dropna()

# Drop the target and the columns that repeat information stored under a different name
X = titanic.drop(columns=['survived', 'alive', 'class', 'who', 'adult_male', 'embark_town', 'alone'])
y = titanic['survived']

# Transform categorical variables to numeric values using one-hot encoding
X_one_hot_encoded = pd.get_dummies(X)

# Fit the trees and keep the features with at least average importance
estimator = ExtraTreesClassifier(n_estimators=10)
featureSelection = SelectFromModel(estimator)
featureSelection.fit(X_one_hot_encoded, y)

# NumPy array that contains only the selected columns
selectedFeatures = featureSelection.transform(X_one_hot_encoded)
selectedFeatures

We can get the names of selected columns too.

X_one_hot_encoded.columns[featureSelection.get_support()]
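If we want a fixed number of features instead of relying on the default threshold, SelectFromModel accepts the threshold and max_features parameters. The sketch below keeps the three most important features; the synthetic dataset and the number three are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, only 3 of which carry information about the class
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# threshold=-np.inf disables the importance cut-off,
# so max_features alone decides how many columns are kept
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=3,
)
X_top3 = selector.fit_transform(X, y)
print(X_top3.shape)
```

This is closer in spirit to forward selection, where we usually decide up front how many features we want to end up with.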

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz