Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn

What do we do with the input DataFrame before building the model? After exploratory data analysis, we start modifying features. We remove some of them, and a few need to be scaled or normalized. Then we encode the categorical features as numbers. A lot of work.

Transforming all input features at once would be nice. Fortunately, we can easily do it in Scikit-learn. Let’s do it step by step.

Probably everyone who has tried creating a machine learning model at least once is familiar with the Titanic dataset. Because of that, I am going to use it as an example.

After loading the dataset, I decided that the Name, Cabin, Ticket, and PassengerId columns are redundant. My preprocessing pipeline has to remove them. For now, I am going to store their names in a list:

to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']

After that, I looked for numeric features that should be normalized. There are two such columns: Age and Fare. I also noticed missing Age values. I am going to replace them with the median age of the passengers.

Now, I can store the names of the numeric columns in another list. I must also define the pipeline that fills in the missing values and normalizes all numeric features.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])
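
To see what the two steps do, here is a minimal sketch that runs the same pipeline on a tiny DataFrame. The rows are made up for illustration and do not come from the Titanic dataset, and the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up values, used only to illustrate the two steps.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Fare': [10.0, 20.0, 30.0],
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

# The missing Age is replaced with the median of the observed ages (30),
# then each column is rescaled to the [0, 1] range.
scaled = numeric_transformer.fit_transform(toy)
print(scaled)
```

The imputer runs first, so the scaler never sees a missing value.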

Finally, I can deal with categorical variables. In the first step, I am going to impute the missing values, but in this case, I want to use the most frequently occurring value as the default. The second step of the pipeline transforms categorical variables using one-hot encoding.

As before, I also put the names of the categorical columns in a list.

from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Embarked', 'Sex', 'Pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])
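
The same kind of check works for the categorical pipeline. This is a sketch on made-up rows (the `toy` DataFrame is hypothetical, and I use only two of the three columns to keep the output small). OneHotEncoder returns a sparse matrix by default, so I convert it to a dense array before printing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up values; the third passenger has no Embarked value.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
}, dtype=object)

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

# The missing Embarked value becomes 'S' (the most frequent one),
# then every category gets its own 0/1 column.
encoded = categorical_transformer.fit_transform(toy).toarray()

# The categories found in each column, in output order.
categories = categorical_transformer.named_steps['onehot'].categories_
print(categories)
print(encoded)
```

Two categories per column give four output columns in total.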

I have everything I need to configure a ColumnTransformer.

I want to keep the columns which have not been transformed, so I set the remainder to “passthrough.” I can also instruct the transformer to drop such columns: either put “drop” as the value or don’t specify it at all, because dropping is the default behavior.

The transformers parameter holds the combined pipeline. This time, each step is a tuple of three elements: the name of the step, the transformer that implements it (or the string “drop”), and the columns that should be processed by that step.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        ('remove', 'drop', to_be_removed)
    ])
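
To make the effect of the remainder setting concrete, here is a small sketch that compares both values on a made-up DataFrame. The column names `a`, `b`, `c` and the helper `transform_with` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 'a' is scaled, 'c' is dropped explicitly,
# and 'b' is not mentioned by any step.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [10, 20, 30],
    'c': [7, 8, 9],
})

def transform_with(remainder):
    transformer = ColumnTransformer(
        remainder=remainder,
        transformers=[
            ('scale', MinMaxScaler(), ['a']),
            ('remove', 'drop', ['c']),
        ])
    return transformer.fit_transform(toy)

kept = transform_with('passthrough')  # scaled 'a' plus the untouched 'b'
dropped = transform_with('drop')      # scaled 'a' only
print(kept.shape, dropped.shape)
```

The transformed columns come first in the output, and the passed-through columns are appended after them.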

To transform the columns, call the fit_transform method and pass the DataFrame to it.

preprocessor.fit_transform(data)

Bartosz Mikulski
