Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn

What do we do with the input DataFrame before building a model? After exploratory data analysis, we start modifying the features. Some of them get removed, and a few need to be scaled or normalized. Then we encode the categorical features as numbers. A lot of work.

Transforming all input features at once would be nice. Fortunately, Scikit-learn makes that easy with ColumnTransformer. Let’s do it step by step.

Probably everyone who has tried creating a machine learning model at least once is familiar with the Titanic dataset. Because of that, I am going to use it as an example.

After loading the dataset, I decided that the Name, Cabin, Ticket, and PassengerId columns are redundant. My preprocessing pipeline has to remove them. For now, I am going to store their names in a list:

to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']

After that, I looked for numeric features which should be normalized. There are two such columns: Age and Fare. I have also noticed missing age values. I am going to replace them with the median passenger age.

Now, I can store the names of the numeric columns in another list. I must also define the pipeline which fills in the missing values and normalizes all numeric features.
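Here is a minimal sketch of how such a check could look. The `sample` DataFrame and its values are made up for illustration; they only stand in for the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Fare': [7.25, 71.28, 8.05, 53.10],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Count the missing values in every column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The median ignores missing values, so it can serve as the replacement.
age_median = sample['Age'].median()
print(age_median)
```

In this toy frame, Age and Embarked each have one missing value, and the median of the three known ages is 26.0.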

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])
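To see what this pipeline does, it can be run on a tiny DataFrame. The sketch below uses made-up numbers and a separate `demo_numeric_transformer`, so it does not depend on the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values chosen so the result is easy to verify by hand.
numeric_demo = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Fare': [10.0, 30.0, 20.0, 50.0],
})

demo_numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

# Step 1 replaces the missing age with the median (30.0).
# Step 2 rescales every column to the [0, 1] range.
scaled = demo_numeric_transformer.fit_transform(numeric_demo)
print(scaled)
```

The missing age becomes 30.0 and ends up as 0.5 after scaling, because it lies halfway between the minimum (20) and the maximum (40).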

Finally, I can deal with categorical variables. In the first step, I am going to impute the missing values, but in this case, I want to use the most frequently occurring value as the default. The second step of the pipeline transforms categorical variables using one-hot encoding.

As before, I also put the names of the categorical columns in a list.

from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Embarked', 'Sex', 'Pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])
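The same kind of check works for the categorical pipeline. This is only a sketch on invented data: `categorical_demo` and `demo_categorical_transformer` are hypothetical names, and the missing value is written as `np.nan` because that is what SimpleImputer looks for by default.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one missing port of embarkation.
categorical_demo = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

demo_categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

# OneHotEncoder returns a sparse matrix by default,
# so convert it to a dense array for printing.
encoded = demo_categorical_transformer.fit_transform(categorical_demo).toarray()
print(encoded)

# Categories found in each column, in the order of the output columns.
print(demo_categorical_transformer.named_steps['onehot'].categories_)
```

The missing Embarked value is replaced with 'S' (the most frequent one), and every category gets its own 0/1 column: C, S, female, male.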

I have everything I need to configure a ColumnTransformer.

I want to keep the columns that are not listed in any transformer, so I set remainder to 'passthrough'. I could also tell the transformer to drop such columns: either pass 'drop' or leave the parameter out, because dropping is the default behavior.

The transformers parameter holds the combined pipeline. For each step, I must configure not only its name and the transformer that implements it but also the columns it should process. Because the remaining columns are passed through, the redundant columns get a step of their own, which uses the special value 'drop' instead of a transformer.

from sklearn.compose import ColumnTransformer

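preprocessor = ColumnTransformer(
    # keep the columns that none of the steps below mention
    remainder='passthrough',
    transformers=[
        # (step name, transformer, columns it applies to)
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        # the string 'drop' removes the listed columns
        ('remove', 'drop', to_be_removed)
    ])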
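The effect of the remainder setting is easiest to see on a toy example. The sketch below uses a made-up two-column DataFrame (`toy`) and two hypothetical transformers that differ only in that setting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: only column 'a' is listed in the transformers.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10, 20, 30]})

# remainder is not set, so the default ('drop') removes column 'b'.
dropping = ColumnTransformer(
    transformers=[('scale', MinMaxScaler(), ['a'])])

# remainder='passthrough' appends column 'b' unchanged.
keeping = ColumnTransformer(
    remainder='passthrough',
    transformers=[('scale', MinMaxScaler(), ['a'])])

dropped_result = dropping.fit_transform(toy)
kept_result = keeping.fit_transform(toy)

print(dropped_result.shape)  # (3, 1)
print(kept_result.shape)     # (3, 2)
```

Note that the passed-through columns are placed after the outputs of the listed transformers.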

To transform the columns, call the fit_transform method and pass the input DataFrame (here, the data variable) to it.

preprocessor.fit_transform(data)
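For readers who want to try the whole thing without downloading the dataset, here is a self-contained sketch. The four rows in `toy_titanic` are invented and only mimic the Titanic column names; the names `toy_preprocessor` and `result` are also mine. Depending on how many zeros the output contains, ColumnTransformer may return a sparse matrix, so the sketch converts the result to a dense array just in case.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical rows that only mimic the structure of the Titanic data.
toy_titanic = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 1],
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Ticket': ['T1', 'T2', 'T3', 'T4'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', 'S', np.nan],
})

toy_preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', MinMaxScaler())]), ['Age', 'Fare']),
        ('categorical', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder())]), ['Embarked', 'Sex', 'Pclass']),
        ('remove', 'drop', ['Name', 'Cabin', 'Ticket', 'PassengerId']),
    ])

result = toy_preprocessor.fit_transform(toy_titanic)
if sparse.issparse(result):
    result = result.toarray()
result = np.asarray(result, dtype=float)

# 2 scaled numeric columns + 6 one-hot columns + 1 passed-through column.
print(result.shape)
```

The only column that is passed through here is Survived, and it lands at the end of the output. In a real project, I would probably separate the target variable from the features before preprocessing.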

Bartosz Mikulski
Bartosz Mikulski * data scientist / software engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group