Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn
What do we do with the input DataFrame before building the model? After exploratory data analysis, we start modifying features. We are going to remove some of them, and a few need to be scaled or normalized. Then we encode the categorical features as numbers. A lot of work.
Transforming all input features at once would be nice. Fortunately, we can easily do it in Scikit-Learn. Let’s do it step by step.
Probably everyone who has tried creating a machine learning model at least once is familiar with the Titanic dataset. Because of that, I am going to use it as an example.
After loading the dataset, I decided that the Name, Cabin, Ticket, and PassengerId columns are redundant. My preprocessing pipeline has to remove them. For now, I am going to store their names in a list:
    to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']
After that, I looked for numeric features which should be normalized. There are two such columns: Age and Fare. I have also noticed missing Age values. I am going to replace them with the median passenger age.
Now, I can store the names of the numeric columns in another list. I must also define the pipeline which fills in the missing values and then normalizes all numeric features.
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    numeric_features = ['Age', 'Fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', MinMaxScaler())])
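To see what this pipeline does, here is a minimal sketch that runs it on a tiny, made-up DataFrame (the values are hypothetical, not taken from the Titanic data). The missing Age is first replaced with the median of the known ages, and then every column is rescaled to the 0-1 range.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Fare': [10.0, 30.0, 20.0, 50.0],
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

# The missing Age becomes 30 (the median of 20, 40, 30),
# then each column is scaled using its own min and max.
scaled = numeric_transformer.fit_transform(toy)
print(scaled)
```

The result is a NumPy array, not a DataFrame, so the column names are gone at this point.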
Finally, I can deal with categorical variables. In the first step, I am going to impute the missing values, but in this case, I want to use the most frequently occurring value as the default. The second step of the pipeline transforms categorical variables using one-hot encoding.
As before, I also put the names of the categorical columns in a list.
    from sklearn.preprocessing import OneHotEncoder

    categorical_features = ['Embarked', 'Sex', 'Pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder())])
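The categorical pipeline can be tried out the same way. This is a small sketch on made-up data with only two of the three columns; the missing Embarked value should be filled with the most frequent port before encoding. With the default settings, OneHotEncoder returns a sparse matrix, so I convert it to a dense array to look at it.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, only for illustration.
toy_cat = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

encoded = categorical_transformer.fit_transform(toy_cat)

# One column per category: Embarked (C, S) and Sex (female, male).
dense = encoded.toarray()
print(dense)
```

Every row contains exactly one 1 per original column, which is what one-hot encoding is supposed to produce.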
I have everything I need to configure a ColumnTransformer.
I want to keep the columns which have not been transformed, so I set the remainder parameter to “passthrough.” I can also instruct the transformer to drop such columns: either set remainder to “drop” or leave it out, because dropping is the default behavior.
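The transformers parameter is the combined pipeline. This time, for every step I must configure not only its name and the transformer that implements it but also the columns that should be processed by that step. Instead of a transformer, I can pass the string “drop” to remove the listed columns.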
    from sklearn.compose import ColumnTransformer

    preprocessor = ColumnTransformer(
        remainder = 'passthrough',
        transformers=[
            ('numeric', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features),
            ('remove', 'drop', to_be_removed)
    ])
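The effect of the remainder setting is easiest to see side by side. The sketch below uses a hypothetical three-column DataFrame and a bare MinMaxScaler instead of the full pipelines, so it only illustrates the idea.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame: one column to scale, one to drop, one not mentioned at all.
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0],
    'Name': ['a', 'b', 'c'],
    'SibSp': [0, 1, 2],
})

transformers = [
    ('numeric', MinMaxScaler(), ['Fare']),
    ('remove', 'drop', ['Name']),
]

# SibSp is not listed in any step, so remainder decides what happens to it.
kept = ColumnTransformer(transformers=transformers,
                         remainder='passthrough').fit_transform(toy)
dropped = ColumnTransformer(transformers=transformers,
                            remainder='drop').fit_transform(toy)

print(kept.shape)     # scaled Fare plus the untouched SibSp
print(dropped.shape)  # scaled Fare only
```

Note that the passed-through columns are appended after the outputs of the listed transformers, so the column order of the result may differ from the input.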
To transform the columns, call the fit_transform method of the preprocessor and pass the DataFrame to it.
- Twitter: @mikulskibartosz