The problem of large categorical variables in machine learning

Recently, I was writing an article about dealing with categorical variables using techniques like one-hot encoding or dummy coding. I wondered what the correct approach is when the categorical variable has many unique values. After all, such encodings create roughly one new feature per unique value, so a large categorical variable would produce a vast number of new features.

The first approach is not very sophisticated: we can replace the individual categories with a smaller set of broader groups. For example, if the feature contains the names of products in a grocery store, we can replace the names with generic product categories such as vegetables, cheese, bread, and so on.
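The grouping idea can be sketched with a plain lookup table. The product names, the group labels, and the "other" fallback below are made up for illustration; a real mapping would come from domain knowledge.

```python
import pandas as pd

# Hypothetical product names; not taken from any real dataset.
products = pd.Series(
    ["carrot", "cheddar", "baguette", "tomato", "gouda", "rye bread", "salt"]
)

# Hypothetical lookup table from a product name to a broader group.
product_groups = {
    "carrot": "vegetable",
    "tomato": "vegetable",
    "cheddar": "cheese",
    "gouda": "cheese",
    "baguette": "bread",
    "rye bread": "bread",
}

# Products missing from the lookup table fall into a catch-all group.
grouped = products.map(product_groups).fillna("other")
print(grouped.value_counts())
```

After the mapping, the feature has only as many unique values as there are groups, so one-hot encoding it stays manageable.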

Feature hashing

What if there is no hierarchy? What if it is not possible to group categories in any meaningful way? I started looking for a solution, and I found a technique called “feature hashing.”

In short, we define a hashing function that maps every category to one of a fixed number of hash values. Because many categories can share the same hash, the encoded space is much smaller than the number of categories. Fortunately, if we use Scikit-learn, we don’t need to write such a function ourselves because the FeatureHasher class already provides it.
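To make the idea concrete, here is a minimal hand-written sketch of such a hashing function. It is only an illustration: the helper name hash_bucket and the choice of MD5 are assumptions made for this example, and Scikit-learn uses its own hash function internally.

```python
import hashlib


def hash_bucket(category: str, n_buckets: int) -> int:
    """Map a category name to one of n_buckets slots (hypothetical helper)."""
    # MD5 is used only because it is deterministic across runs;
    # it is not what Scikit-learn uses.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Ten made-up categories squeezed into three buckets.
categories = [f"value_{i}" for i in range(1, 11)]
buckets = {category: hash_bucket(category, 3) for category in categories}

for category, bucket in buckets.items():
    print(category, "->", bucket)
```

With ten categories and only three buckets, some categories necessarily end up in the same bucket. That collision is the price paid for the smaller feature space.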

As input, we must specify the number of features. This value is the number of columns in the output, that is, how many columns the hasher can use to encode the categories. It is not the number of groups we want to get!

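from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Toy dataset: 'feature1' is the categorical column, 'feature2' is numeric.
data = pd.DataFrame([
    ['value_1', 23],
    ['value_2', 13],
    ['value_3', 42],
    ['value_4', 13],
    ['value_2', 46],
    ['value_1', 28],
    ['value_2', 32],
    ['value_3', 87],
    ['value_4', 98],
    ['value_5', 86],
    ['value_3', 45],
    ['value_2', 73],
    ['value_1', 36],
    ['value_3', 93]
], columns = ['feature1', 'feature2'])

# n_features is the number of output columns, not the number of groups.
feature_hasher = FeatureHasher(n_features = 3, input_type = 'string')

# With input_type = 'string', every sample must be an iterable of strings,
# so each category value is wrapped in a one-element list. A bare string
# would be treated as a sequence of characters or rejected, depending on
# the Scikit-learn version.
samples = [[value] for value in data['feature1']]

# Put the hashed columns next to the untouched numeric column.
pd.concat([
    pd.DataFrame(feature_hasher.fit_transform(samples).toarray()),
    data['feature2']], axis = 1)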
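To see that the number of features controls only the width of the output, the short sketch below encodes the same five categories with different settings. The values 2, 3, and 8 are arbitrary choices for illustration.

```python
from sklearn.feature_extraction import FeatureHasher
import numpy as np

categories = ["value_1", "value_2", "value_3", "value_4", "value_5"]
# Each sample is a one-element list, as required by input_type="string".
samples = [[category] for category in categories]

shapes = {}
for n_features in (2, 3, 8):
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    encoded = hasher.transform(samples).toarray()
    shapes[n_features] = encoded.shape
    # Every row has a single non-zero entry, +1 or -1, in the column
    # chosen by the hash of its category.
    assert np.all(np.abs(encoded).sum(axis=1) == 1)
    print(n_features, encoded.shape)
```

The number of rows never changes. Only the number of columns follows n_features, and the smaller it is, the more likely it is that two different categories share a column.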

There is one problem with the FeatureHasher class in Scikit-learn. I could not get it running inside a ColumnTransformer pipeline, because it throws an error.

I have reported the issue. If you want it fixed too, please upvote it ;)
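Until then, one possible workaround is to reshape the column before it reaches the hasher. The sketch below rests on an assumption that the original error message does not confirm: that the failure comes from FeatureHasher receiving the selected column in a form other than an iterable of string lists. The helper to_string_lists is hypothetical and not part of Scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_string_lists(column_frame):
    """Turn a one-column DataFrame into a list of one-element string lists.

    Hypothetical helper written for this sketch.
    """
    return [[str(value)] for value in column_frame.iloc[:, 0]]


data = pd.DataFrame({
    "feature1": ["value_1", "value_2", "value_3", "value_2"],
    "feature2": [23, 13, 42, 46],
})

# Reshape the selected column first, then hash it.
hashing_step = make_pipeline(
    FunctionTransformer(to_string_lists),
    FeatureHasher(n_features=3, input_type="string"),
)

transformer = ColumnTransformer(
    [("hashed", hashing_step, ["feature1"])],
    remainder="passthrough",  # keep feature2 unchanged
)

result = transformer.fit_transform(data)
# The result may be sparse or dense depending on its density.
if hasattr(result, "toarray"):
    result = result.toarray()
print(result.shape)
```

The output has the three hashed columns followed by the untouched numeric column. Whether this addresses the reported error depends on its actual cause, so treat it as something to try rather than a confirmed fix.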

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz