Using scikit-automl for building a classification model

I decided to give scikit-automl a try. After all, if it works well, it allows me to focus on the exciting part of machine learning — feature engineering and getting more data.

We already automate a considerable part of model building. Nobody sets hyperparameters manually anymore; in every project there is either GridSearch or RandomSearch. Why can't we also automate selecting the right algorithm and applying basic preprocessing?
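
To be clear about what that automation looks like, here is a rough sketch of the usual hyperparameter search (the estimator and the grid are just an illustration, not the setup from this project):

# The kind of tuning we already automate: a cross-validated search over a hand-picked grid.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [3, 5, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
# search.fit(features, labels) tries every combination and keeps the best estimator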

It turns out we can, but it is not as easy as promised. At least not in scikit-automl. I decided to start with the classical “tutorial dataset” — Titanic.

The first problem I encountered was installing the right packages. That one was quite easy to fix, and I wrote another blog post about it.

Once I was able to import the package, I assumed that I could mix numerical and categorical variables in my dataset. I kind of can. It is possible to specify the feature types, as long as all of the columns are already numerical. The feat_type parameter tells scikit-automl how to preprocess the variables, but it will not automatically convert text to a numeric representation.


That was disappointing, but not as disappointing as the lack of ColumnTransformer. It turned out that scikit-automl supports only scikit-learn versions between 0.19 and 0.20. My scikit-learn got downgraded during installation, so instead of ColumnTransformer, I had to use LabelEncoder.
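
For comparison, with scikit-learn 0.20 or newer the same preprocessing could be expressed with a ColumnTransformer, roughly like this (a sketch based on the Titanic columns I use below, not something my downgraded environment would run):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns and pass the numerical ones through unchanged
preprocessing = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])
    ],
    remainder='passthrough'
)
# encoded = preprocessing.fit_transform(X)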

I converted the text to labels and continued playing with the tool.

import pandas as pd

data = pd.read_csv('../input/train.csv')

# Drop the target and the columns we are not going to use as features
X = data.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'])
y = data[['Survived']]

# Embarked contains missing values; cast everything to str so LabelEncoder does not fail on NaN
X['Embarked'] = X['Embarked'].apply(str)

from sklearn.preprocessing import LabelEncoder

pclass_encoder = LabelEncoder()
X['Pclass'] = pclass_encoder.fit_transform(X['Pclass'])

sex_encoder = LabelEncoder()
X['Sex'] = sex_encoder.fit_transform(X['Sex'])

embarked_encoder = LabelEncoder()
X['Embarked'] = embarked_encoder.fit_transform(X['Embarked'])

# Feature types in the same order as the columns of X:
# Pclass, Sex -> Categorical; Age, SibSp, Parch, Fare -> Numerical; Embarked -> Categorical
column_types = (['Categorical'] * 2) + (['Numerical'] * 4) + ['Categorical']

The second problem was the misleading behavior of the n_jobs parameter. Usually, -1 means "use all CPUs," and it is supposed to work like that in scikit-automl too, but for some reason the current version tries to access a non-existent index in some array and throws an error.

Not a huge problem: I can get the number of available CPUs using the following code and pass it as a parameter.

import multiprocessing

# Number of CPUs on this machine, passed to n_jobs instead of -1
cpus = multiprocessing.cpu_count()

Finally, I got it working.

import sklearn.model_selection
import sklearn.metrics

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,              # total time budget for the search, in seconds
    per_run_time_limit=120,                    # time limit for a single model fit, in seconds
    delete_tmp_folder_after_terminate=True,
    ensemble_memory_limit=12288,               # memory limit (in MB) for ensemble building
    n_jobs=cpus,                               # explicit CPU count instead of -1 (see above)
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5}
)

automl.fit(X_train.copy(), y_train.copy(), feat_type=column_types)
# With resampling_strategy='cv', refit retrains the selected models on the whole training set
automl.refit(X_train.copy(), y_train.copy())

predictions = automl.predict(X_test)

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
