Using scikit-automl for building a classification model

I decided to give scikit-automl a try. After all, if it works well, it allows me to focus on the exciting part of machine learning — feature engineering and getting more data.

We already automate a considerable part of model building. Nobody sets hyperparameters manually anymore; every project uses either GridSearch or RandomSearch. Why can't we also automate selecting the right algorithm and applying basic preprocessing?
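For context, the kind of tuning we already automate looks like this (a minimal sketch; the random forest and the parameter grid are my illustration, not part of this project):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# try every combination from a small, illustrative grid with 5-fold cross-validation
param_grid = {'n_estimators': [100, 300], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
# search.fit(X, y) picks the best combination automatically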

It turns out we can, but it is not as easy as promised. At least not in scikit-automl. I decided to start with the classic “tutorial dataset”: Titanic.

The first problem I encountered was installing the right packages. That one was quite easy to fix, and I wrote another blog post about it.

Once I was able to import the package, I assumed I could have both numerical and categorical variables in my dataset. I kind of can. It is possible to specify the feature type of every column, but only as long as all the columns are already numerical. The parameter tells scikit-automl how to preprocess each variable; it does not automatically convert text to a numeric representation.

That was disappointing, but not as disappointing as the lack of ColumnTransformer support. It turned out that scikit-automl supports only scikit-learn versions between 0.19 and 0.20, so my scikit-learn got downgraded during installation, and instead of ColumnTransformer, I had to use LabelEncoder.
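For comparison, here is what that preprocessing could look like with a newer scikit-learn (a sketch, not runnable against the downgraded version; OrdinalEncoder stands in for LabelEncoder because it works on feature columns, and missing values would still need handling first):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# encode the text columns, pass the numeric ones through unchanged
preprocessor = ColumnTransformer(
    [('categorical', OrdinalEncoder(), ['Pclass', 'Sex', 'Embarked'])],
    remainder='passthrough'
)
# preprocessor.fit_transform(X) would return an all-numeric array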

I converted the text to labels and continued playing with the tool.

import pandas as pd

# load the training data and drop the columns I don't want as features
data = pd.read_csv('../input/train.csv')
X = data.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'])
y = data['Survived']

# Embarked contains missing values; cast everything to str so LabelEncoder accepts it
X['Embarked'] = X['Embarked'].apply(str)

from sklearn.preprocessing import LabelEncoder

# encode each text column in place, keeping a separate encoder per column
pclass_encoder = LabelEncoder()
X['Pclass'] = pclass_encoder.fit_transform(X['Pclass'])

sex_encoder = LabelEncoder()
X['Sex'] = sex_encoder.fit_transform(X['Sex'])

embarked_encoder = LabelEncoder()
X['Embarked'] = embarked_encoder.fit_transform(X['Embarked'])

# one type per column, in column order:
# Pclass, Sex -> Categorical; Age, SibSp, Parch, Fare -> Numerical; Embarked -> Categorical
column_types = (['Categorical'] * 2) + (['Numerical'] * 4) + ['Categorical']
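One benefit of keeping a separate encoder per column: if I later wanted to score unseen data, for example Kaggle's test.csv (hypothetical here; this post only uses train.csv), the already-fitted encoders could be reused with transform instead of being fitted again:

# hypothetical: encode Kaggle's test set with the encoders fitted above
test = pd.read_csv('../input/test.csv')
test['Embarked'] = test['Embarked'].apply(str)
test['Pclass'] = pclass_encoder.transform(test['Pclass'])
test['Sex'] = sex_encoder.transform(test['Sex'])
test['Embarked'] = embarked_encoder.transform(test['Embarked'])
# transform raises a ValueError if it sees a label that was not in the training data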

The second problem was the misleading behavior of the n_jobs parameter. Usually, -1 means “use all CPUs,” and it should work the same way in scikit-automl, but for some reason, the current version tries to access a non-existent index in some array and throws an error.

Not a huge problem: I can get the number of available CPUs using the following code and pass it as a parameter.

import multiprocessing

# number of CPUs available on this machine, to be passed as n_jobs
cpus = multiprocessing.cpu_count()

Finally, I got it working.

import sklearn.model_selection
import sklearn.metrics

# hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,        # total budget: one hour
    per_run_time_limit=120,              # at most two minutes per candidate model
    delete_tmp_folder_after_terminate=True,
    ensemble_memory_limit=12288,         # memory limit (in MB) for building the ensemble
    n_jobs=cpus,                         # -1 is broken, so pass the CPU count explicitly
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5}   # 5-fold cross-validation
)

# fit using 5-fold CV; column_types tells scikit-automl which columns are categorical
automl.fit(X_train.copy(), y_train.copy(), feat_type=column_types)
# with the 'cv' strategy, refit retrains the final ensemble on the whole training set
automl.refit(X_train.copy(), y_train.copy())

predictions = automl.predict(X_test)

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))