XGBoost hyperparameter tuning in Python using grid search

Fortunately, XGBoost implements the scikit-learn API, so tuning its hyperparameters is straightforward.

I assume that you have already preprocessed the dataset and split it into training and test sets, so I will focus only on the tuning part.

First, we have to import the XGBoost classifier and GridSearchCV from scikit-learn.

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

After that, we have to specify the constant parameters of the classifier. First, we need the objective function. In this case, I use “binary:logistic” because I train a classifier that handles only two classes. Additionally, I specify the number of threads to speed up the training, and the seed for the random number generator, to get the same results in every run.

estimator = XGBClassifier(
    objective='binary:logistic',
    nthread=4,
    seed=42
)

In the next step, I have to specify the tunable parameters and the range of values.

parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}
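As a sanity check, this grid expands to 8 * 4 * 3 = 96 candidate models (eight max_depth values, four n_estimators values, three learning rates), which matches the candidate count reported during fitting. A quick way to verify that, using scikit-learn's ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}

# ParameterGrid enumerates every combination GridSearchCV will try.
print(len(ParameterGrid(parameters)))  # 96 candidate models
```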


In the last setup step, I configure the GridSearchCV object. I choose the best hyperparameters using the ROC AUC metric to compare the results of 10-fold cross-validation. I also set n_jobs to run 10 fits in parallel.

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='roc_auc',
    n_jobs=10,
    cv=10,
    verbose=True
)

Now, we can do the training. Here, X and Y are the training features and labels.

grid_search.fit(X, Y)

Here are the results:

Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   11.0s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:   40.1s
[Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:  1.7min
[Parallel(n_jobs=10)]: Done 780 tasks      | elapsed:  3.1min
[Parallel(n_jobs=10)]: Done 960 out of 960 | elapsed:  4.0min finished
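Once the search finishes, the best cross-validated score and the winning parameter combination are available directly on the search object, via best_score_ and best_params_. A minimal, self-contained sketch of reading those attributes; it uses scikit-learn's LogisticRegression and a synthetic dataset as stand-ins for the XGBoost setup above, so it runs even without XGBoost installed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data stands in for the real dataset.
X, Y = make_classification(n_samples=200, random_state=42)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='roc_auc',
    cv=3,
)
search.fit(X, Y)

print(search.best_score_)   # mean cross-validated ROC AUC of the best candidate
print(search.best_params_)  # the winning parameter combination, e.g. {'C': ...}
```

The same two attributes work identically on the XGBClassifier grid search above.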

The best_estimator_ field contains the best model found by GridSearchCV.

grid_search.best_estimator_
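Because GridSearchCV refits the best parameter combination on the whole training set by default (refit=True), best_estimator_ is a fitted model that can score held-out data immediately. A self-contained sketch, again using LogisticRegression and synthetic data as stand-ins so it runs without XGBoost installed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, split the same way the article assumes: train + test.
X, Y = make_classification(n_samples=200, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='roc_auc',
    cv=3,
)
search.fit(X_train, Y_train)

# best_estimator_ was refit on all of X_train, so it is ready to predict.
best_model = search.best_estimator_
predictions = best_model.predict(X_test)
print(predictions[:5])
```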

Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group