XGBoost hyperparameter tuning in Python using grid search

Fortunately, XGBoost implements the scikit-learn API, so its hyperparameters can be tuned with scikit-learn's standard tools, such as GridSearchCV.

I assume that you have already preprocessed the dataset and split it into training and test sets, so I will focus only on the tuning part.

First, we import the XGBoost classifier and scikit-learn's GridSearchCV.

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

After that, we specify the constant parameters of the classifier, which stay the same for every candidate. We need the objective. In this case, I use the “binary:logistic” objective because the classifier distinguishes only two classes. Additionally, I set the number of threads to speed up the training, and the seed of the random number generator to get the same results in every run.

estimator = XGBClassifier(
    objective='binary:logistic',
    nthread=4,
    seed=42
)

In the next step, I specify the tunable parameters and the values to try for each of them.

parameters = {
    'max_depth': range(2, 10),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}
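To get a feeling for the cost of this grid before running it, it may help to expand the ranges and count the combinations. The sketch below uses only the standard library. The n_folds value mirrors the 10-fold cross-validation used in this article, and the variable names are only illustrative:

```python
from itertools import product

parameters = {
    'max_depth': range(2, 10),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05],
}

# Expand each range so the concrete values are visible.
expanded = {name: list(values) for name, values in parameters.items()}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(expanded, combo)) for combo in product(*expanded.values())]

n_folds = 10  # assumption: matches the 10-fold cross-validation used in this article

print(expanded['max_depth'])      # [2, 3, 4, 5, 6, 7, 8, 9]
print(expanded['n_estimators'])   # [60, 100, 140, 180]
print(len(candidates), len(candidates) * n_folds)  # 96 960
```

The number of candidates is the product of the list lengths (8 * 4 * 3), so every additional value in any list multiplies the training time.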

In the last setup step, I configure the GridSearchCV object. It compares the candidates by their ROC AUC score, averaged over 10-fold cross-validation, and picks the hyperparameters with the best average. The n_jobs parameter lets it run up to 10 fits in parallel.

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='roc_auc',
    n_jobs=10,
    cv=10,
    verbose=True
)
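When cv is an integer and the estimator is a classifier, scikit-learn splits the data into stratified folds, so every fold keeps roughly the same class ratio as the whole dataset. The sketch below makes that explicit with a StratifiedKFold splitter. The labels are a made-up imbalanced example, and shuffling with a fixed seed is my own choice here, not something the original configuration does:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# An explicit splitter; shuffling with a fixed seed keeps the folds reproducible.
cv_splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Hypothetical imbalanced labels, only to show that every fold keeps the class ratio.
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((len(labels), 1))  # placeholder features; only the labels drive stratification

# Count the positive samples that land in each validation fold.
positives_per_fold = [
    int(labels[test_idx].sum())
    for _, test_idx in cv_splitter.split(features, labels)
]
print(positives_per_fold)  # each validation fold holds 2 of the 20 positives
```

Such a splitter can be passed to GridSearchCV as cv=cv_splitter instead of the integer.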

Now, we can do the training.

grid_search.fit(X, Y)

Because verbose output is enabled, the search logs its progress while it runs:

Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   11.0s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:   40.1s
[Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:  1.7min
[Parallel(n_jobs=10)]: Done 780 tasks      | elapsed:  3.1min
[Parallel(n_jobs=10)]: Done 960 out of 960 | elapsed:  4.0min finished

The best_estimator_ attribute contains the model refit with the best hyperparameters found by the grid search.

grid_search.best_estimator_


Bartosz Mikulski
Bartosz Mikulski * data scientist / software/data engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group