Nested cross-validation in time series forecasting using Scikit-learn and Statsmodels

In this blog post, I am going to describe how to measure the performance of a time-series forecasting model using a variant of cross-validation called “nested cross-validation.” As an example, I am going to use the ARMA model from the Statsmodels library.

Cross-validation in time series forecasting

In the case of time series, cross-validation is not trivial. I cannot pick random samples and assign them to either the test set or the train set, because it makes no sense to use values from the future to forecast values in the past. There is a temporal dependency between observations, and we must preserve it during testing.

Before we start cross-validation, we must split the dataset into the cross-validation subset and the test set. In my example, I have a dataset of 309 observations and I am going to use 20% of them as the test set (aka the holdout set).

# the first 247 observations (80%) for cross-validation, the remaining 62 as the holdout set
cross_validation = values[:247]
test = values[247:]
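The original post does not show how `values` is loaded, so here is a minimal sketch of the split, with a random pandas Series standing in for the real dataset (the series itself and the `split_point` variable are my assumptions):

```python
import numpy as np
import pandas as pd

# A stand-in for the real dataset: 309 observations, as in the post.
values = pd.Series(np.random.default_rng(0).standard_normal(309))

# Keep the first 80% for cross-validation, the last 20% as the holdout set.
split_point = int(len(values) * 0.8)  # int(309 * 0.8) = 247
cross_validation = values[:split_point]
test = values[split_point:]

print(len(cross_validation), len(test))  # 247 62
```

Computing the split point from the length (instead of hard-coding 247) keeps the 80/20 ratio if the dataset grows.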

Nested cross-validation

The idea of cross-validation should be more straightforward to grasp when we look at an example. Imagine that I have only 5 observations in my cross-validation set and I want to perform 4-fold cross-validation.

Here is my dataset: [1, 2, 3, 4, 5]

What I want to do is create 4 pairs of training/test sets that follow these two rules:

  • every test set contains unique observations

  • observations from the training set occur before their corresponding test set

There is only one way to generate such pairs from my dataset. As a result, I get 4 pairs of training/test sets:

  • Training: [1] Test: [2]

  • Training: [1, 2] Test: [3]

  • Training: [1, 2, 3] Test: [4]

  • Training: [1, 2, 3, 4] Test: [5]

Fortunately, I don’t need to do this by hand, because Scikit-learn provides the TimeSeriesSplit class, which generates those pairs for me.
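We can verify that TimeSeriesSplit produces exactly the four pairs listed above by running it on the toy dataset:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.array([1, 2, 3, 4, 5])
tscv = TimeSeriesSplit(n_splits=4)

# split() yields index arrays; each test fold contains one observation,
# and every training fold ends right before its test fold starts.
for train_index, test_index in tscv.split(data):
    print("Training:", data[train_index].tolist(), "Test:", data[test_index].tolist())

# Training: [1] Test: [2]
# Training: [1, 2] Test: [3]
# Training: [1, 2, 3] Test: [4]
# Training: [1, 2, 3, 4] Test: [5]
```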

After generating the training/test sets, I am going to fit an ARMA model and make a prediction. I store the root mean squared error of the prediction in the “rmse” array. After the last test, I am going to calculate the average error.

from math import sqrt

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=4)
rmse = []

for train_index, test_index in tscv.split(cross_validation):
    cv_train, cv_test = cross_validation.iloc[train_index], cross_validation.iloc[test_index]
    # fit an ARMA(2, 2) model on the training fold
    arma = sm.tsa.ARMA(cv_train, (2, 2)).fit(disp=False)
    # forecast over the index range covered by the test fold
    predictions = arma.predict(cv_test.index.values[0], cv_test.index.values[-1])
    true_values = cv_test.values
    rmse.append(sqrt(mean_squared_error(true_values, predictions)))

print("RMSE: {}".format(np.mean(rmse)))
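The same harness works with any forecasting method, which makes it easy to sanity-check. Here is the loop with a naive “repeat the last training value” forecast in place of ARMA, on a synthetic series (both the naive baseline and the synthetic data are my additions, not part of the original post):

```python
from math import sqrt

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

series = np.sin(np.linspace(0, 20, 100))  # synthetic stand-in for the real data
splitter = TimeSeriesSplit(n_splits=4)
baseline_rmse = []

for train_index, test_index in splitter.split(series):
    cv_train, cv_test = series[train_index], series[test_index]
    # Naive forecast: predict the last observed training value for every test step.
    predictions = np.full(len(cv_test), cv_train[-1])
    baseline_rmse.append(sqrt(mean_squared_error(cv_test, predictions)))

print("Baseline RMSE: {}".format(np.mean(baseline_rmse)))
```

A tuned model should beat this baseline on every fold; if it does not, the extra complexity is not paying off.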

Holdout set

Now, I can tweak the parameters of the ARMA model as long as I want.

When I am satisfied with the result, I can use the test set created in the first code snippet to calculate the final error metric of the model.

arma = sm.tsa.ARMA(cross_validation, (2, 2)).fit(disp=False)
predictions = arma.predict(test.index.values[0], test.index.values[-1])
true_values = test.values
print(sqrt(mean_squared_error(true_values, predictions)))


Bartosz Mikulski

  • MLOps engineer by day
  • AI and data engineering consultant by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
