Forecasting time series: using lag features

A lag feature is a fancy name for a variable that contains data from prior time steps. If we have time-series data, we can convert it into rows: every row contains data about one observation and includes the previous values of that observation's time-series variables.
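The building block for lag features in pandas is the `shift` method. A minimal illustration (using made-up values, not the data set below):

```python
import pandas as pd

# Shifting a series by one step turns each value into a "lag 1" feature:
# every row now carries the value from the previous time step.
s = pd.Series([4, 5, 6, 3])
print(s.shift(1).tolist())  # [nan, 4.0, 5.0, 6.0] - the first row has no predecessor
```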

Let’s assume that I have time-series data about a process. I want to predict the next value of the target variable using the last five values of that variable and other features which describe the process.

import pandas as pd
import numpy as np

input_data = pd.DataFrame([
    [1, 5, 12, 4],
    [1, 5, 16, 5],
    [1, 5, 20, 6],
    [1, 5, 8, 3],
    [1, 5, 10, 3.5],
    [1, 5, 22, 6.5],
    [2, 8, 12, 44],
    [2, 8, 10, 33],
    [2, 8, 14, 50],
    [2, 8, 8, 15],
    [2, 8, 0, 0],
    [2, 8, 3, -5]
], columns = ['id', 'constant_feature', 'time_dependent_feature', 'target_variable'])

I want to get something which looks like this:

[
    #  ['target_lag_1', 'target_lag_2', 'target_lag_3', 'target_lag_4',
    #   'target_lag_5', 'time_dependent_feature_lag_0',
    #   'time_dependent_feature_lag_1', 'time_dependent_feature_lag_2',
    #   'time_dependent_feature_lag_3', 'time_dependent_feature_lag_4',
    #   'time_dependent_feature_lag_5', 'constant_feature']
    [ 3.5,  3. ,  6. ,  5. ,  4. , 22. , 10. ,  8. , 20. , 16. , 12. ,  5. ],
    [ 0. , 15. , 50. , 33. , 44. ,  3. ,  0. ,  8. , 14. , 10. , 12. ,  8. ]
]

First, I have to index the data frame by the observation identifier. That makes it easy to retrieve the time-series data of a single observation.

input_data = input_data.set_index('id')
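As a quick sanity check (on a hypothetical, smaller frame with the same shape), indexing by `id` lets `.loc` pull every row that belongs to one observation at once:

```python
import pandas as pd

# A toy frame with two observations, indexed by 'id' as above.
df = pd.DataFrame(
    [[1, 12], [1, 16], [1, 20], [2, 10]],
    columns=['id', 'time_dependent_feature']
).set_index('id')

print(len(df.loc[1]))  # 3 rows belong to observation 1
```

Note that `.loc` returns a DataFrame when several rows match the identifier but a Series when only one row does, which is worth keeping in mind if an observation can have a single time step.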

Next, I need the unique observation identifiers from the data frame.

unique_ids = input_data.index.unique()

I need them because I am going to iterate over those identifiers and perform the following steps:

  1. Select only the time-series data related to that one observation.

  2. Extract all values of the time-series variables (time_dependent_feature and target_variable).

  3. Shift the target variable five times to get five lag features; the most recent (unshifted) value becomes the dependent variable.

  4. Shift the other time-series variable six times (shifts 0 through 5) to get all lag values of that independent feature.

  5. Copy the non-time-series variables.

  6. Split the data frame into the independent features and the dependent features.

  7. Store them in arrays that will be used later for feature scaling, splitting into training/validation/test sets, and finally for the training of a model.

X = []
Y = []

for identifier in unique_ids:
    single_process_data = input_data.loc[identifier] #1

    data = single_process_data[['target_variable', 'time_dependent_feature']].copy() #2
    data.columns = ['y', 'time_dependent_feature'] #2

    # last 5 values of the target variable as "lag" variables (the most recent one is the dependent feature (y))
    for i in range(1, 6): #3
        data['target_lag_{}'.format(i)] = data.y.shift(i)

    # last 6 values of the time-dependent feature (including the current value, shift 0) as "lag" variables
    for i in range(0, 6): #4
        data['time_dependent_feature_lag_{}'.format(i)] = data.time_dependent_feature.shift(i)

    # copy the constant (non-time-series) feature
    data['constant_feature'] = single_process_data['constant_feature'] #5

    # the shift operations in the loops create rows with missing values (NaN); drop them
    data = data.dropna()
    y = data.y #6
    x = data.drop(['y', 'time_dependent_feature'], axis=1) #6

    X.append(np.array(x).flatten()) #7
    Y.append(np.array(y).flatten()) #7

Now, the X variable contains all the independent features, and the Y variable contains the dependent feature.
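As a design note, the loop over identifiers can also be expressed with `groupby` and `shift`, which builds the same kind of lag columns for all observations at once. This is a sketch on a reduced example (two lags, one feature), not the exact pipeline above:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'target_variable': [4, 5, 6, 44, 33, 50],
})

# groupby('id') makes shift() operate within each observation, so lag values
# never leak from one observation's series into another's.
for i in range(1, 3):
    df['target_lag_{}'.format(i)] = df.groupby('id')['target_variable'].shift(i)

print(df['target_lag_1'].tolist())  # [nan, 4.0, 5.0, nan, 44.0, 33.0]
```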
