Forecasting time series: using lag features
A lag feature is a fancy name for a variable that contains data from prior time steps. If we have time-series data, we can convert it into tabular rows: every row describes one observation at a point in time and includes the preceding values of that observation.
Let’s assume that I have time-series data about a process. I want to predict the next value of the target variable using the last five values of that variable and other features which describe the process.
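Before working with the real data frame, here is a minimal sketch of what `shift` does in pandas (the `prices` series is made up purely for illustration):

```python
import pandas as pd

# a toy series of six values (made-up data, for illustration only)
prices = pd.DataFrame({'price': [10, 11, 13, 12, 14, 15]})

# shift(1) moves every value one step down, so row t sees the value from t-1
prices['price_lag_1'] = prices['price'].shift(1)
prices['price_lag_2'] = prices['price'].shift(2)

# rows at the beginning have no history yet, so their lag columns contain NaN
print(prices)
```

The same idea, repeated for several shift distances, is what produces the lag features below.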
import pandas as pd
import numpy as np
input_data = pd.DataFrame([
    [1, 5, 12, 4],
    [1, 5, 16, 5],
    [1, 5, 20, 6],
    [1, 5, 8, 3],
    [1, 5, 10, 3.5],
    [1, 5, 22, 6.5],
    [2, 8, 12, 44],
    [2, 8, 10, 33],
    [2, 8, 14, 50],
    [2, 8, 8, 15],
    [2, 8, 0, 0],
    [2, 8, 3, -5]
], columns=['id', 'constant_feature', 'time_dependent_feature', 'target_variable'])
I want to get something which looks like this:
[
# ['target_lag_1', 'target_lag_2', 'target_lag_3', 'target_lag_4',
# 'target_lag_5', 'time_dependent_feature_lag_0',
# 'time_dependent_feature_lag_1', 'time_dependent_feature_lag_2',
# 'time_dependent_feature_lag_3', 'time_dependent_feature_lag_4',
# 'time_dependent_feature_lag_5', 'constant_feature']
  [ 3.5,  3. ,  6. ,  5. ,  4. , 22. , 10. ,  8. , 20. , 16. , 12. ,  5. ],
  [ 0. , 15. , 50. , 33. , 44. ,  3. ,  0. ,  8. , 14. , 10. , 12. ,  8. ]
]
First, I index the data frame by the observation identifier. This makes it easy to retrieve the time-series data of a single observation.
input_data = input_data.set_index('id')
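To see what the index buys us, here is a self-contained sketch (rebuilding the same example data) showing that `.loc` with one identifier returns all rows of that single observation:

```python
import pandas as pd

input_data = pd.DataFrame([
    [1, 5, 12, 4], [1, 5, 16, 5], [1, 5, 20, 6],
    [1, 5, 8, 3], [1, 5, 10, 3.5], [1, 5, 22, 6.5],
    [2, 8, 12, 44], [2, 8, 10, 33], [2, 8, 14, 50],
    [2, 8, 8, 15], [2, 8, 0, 0], [2, 8, 3, -5]
], columns=['id', 'constant_feature', 'time_dependent_feature', 'target_variable'])
input_data = input_data.set_index('id')

# .loc with one identifier selects every time step of that observation
observation_1 = input_data.loc[1]
```

`observation_1` contains the six time steps of the first observation, in their original order.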
Next, I need the identifiers of the unique observations in the data frame.
unique_ids = input_data.index.unique()
I am going to iterate over those identifiers and perform the following steps:
- Select only the time-series data related to that one observation.
- Extract all values of the time-series variables (time_dependent_feature and target_variable).
- Shift the target variable five times to get five lag features and the new dependent feature (the most recent observation).
- Shift the other time-series variable six times to get all lag values of that independent feature.
- Copy the non-time-series variables.
- Split the data frame into the independent features and the dependent feature.
- Store them in arrays that will be used later for feature scaling, splitting into training/validation/test sets, and finally for training a model.
X = []
Y = []
for identifier in unique_ids:
    single_process_data = input_data.loc[identifier]  # 1
    data = single_process_data[['target_variable', 'time_dependent_feature']].copy()  # 2
    data.columns = ['y', 'time_dependent_feature']  # 2
    # last 5 values of the target variable as "lag" features
    # (the most recent value is the dependent feature (y))
    for i in range(1, 6):  # 3
        data['target_lag_{}'.format(i)] = data.y.shift(i)
    # last 6 values of time_dependent_feature as lag features
    for i in range(0, 6):  # 4
        data['time_dependent_feature_lag_{}'.format(i)] = data.time_dependent_feature.shift(i)
    # copy the non-time-series variable
    data['constant_feature'] = single_process_data['constant_feature']  # 5
    # the shift operations leave rows with partial (NaN) lag values; we don't want them
    data = data.dropna()
    y = data.y  # 6
    x = data.drop(['y', 'time_dependent_feature'], axis=1)  # 6
    X.append(np.array(x).flatten())  # 7
    Y.append(y)  # 7
Now, the X variable contains all the independent features, and the Y variable contains the dependent feature.
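As an aside, the explicit loop can be replaced by a vectorized alternative: `groupby('id')` combined with `shift` lags every observation's series at once, without letting values leak across observation boundaries. This is a sketch using the same example data, not a drop-in replacement (it keeps the lagged rows as a flat frame instead of the X/Y lists):

```python
import pandas as pd

input_data = pd.DataFrame([
    [1, 5, 12, 4], [1, 5, 16, 5], [1, 5, 20, 6],
    [1, 5, 8, 3], [1, 5, 10, 3.5], [1, 5, 22, 6.5],
    [2, 8, 12, 44], [2, 8, 10, 33], [2, 8, 14, 50],
    [2, 8, 8, 15], [2, 8, 0, 0], [2, 8, 3, -5]
], columns=['id', 'constant_feature', 'time_dependent_feature', 'target_variable'])

df = input_data.copy()

# groupby + shift lags each observation's series independently,
# so lag values never cross from one id to another
for i in range(1, 6):
    df['target_lag_{}'.format(i)] = df.groupby('id')['target_variable'].shift(i)
for i in range(0, 6):
    df['time_dependent_feature_lag_{}'.format(i)] = \
        df.groupby('id')['time_dependent_feature'].shift(i)

# rows without a full history contain NaN lags; drop them
df = df.dropna()
y = df['target_variable']
x = df.drop(['id', 'target_variable', 'time_dependent_feature'], axis=1)
```

For two observations the difference is negligible, but with many observations the vectorized version avoids the per-identifier Python loop.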