How to turn Pandas data frame into time-series input for RNN

Recently, I had to convert a Pandas data frame into training data for an RNN model. It is not as easy as I expected. I think (I hope…) there is an easier way to do it. If you know it, please send me a message on Twitter.

Input

I have a data frame which contains a customer identifier, a date, and four numeric variables. For every customer, I have a time series of five values (every value consists of four numeric features).

1
2
3
4
5
6
customer_id	feature_1	feature_2	feature_3	feature_4	date
0	0000f46a3911fa3c0805444483337064	0.0	0.0	416.0	0.0	2018-04-30
1	0000f6ccb0745a6a4b88665a16c9f078	0.0	0.0	200.0	0.0	2018-04-30
2	0004aac84e0df4da2b147fca70cf8255	0.0	0.0	167.0	0.0	2018-04-30
3	0004bd2a26a76fe21f786e4fbd80607f	0.0	0.0	25.0	0.0	2018-04-30
4	00050ab1314c0e55a6ca13cf7181fecf	0.0	0.0	10.0	0.0	2018-04-30

Steps

First, I have to scale the values to the range from 0 to 1, because neural networks need scaled data to function correctly.

1
2
3
4
5
6
from sklearn import preprocessing

data['feature_1'] = preprocessing.MinMaxScaler().fit_transform(data['feature_1'].values.reshape(-1, 1))
data['feature_2'] = preprocessing.MinMaxScaler().fit_transform(data['feature_2'].values.reshape(-1, 1))
data['feature_3'] = preprocessing.MinMaxScaler().fit_transform(data['feature_3'].values.reshape(-1, 1))
data['feature_4'] = preprocessing.MinMaxScaler().fit_transform(data['feature_4'].values.reshape(-1, 1))

I must create a dataset of all possible customer id and date pairs. I am going to use Pandas multi-index. I have to define the index as a pair of values: the customer identifier and a date. After that, I drop the index to get the columns in the data frame.

1
2
3
4
5
customers = data['customer_id'].unique()
dates = data['date'].unique()

index = pd.MultiIndex.from_product([customers, dates], names = ["customer_id", "date"])
all_customers_and_dates = pd.DataFrame(index = index).reset_index()

Now, I merge all possible pairs with the data I already have. I do it to make sure that I have the same number of values in every of the time series.

1
merged = pd.merge(all_customers_and_dates, data,  how='left', left_on=['customer_id', 'date'], right_on = ['customer_id', 'date'])

Obviously, I get a lot of empty values. I want to focus on converting the data, so I will replace the empty values with the previous existing value.

But first, I have to be sure that there is a previous value I can use (I don’t want to use the values of a different customer!). I have to select the first date and set the value to zero if there is no data.

1
2
3
4
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_1'])), 'feature_1'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_2'])), 'feature_2'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_3'])), 'feature_3'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_4'])), 'feature_4'] = 0

I use the forward fill function to replace missing values with the previous existing value.

1
merged = merged.fillna(axis = 0, method = 'ffill')

Now, I know that every five consecutive rows contain data of a single customer. Additionally, I know how many customers I have in the data frame (in this case, 95420).

I put every numeric column into a separate variable and reshape it to (95420, 5) (95420 people, five values for every one of them).

1
2
3
4
feature_1 = merged['feature_1'].values.reshape(95420, 5)
feature_2 = merged['feature_2'].values.reshape(95420, 5)
feature_3 = merged['feature_3'].values.reshape(95420, 5)
feature_4 = merged['feature_4'].values.reshape(95420, 5)

Now, I can merge the features into one array and reshape it again, to get data of 95420 people, four features, five different dates.

The problem is that reshape puts dates as “columns,” and we need the dates as “rows.” We can solve the issue by transposing the inner arrays.

1
2
3
reshaped = numpy.hstack(
    (feature_1, feature_2, feature_3, feature_4)
).reshape(95420, 4, 5).transpose(0, 2, 1)

Now I have the data in the correct order: 95420 people, five dates, four features. I can use that 3D array as an input of RNN.

1
2
3
4
5
6
7
8
from keras import layers
from keras.layers import recurrent
from keras.models import Sequential

model = Sequential([
    layers.SimpleRNN(60, input_shape=(5,4)),
    ...
])

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.