How to turn Pandas data frame into time-series input for RNN

How to turn Pandas data frame into time-series input for RNN

Recently, I had to convert a Pandas data frame into training data for an RNN model. It is not as easy as I expected. I think (I hope…) there is an easier way to do it. If you know it, please send me a message on Twitter.

Input

I have a data frame which contains a customer identifier, a date, and four numeric variables. For every customer, I have a time series of five values (every value consists of four numeric features).

1
2
3
4
5
6
customer_id	feature_1	feature_2	feature_3	feature_4	date
0	0000f46a3911fa3c0805444483337064	0.0	0.0	416.0	0.0	2018-04-30
1	0000f6ccb0745a6a4b88665a16c9f078	0.0	0.0	200.0	0.0	2018-04-30
2	0004aac84e0df4da2b147fca70cf8255	0.0	0.0	167.0	0.0	2018-04-30
3	0004bd2a26a76fe21f786e4fbd80607f	0.0	0.0	25.0	0.0	2018-04-30
4	00050ab1314c0e55a6ca13cf7181fecf	0.0	0.0	10.0	0.0	2018-04-30

Are you interested in data engineering?

Check out my other blog https://easydata.engineering

Steps

First, I have to scale the values to the range from 0 to 1, because neural networks need scaled data to function correctly.

1
2
3
4
5
6
from sklearn import preprocessing

data['feature_1'] = preprocessing.MinMaxScaler().fit_transform(data['feature_1'].values.reshape(-1, 1))
data['feature_2'] = preprocessing.MinMaxScaler().fit_transform(data['feature_2'].values.reshape(-1, 1))
data['feature_3'] = preprocessing.MinMaxScaler().fit_transform(data['feature_3'].values.reshape(-1, 1))
data['feature_4'] = preprocessing.MinMaxScaler().fit_transform(data['feature_4'].values.reshape(-1, 1))

I must create a dataset of all possible customer id and date pairs. I am going to use Pandas multi-index. I have to define the index as a pair of values: the customer identifier and a date. After that, I drop the index to get the columns in the data frame.

1
2
3
4
5
customers = data['customer_id'].unique()
dates = data['date'].unique()

index = pd.MultiIndex.from_product([customers, dates], names = ["customer_id", "date"])
all_customers_and_dates = pd.DataFrame(index = index).reset_index()

Now, I merge all possible pairs with the data I already have. I do it to make sure that I have the same number of values in every of the time series.

1
merged = pd.merge(all_customers_and_dates, data,  how='left', left_on=['customer_id', 'date'], right_on = ['customer_id', 'date'])

Obviously, I get a lot of empty values. I want to focus on converting the data, so I will replace the empty values with the previous existing value.

But first, I have to be sure that there is a previous value I can use (I don’t want to use the values of a different customer!). I have to select the first date and set the value to zero if there is no data.

1
2
3
4
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_1'])), 'feature_1'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_2'])), 'feature_2'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_3'])), 'feature_3'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_4'])), 'feature_4'] = 0

I use the forward fill function to replace missing values with the previous existing value.

1
merged = merged.fillna(axis = 0, method = 'ffill')

Now, I know that every five consecutive rows contain data of a single customer. Additionally, I know how many customers I have in the data frame (in this case, 95420).

I put every numeric column into a separate variable and reshape it to (95420, 5) (95420 people, five values for every one of them).

1
2
3
4
feature_1 = merged['feature_1'].values.reshape(95420, 5)
feature_2 = merged['feature_2'].values.reshape(95420, 5)
feature_3 = merged['feature_3'].values.reshape(95420, 5)
feature_4 = merged['feature_4'].values.reshape(95420, 5)

Now, I can merge the features into one array and reshape it again, to get data of 95420 people, four features, five different dates.

The problem is that reshape puts dates as “columns,” and we need the dates as “rows.” We can solve the issue by transposing the inner arrays.

1
2
3
reshaped = numpy.hstack(
    (feature_1, feature_2, feature_3, feature_4)
).reshape(95420, 4, 5).transpose(0, 2, 1)

Now I have the data in the correct order: 95420 people, five dates, four features. I can use that 3D array as an input of RNN.

1
2
3
4
5
6
7
8
from keras import layers
from keras.layers import recurrent
from keras.models import Sequential

model = Sequential([
    layers.SimpleRNN(60, input_shape=(5,4)),
    ...
])

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you watch programming live streams, check out my YouTube channel.
You can also follow me on Twitter: @mikulskibartosz

If you want to hire me, send me a message on LinkedIn or Twitter.


Bartosz Mikulski
Bartosz Mikulski * data scientist / software/data engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group