Recently, I had to convert a Pandas data frame into training data for an RNN model. It is not as easy as I expected. I think (I hope…) there is an easier way to do it. If you know it, please send me a message on Twitter.

Input

I have a data frame which contains a customer identifier, a date, and four numeric variables. For every customer, I have a time series of five values (every value consists of four numeric features).

customer_id	feature_1	feature_2	feature_3	feature_4	date
0000f46a3911fa3c0805444483337064	0.0	0.0	416.0	0.0	2018-04-30
0000f6ccb0745a6a4b88665a16c9f078	0.0	0.0	200.0	0.0	2018-04-30
0004aac84e0df4da2b147fca70cf8255	0.0	0.0	167.0	0.0	2018-04-30
0004bd2a26a76fe21f786e4fbd80607f	0.0	0.0	25.0	0.0	2018-04-30
00050ab1314c0e55a6ca13cf7181fecf	0.0	0.0	10.0	0.0	2018-04-30

Steps

First, I have to scale the values to the range from 0 to 1, because neural networks need scaled data to function correctly.

from sklearn import preprocessing

data['feature_1'] = preprocessing.MinMaxScaler().fit_transform(data['feature_1'].values.reshape(-1, 1))
data['feature_2'] = preprocessing.MinMaxScaler().fit_transform(data['feature_2'].values.reshape(-1, 1))
data['feature_3'] = preprocessing.MinMaxScaler().fit_transform(data['feature_3'].values.reshape(-1, 1))
data['feature_4'] = preprocessing.MinMaxScaler().fit_transform(data['feature_4'].values.reshape(-1, 1))

I must create a dataset of all possible customer id and date pairs. I am going to use Pandas multi-index. I have to define the index as a pair of values: the customer identifier and a date. After that, I drop the index to get the columns in the data frame.

customers = data['customer_id'].unique()
dates = data['date'].unique()

index = pd.MultiIndex.from_product([customers, dates], names = ["customer_id", "date"])
all_customers_and_dates = pd.DataFrame(index = index).reset_index()

Now, I merge all possible pairs with the data I already have. I do it to make sure that I have the same number of values in every of the time series.

merged = pd.merge(all_customers_and_dates, data,  how='left', left_on=['customer_id', 'date'], right_on = ['customer_id', 'date'])

Obviously, I get a lot of empty values. I want to focus on converting the data, so I will replace the empty values with the previous existing value.

But first, I have to be sure that there is a previous value I can use (I don’t want to use the values of a different customer!). I have to select the first date and set the value to zero if there is no data.

merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_1'])), 'feature_1'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_2'])), 'feature_2'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_3'])), 'feature_3'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_4'])), 'feature_4'] = 0

I use the forward fill function to replace missing values with the previous existing value.

merged = merged.fillna(axis = 0, method = 'ffill')

Now, I know that every five consecutive rows contain data of a single customer. Additionally, I know how many customers I have in the data frame (in this case, 95420).

I put every numeric column into a separate variable and reshape it to (95420, 5) (95420 people, five values for every one of them).

feature_1 = merged['feature_1'].values.reshape(95420, 5)
feature_2 = merged['feature_2'].values.reshape(95420, 5)
feature_3 = merged['feature_3'].values.reshape(95420, 5)
feature_4 = merged['feature_4'].values.reshape(95420, 5)

Now, I can merge the features into one array and reshape it again, to get data of 95420 people, four features, five different dates.

The problem is that reshape puts dates as “columns,” and we need the dates as “rows.” We can solve the issue by transposing the inner arrays.

reshaped = numpy.hstack(
    (feature_1, feature_2, feature_3, feature_4)
).reshape(95420, 4, 5).transpose(0, 2, 1)

Now I have the data in the correct order: 95420 people, five dates, four features. I can use that 3D array as an input of RNN.

from keras import layers
from keras.layers import recurrent
from keras.models import Sequential

model = Sequential([
    layers.SimpleRNN(60, input_shape=(5,4)),
    ...
])

How to turn Pandas data frame into time-series input for RNN

Input

Steps

How to automatically select the hyperparameters of a ResNet neural network

Forecasting time series: using lag features

How to turn Pandas data frame into time-series input for RNN

Input

Steps

How to automatically select the hyperparameters of a ResNet neural network

Forecasting time series: using lag features

Related Posts

I worked as a data scientist and that was the worst job I have ever had.

Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models

How to remove outliers from Seaborn boxplot charts