How to split a data frame into time-series for LSTM deep neural network
In this article, I am going to show how to prepare a Pandas data frame to use it as an input for a recurrent neural network (for example, LSTM).
As an example, I am going to use a data set of Bitcoin prices. My goal is to train a neural network to use data from the current day + the past 29 days (so 30 days in total) to predict the BTC price during the next seven days.
In my input data frame, I have average daily prices indexed by date.
1 2 3 4 5 6 7 8 btc_prices.head() #date #2011-12-31 4.471603 #2012-01-01 4.806667 #2012-01-02 5.000000 #2012-01-03 5.252500 #2012-01-04 5.208159
Now, I am going to define a function which takes as arguments the data frame to be split, the number of days to be used as independent features (input), and the number of days which are the dependent features (target).
Prepare historical and future data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 def build_training_data(dataset, history_size = 30, target_size = 7): start_index = history_size end_index = len(dataset) - target_size data =  labels =  for i in range(start_index, end_index): indices = range(i - history_size, i, 1) data.append(dataset[indices]) labels.append(dataset[i:i + target_size]) data = np.array(data) labels = np.array(labels) return data, labels dataset = btc_prices.values data, labels = build_training_data(dataset)
To show that the function works correctly, I pick a day (in this case, the 2000th day in the data frame) and plot both the data for that date and the labels.
1 2 3 4 5 6 7 8 9 10 11 12 13 plt.rcParams["figure.figsize"] = (20,10) day = 2000 plt.plot(range(0, 30), data[day], 'k-', label = 'Historical data (today + last 29 days)') plt.plot(range(29, 37), np.concatenate([data[day][29:], labels[day]]), 'g-', label = 'To be predicted (next 7 days)') plt.axvline(x=29, c = 'r') plt.plot(29, data[day], 'rx', markersize = 20, label = "Today") plt.xticks(range(0, 37), range(-29, 7)) plt.title('Training data with labels') plt.xlabel('Days') plt.ylabel('BTC price in USD') plt.legend() plt.show()
The part that is not shown here
Remember to split the data into training, validation, and test data frame. Additionally, we must normalize all data (using the mean and standard deviation of the training set).
Preparing LSTM input
Before I can use it as the input for LSTM, I have to reshape the values. I have values of a single feature for 30 days, so the correct shape of the input data frame is (-1, 30, 1).
The label data frame contains seven values of a single feature, so its shape is (-1, 7, 1).
1 2 3 4 5 6 7 8 x_train = x_train.reshape((-1, 30, 1)) y_train = y_train.reshape((-1, 7)) x_validation = x_validation.reshape((-1, 30, 1)) y_validation = y_validation.reshape((-1, 7)) x_test = x_test.reshape((-1, 30, 1)) y_test = y_test.reshape((-1, 7))
To configure the neural network correctly, we have to set (30, 1) as the shape of the input layer and use seven neurons in the output layer.
1 2 3 4 5 tf.keras.models.Sequential([ tf.keras.layers.LSTM(24, input_shape=(30, 1), return_sequences = True), # put the rest of the layers here tf.keras.layers.Dense(7, activation='linear') ])
Did you enjoy reading this article?
Would you like to learn more about leveraging AI to drive growth and innovation, software craft in data engineering, and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- MLOps engineer by day
- AI and data engineering consultant by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz
- Mastodon: @firstname.lastname@example.org