The difference between the expanding and rolling window in Pandas

In Pandas, there are two types of window functions. In this article, I am going to demonstrate the difference between them, explain how to choose which function to use, and show you how to deal with datetime in window functions.

First, let’s create a dataset I am going to use as an example.

import pandas as pd
data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns = ['category', 'value'])

Using expanding windows to calculate the cumulative sum

If I use the expanding window with initial size 1, I will create a window that in the first step contains only the first row. In the second step, it contains both the first and the second row. In every step, one additional row is added to the window, and the aggregating function is being recalculated.

data.value.expanding(1).sum()

As a result, I get that output:

0      1.0
1      3.0
2      6.0
3     11.0
4     17.0
5     24.0
6     32.0
7     42.0
8     53.0
9     65.0
10    78.0

Rolling window over n rows

Rolling windows are totally different. In this case, we specify the size of the window which is moving. What happens when I set the rolling window size to 2?

In the first step, it is going to contain the first row and one undefined row, so I am going to get NaN as a result.

In the second step, the window moves and now contains the first and the second row. Now it is possible to calculate the aggregate function. In the case of this example, the sum of both rows.

In the third step, the window moves again and no longer contains the first row. Instead of that now it calculates the sum of the second and the third row.

data.value.rolling(2).sum()

The output:

0      NaN
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0

Avoiding NaN in the results from rolling windows

What to do when I want to get aggregated results in the rows which contain undefined values? I can specify the minimum number of defined values in a window, by using the min_periods parameter:

data.value.rolling(2, min_periods = 1).sum()

The result:

0      1.0
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0

Rolling windows using datetime

Instead of defining the number of rows, it is also possible to use a datetime column as the index and define a window as a time period.

Unfortunately, it is unintuitive and does not work when we use weeks or months as the time period.

Let’s see what is the problem. First, I have to create a new data frame.

data = pd.DataFrame([
    [pd.to_datetime('2019-05-01 00:10:00Z'), 1],
    [pd.to_datetime('2019-05-01 01:00:00Z'), 1],
    [pd.to_datetime('2019-05-01 02:04:12Z'), 1],
    [pd.to_datetime('2019-05-02 10:03:00Z'), 1],
    [pd.to_datetime('2019-05-02 11:21:00Z'), 1],
    [pd.to_datetime('2019-05-02 17:00:00Z'), 1],
    [pd.to_datetime('2019-05-03 03:12:34Z'), 1],
    [pd.to_datetime('2019-05-03 08:23:12Z'), 1],
    [pd.to_datetime('2019-05-04 10:24:10Z'), 1],
    [pd.to_datetime('2019-05-07 19:03:57Z'), 1],
    [pd.to_datetime('2019-05-11 23:59:12Z'), 1]
], columns = ['time', 'value'])

Now, in the window function, I must specify the column which should be used as the index (or have the data frame indexed by the time values) and use the time range to define the size of the rolling window. In this case, I am going to sum the values grouped by days.

data.rolling('D', on = 'time', min_periods = 1).sum()

The result:

time  value
0  2019-05-01 00:10:00+00:00    1.0
1  2019-05-01 01:00:00+00:00    2.0
2  2019-05-01 02:04:12+00:00    3.0
3  2019-05-02 10:03:00+00:00    1.0
4  2019-05-02 11:21:00+00:00    2.0
5  2019-05-02 17:00:00+00:00    3.0
6  2019-05-03 03:12:34+00:00    4.0
7  2019-05-03 08:23:12+00:00    5.0
8  2019-05-04 10:24:10+00:00    1.0
9  2019-05-07 19:03:57+00:00    1.0
10 2019-05-11 23:59:12+00:00    1.0

Do you see the issue? The moving window ends when there is at least 24 hours difference between the values, so I can’t use it to group my example rows by the calendar date.

If I used weeks (or months), it would be even worse, because I would get this error:

<Week: weekday=6> is a non-fixed frequency

The correct way

To get the expected result, (dates grouped by calendar dates + cumulative sum of values within the day) I have to drop the time part of the datetime value, group the rows by the date and calculate the cumulative sum.

# drop the time from datetime
data['without_time'] = data['time'].dt.date

#calculate the cumulative sum by day
sums = data.groupby('without_time').cumsum()
sums.columns = ['cumulative_sum']

#put the result as a column in the original dataframe
pd.concat([data, sums], axis = 1).drop(columns = ['without_time'])

The result:

time  value  cumulative_sum
0  2019-05-01 00:10:00+00:00      1               1
1  2019-05-01 01:00:00+00:00      1               2
2  2019-05-01 02:04:12+00:00      1               3
3  2019-05-02 10:03:00+00:00      1               1
4  2019-05-02 11:21:00+00:00      1               2
5  2019-05-02 17:00:00+00:00      1               3
6  2019-05-03 03:12:34+00:00      1               1
7  2019-05-03 08:23:12+00:00      1               2
8  2019-05-04 10:24:10+00:00      1               1
9  2019-05-07 19:03:57+00:00      1               1
10 2019-05-11 23:59:12+00:00      1               1
Older post

Write everything down

Lessons learnt from "Practical Data Cleaning" by Lee Baker

Newer post

How to get the value by rank from a grouped Pandas dataframe

How to rank a grouped data frame in Pandas