The difference between the expanding and rolling window in Pandas

In Pandas, there are two types of window functions: expanding and rolling. In this article, I am going to demonstrate the difference between them, explain how to choose which one to use, and show you how to deal with datetime in window functions.

First, let’s create a dataset I am going to use as an example.

import pandas as pd

# example data: three categories with a few increasing values each
data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns = ['category', 'value'])

Using expanding windows to calculate the cumulative sum

If I use an expanding window with a minimum size of 1 (the min_periods argument), I create a window that in the first step contains only the first row. In the second step, it contains both the first and the second row. In every step, one additional row is added to the window, and the aggregate function is recalculated.

data.value.expanding(1).sum()

As a result, I get this output:

0      1.0
1      3.0
2      6.0
3     11.0
4     17.0
5     24.0
6     32.0
7     42.0
8     53.0
9     65.0
10    78.0
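
For the sum, an expanding window is equivalent to a cumulative sum, so the same numbers (returned as int64 instead of float64) also come from the built-in cumsum function:

# equivalent to data.value.expanding(1).sum(), but keeps the integer dtype
data.value.cumsum()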

Rolling window over n rows

Rolling windows are totally different. In this case, I specify the size of a window, and that window moves over the rows. What happens when I set the rolling window size to 2?

In the first step, the window contains the first row and one undefined row, so I am going to get NaN as a result.

In the second step, the window moves and now contains the first and the second row, so it is possible to calculate the aggregate function; in this example, the sum of both rows.

In the third step, the window moves again and no longer contains the first row. Instead, it calculates the sum of the second and the third row.

data.value.rolling(2).sum()

The output:

0      NaN
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0
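
Because the window always covers exactly two rows, every result is just the current value plus the previous one. As a sanity check (not part of the original example), shift reproduces the same series:

# the rolling sum of 2 equals the current value plus the previous one;
# the first row has no predecessor, hence the NaN
data.value + data.value.shift(1)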

Avoiding NaN in the results from rolling windows

What can I do when I want aggregated results even in the rows whose window contains undefined values? I can specify the minimum number of defined values in a window by using the min_periods parameter:

data.value.rolling(2, min_periods = 1).sum()

The result:

0      1.0
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0
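
The data also contains a category column, and the same rolling logic can be restarted within each group. The original example does not cover this, but a minimal sketch using groupby could look like this:

# rolling sum of 2 within each category; the window never crosses group
# boundaries, so the first row of every category falls back to its own
# value (min_periods = 1)
data.groupby('category')['value'].rolling(2, min_periods = 1).sum()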

Rolling windows using datetime

Instead of defining the number of rows, it is also possible to use a datetime column as the index and define a window as a time period.

Unfortunately, this behavior is unintuitive, and it does not work at all when we use weeks or months as the time period.

Let’s see what the problem is. First, I have to create a new data frame.

data = pd.DataFrame([
    [pd.to_datetime('2019-05-01 00:10:00Z'), 1],
    [pd.to_datetime('2019-05-01 01:00:00Z'), 1],
    [pd.to_datetime('2019-05-01 02:04:12Z'), 1],
    [pd.to_datetime('2019-05-02 10:03:00Z'), 1],
    [pd.to_datetime('2019-05-02 11:21:00Z'), 1],
    [pd.to_datetime('2019-05-02 17:00:00Z'), 1],
    [pd.to_datetime('2019-05-03 03:12:34Z'), 1],
    [pd.to_datetime('2019-05-03 08:23:12Z'), 1],
    [pd.to_datetime('2019-05-04 10:24:10Z'), 1],
    [pd.to_datetime('2019-05-07 19:03:57Z'), 1],
    [pd.to_datetime('2019-05-11 23:59:12Z'), 1]
], columns = ['time', 'value'])

Now, in the window function, I must specify the column which should be used as the index (or have the data frame indexed by the time values) and use a time range to define the size of the rolling window. In this case, I want to sum the values within each day.

data.rolling('D', on = 'time', min_periods = 1).sum()

The result:

time  value
0  2019-05-01 00:10:00+00:00    1.0
1  2019-05-01 01:00:00+00:00    2.0
2  2019-05-01 02:04:12+00:00    3.0
3  2019-05-02 10:03:00+00:00    1.0
4  2019-05-02 11:21:00+00:00    2.0
5  2019-05-02 17:00:00+00:00    3.0
6  2019-05-03 03:12:34+00:00    4.0
7  2019-05-03 08:23:12+00:00    5.0
8  2019-05-04 10:24:10+00:00    1.0
9  2019-05-07 19:03:57+00:00    1.0
10 2019-05-11 23:59:12+00:00    1.0

Do you see the issue? For every row, the window covers the preceding 24 hours rather than the calendar day, so I can’t use it to group my example rows by the calendar date.
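
A quick way to confirm that behavior (my addition, not part of the original example): passing an explicit 24-hour Timedelta produces exactly the same output as 'D':

# 'D' behaves like a fixed 24-hour span measured backwards from every row
data.rolling(pd.Timedelta(hours = 24), on = 'time', min_periods = 1).sum()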

If I used weeks (or months), it would be even worse, because I would get this error:

<Week: weekday=6> is a non-fixed frequency

The correct way

To get the expected result (rows grouped by the calendar date, plus a cumulative sum of the values within each day), I have to drop the time part of the datetime value, group the rows by the date, and calculate the cumulative sum.

# drop the time from the datetime
data['without_time'] = data['time'].dt.date

# calculate the cumulative sum by day; select the 'value' column explicitly
# so that cumsum is not applied to the datetime column
sums = data.groupby('without_time')['value'].cumsum()
sums.name = 'cumulative_sum'

# put the result as a column in the original data frame
pd.concat([data, sums], axis = 1).drop(columns = ['without_time'])

The result:

time  value  cumulative_sum
0  2019-05-01 00:10:00+00:00      1               1
1  2019-05-01 01:00:00+00:00      1               2
2  2019-05-01 02:04:12+00:00      1               3
3  2019-05-02 10:03:00+00:00      1               1
4  2019-05-02 11:21:00+00:00      1               2
5  2019-05-02 17:00:00+00:00      1               3
6  2019-05-03 03:12:34+00:00      1               1
7  2019-05-03 08:23:12+00:00      1               2
8  2019-05-04 10:24:10+00:00      1               1
9  2019-05-07 19:03:57+00:00      1               1
10 2019-05-11 23:59:12+00:00      1               1
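
The helper column is not strictly required. A shorter, equivalent sketch (my variation, not the article’s original code) groups directly on the date, and the same pattern extends to the weekly or monthly periods that the rolling window rejected:

# cumulative sum per calendar day, without the helper column
data['cumulative_sum'] = data.groupby(data['time'].dt.date)['value'].cumsum()

# the same idea for months; the timezone is removed first because
# converting tz-aware timestamps to periods warns about dropping it
monthly = data.groupby(data['time'].dt.tz_localize(None).dt.to_period('M'))['value'].cumsum()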
