Pandas Expanding Aggregation
When analyzing time series data, you'll often need cumulative statistics — calculations that incorporate all previous values up to the current point. This is where pandas' expanding window functionality comes in handy. Unlike rolling windows, which maintain a fixed size, expanding windows grow as they progress through the data.
What is an Expanding Window?
An expanding window starts with a minimum number of observations and then expands to include all previous observations for each calculation point. This is particularly useful when you want to:
- Calculate cumulative statistics
- Analyze how metrics evolve over time
- Smooth data by considering all historical information
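To make the idea concrete, here's a minimal plain-Python sketch of how the windows grow: each expanding window is simply a prefix of the data, one observation longer than the previous one.

```python
# Each expanding window is a prefix of the data, one element longer
# than the previous window.
data = [3, 7, 5, 9, 4]

for i in range(1, len(data) + 1):
    window = data[:i]  # all observations up to and including this point
    print(window, "-> mean:", sum(window) / len(window))
```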
Basic Syntax
The basic syntax for expanding window operations in pandas is:
```python
df.expanding(min_periods=1).aggregate_function()
```
Where:
- `min_periods`: the minimum number of observations required to have a value (default is 1)
- `aggregate_function`: the aggregation to apply, e.g. `mean()`, `sum()`, or `std()`
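As a side note, for some common aggregations an expanding calculation coincides with pandas' cumulative methods; for instance, on data without missing values, `expanding().sum()` matches `cumsum()` and `expanding().max()` matches `cummax()`. A quick sanity check (using float data so the result dtypes agree):

```python
import pandas as pd

s = pd.Series([3.0, 7.0, 5.0, 9.0, 4.0])

# With no missing values, expanding aggregations match the
# corresponding cumulative methods.
print(s.expanding().sum().equals(s.cumsum()))  # True
print(s.expanding().max().equals(s.cummax()))  # True
```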
Simple Expanding Window Examples
Let's start with some basic examples to understand how expanding windows work:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a simple time series
dates = pd.date_range('20230101', periods=10)
df = pd.DataFrame({
    'value': [3, 7, 5, 9, 4, 8, 6, 10, 2, 5]
}, index=dates)

print("Original data:")
print(df)
```
Output:
```
Original data:
            value
2023-01-01      3
2023-01-02      7
2023-01-03      5
2023-01-04      9
2023-01-05      4
2023-01-06      8
2023-01-07      6
2023-01-08     10
2023-01-09      2
2023-01-10      5
```
Expanding Mean
Let's calculate the expanding (cumulative) mean of our data:
```python
df['exp_mean'] = df['value'].expanding().mean()

print("\nExpanding Mean:")
print(df)
```
Output:
```
Expanding Mean:
            value  exp_mean
2023-01-01      3  3.000000
2023-01-02      7  5.000000
2023-01-03      5  5.000000
2023-01-04      9  6.000000
2023-01-05      4  5.600000
2023-01-06      8  6.000000
2023-01-07      6  6.000000
2023-01-08     10  6.500000
2023-01-09      2  6.000000
2023-01-10      5  5.900000
```
In this example, each value in the `exp_mean` column is the average of all values in the `value` column up to and including the current row.
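You can confirm this by hand: the expanding mean is just the cumulative sum divided by the number of observations seen so far. A quick check using the `df` built above:

```python
# Expanding mean == cumulative sum / number of observations so far
manual_mean = df['value'].cumsum() / np.arange(1, len(df) + 1)
print(np.allclose(manual_mean, df['exp_mean']))  # True
```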
Expanding Sum and Standard Deviation
We can also calculate other statistics, such as cumulative sum and standard deviation:
```python
df['exp_sum'] = df['value'].expanding().sum()
df['exp_std'] = df['value'].expanding().std()

print("\nExpanding Sum and Standard Deviation:")
print(df)
```
Output:
```
Expanding Sum and Standard Deviation:
            value  exp_mean  exp_sum   exp_std
2023-01-01      3  3.000000        3       NaN
2023-01-02      7  5.000000       10  2.828427
2023-01-03      5  5.000000       15  2.000000
2023-01-04      9  6.000000       24  2.581989
2023-01-05      4  5.600000       28  2.408319
2023-01-06      8  6.000000       36  2.366432
2023-01-07      6  6.000000       42  2.160247
2023-01-08     10  6.500000       52  2.618615
2023-01-09      2  6.000000       54  2.828427
2023-01-10      5  5.900000       59  2.685144
```
Notice that the standard deviation for the first row is `NaN`, because at least two observations are needed to compute a standard deviation.
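Under the hood, `expanding().std()` uses the sample standard deviation (`ddof=1` by default), which is undefined for a single observation. You can verify the second row's value by hand:

```python
# The expanding std is the sample standard deviation (ddof=1), so a
# window of size 1 gives NaN and the window [3, 7] gives:
print(np.std([3, 7], ddof=1))  # 2.8284271247461903
```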
Visualizing Expanding Window Results
Plotting the original data alongside the expanding statistics can help you see the effect:
```python
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['value'], label='Original Data', marker='o')
plt.plot(df.index, df['exp_mean'], label='Expanding Mean', linestyle='--')
plt.title('Original Data vs Expanding Mean')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
Using Custom Functions with Expanding Windows
You can also apply custom functions to expanding windows using the `apply()` method:
```python
# Define a custom function to calculate the range (max - min)
def data_range(x):
    return x.max() - x.min()

# Apply the custom function
df['exp_range'] = df['value'].expanding().apply(data_range)

print("\nExpanding Range (Max - Min):")
print(df[['value', 'exp_range']])
```
Output:
```
Expanding Range (Max - Min):
            value  exp_range
2023-01-01      3        0.0
2023-01-02      7        4.0
2023-01-03      5        4.0
2023-01-04      9        6.0
2023-01-05      4        6.0
2023-01-06      8        6.0
2023-01-07      6        6.0
2023-01-08     10        7.0
2023-01-09      2        8.0
2023-01-10      5        8.0
```
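Since `data_range()` only uses operations that also work on plain NumPy arrays, you can pass `raw=True` so each window arrives as an ndarray instead of a Series, which usually speeds up `apply()`:

```python
# raw=True passes each window as a NumPy array instead of a Series,
# avoiding per-window Series construction overhead.
df['exp_range_fast'] = df['value'].expanding().apply(data_range, raw=True)

# Same result as the raw=False version above.
print(df['exp_range_fast'].equals(df['exp_range']))  # True
```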
Multiple Aggregations at Once
You can compute multiple expanding aggregations at once using the `agg()` method:
```python
# Calculate multiple statistics at once
multiple_stats = df['value'].expanding().agg(['mean', 'sum', 'std', 'min', 'max'])

print("\nMultiple Expanding Aggregations:")
print(multiple_stats)
```
Output:
```
Multiple Expanding Aggregations:
                mean   sum       std  min   max
2023-01-01  3.000000   3.0       NaN  3.0   3.0
2023-01-02  5.000000  10.0  2.828427  3.0   7.0
2023-01-03  5.000000  15.0  2.000000  3.0   7.0
2023-01-04  6.000000  24.0  2.581989  3.0   9.0
2023-01-05  5.600000  28.0  2.408319  3.0   9.0
2023-01-06  6.000000  36.0  2.366432  3.0   9.0
2023-01-07  6.000000  42.0  2.160247  3.0   9.0
2023-01-08  6.500000  52.0  2.618615  3.0  10.0
2023-01-09  6.000000  54.0  2.828427  2.0  10.0
2023-01-10  5.900000  59.0  2.685144  2.0  10.0
```
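The same interface works on whole DataFrames: `agg()` also accepts a dict mapping each column to its own aggregation. A small sketch with a hypothetical two-column frame:

```python
import pandas as pd

# agg() with a dict applies a different aggregation to each column.
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
print(df2.expanding().agg({'a': 'mean', 'b': 'sum'}))
```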
Setting Minimum Observations with min_periods
The `min_periods` parameter controls the minimum number of observations required before a result is produced:
```python
# Set min_periods to 3
df['exp_mean_min3'] = df['value'].expanding(min_periods=3).mean()

print("\nExpanding Mean with min_periods=3:")
print(df[['value', 'exp_mean', 'exp_mean_min3']])
```
Output:
```
Expanding Mean with min_periods=3:
            value  exp_mean  exp_mean_min3
2023-01-01      3  3.000000            NaN
2023-01-02      7  5.000000            NaN
2023-01-03      5  5.000000       5.000000
2023-01-04      9  6.000000       6.000000
2023-01-05      4  5.600000       5.600000
2023-01-06      8  6.000000       6.000000
2023-01-07      6  6.000000       6.000000
2023-01-08     10  6.500000       6.500000
2023-01-09      2  6.000000       6.000000
2023-01-10      5  5.900000       5.900000
```
Notice that the first two values of `exp_mean_min3` are `NaN` because we required at least 3 observations.
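One detail worth knowing: `min_periods` counts valid (non-NaN) observations rather than rows, so missing values don't count toward the minimum. A small sketch:

```python
s = pd.Series([1.0, np.nan, 2.0])

# min_periods counts non-NaN observations: the second window still
# contains only one valid value, so its result is NaN too.
print(s.expanding(min_periods=2).mean())
# 0    NaN
# 1    NaN
# 2    1.5
# dtype: float64
```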
Real-world Example: Expanding Financial Metrics
Let's look at a more practical example using financial data. We'll calculate expanding metrics like:
- Cumulative Return
- Expanding Average Daily Return
- Expanding Maximum Drawdown
```python
# Create a dataset mimicking financial stock prices
dates = pd.date_range('20230101', periods=15, freq='B')  # Business days
np.random.seed(42)
stock_prices = np.random.normal(0, 1, 15).cumsum() + 100  # Random walk starting around 100

stock_df = pd.DataFrame({
    'price': stock_prices
}, index=dates)

# Calculate daily returns
stock_df['daily_return'] = stock_df['price'].pct_change() * 100  # in percentage

# Calculate expanding metrics
stock_df['cum_return'] = (stock_df['price'] / stock_df['price'].iloc[0] - 1) * 100
stock_df['exp_avg_return'] = stock_df['daily_return'].expanding().mean()

# Function to calculate the maximum drawdown of a price series
def max_drawdown(series):
    if len(series) < 2:
        return 0.0  # a single observation has no drawdown
    rolling_max = series.cummax()
    drawdown = (series / rolling_max - 1) * 100
    return drawdown.min()

stock_df['exp_max_drawdown'] = stock_df['price'].expanding().apply(max_drawdown)

print("\nStock Price Analysis with Expanding Metrics:")
print(stock_df.round(4))  # rounded for a compact display
```
Output:
```
Stock Price Analysis with Expanding Metrics:
               price  daily_return  cum_return  exp_avg_return  exp_max_drawdown
2023-01-02  100.4967           NaN      0.0000             NaN            0.0000
2023-01-03  100.3584       -0.1376     -0.1376         -0.1376           -0.1376
2023-01-04  101.0061        0.6454      0.5069          0.2539           -0.1376
2023-01-05  102.5292        1.5079      2.0224          0.6719           -0.1376
2023-01-06  102.2950       -0.2284      1.7894          0.4468           -0.2284
2023-01-09  102.0609       -0.2289      1.5564          0.3117           -0.4567
2023-01-10  103.6401        1.5473      3.1278          0.5176           -0.4567
2023-01-11  104.4075        0.7405      3.8915          0.5495           -0.4567
2023-01-12  103.9381       -0.4497      3.4243          0.4246           -0.4567
2023-01-13  104.4806        0.5220      3.9642          0.4354           -0.4567
2023-01-16  104.0172       -0.4435      3.5031          0.3475           -0.4567
2023-01-17  103.5515       -0.4477      3.0397          0.2752           -0.8893
2023-01-18  103.7934        0.2337      3.2804          0.2717           -0.8893
2023-01-19  101.8801       -1.8434      1.8741          0.1090           -2.4889
2023-01-20  100.1552       -1.6931     -0.3398         -0.0197           -4.1399
```
This example shows how expanding metrics can help track the cumulative performance of a financial asset over time.
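A side note on performance: `expanding().apply()` re-runs the Python function on every prefix, which is O(n²) overall. For this particular metric, the same result can be computed in vectorized form: take each point's drawdown from the running peak, then the running minimum of those drawdowns. A sketch of the equivalent computation:

```python
# Vectorized equivalent of the expanding maximum drawdown: drawdown
# relative to the running peak, then the running minimum of it.
drawdown = (stock_df['price'] / stock_df['price'].cummax() - 1) * 100
vectorized_mdd = drawdown.cummin()

print(np.allclose(vectorized_mdd, stock_df['exp_max_drawdown']))  # True
```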
When to Use Expanding Windows
Expanding windows are particularly useful in these scenarios:
- Cumulative Metrics: When you need to calculate values that should include all previous data points
- Growing Datasets: When your dataset continuously grows and you want to include all historical data
- Time Series Analysis: For tracking long-term trends and patterns
- Financial Analysis: For calculating cumulative returns, moving averages with growing periods, etc.
Expanding vs. Rolling Windows
While both expanding and rolling windows are used for time series analysis, they serve different purposes:
- Expanding windows: Include all previous observations up to the current point
- Rolling windows: Maintain a fixed-size window that moves through the data
Here's a quick comparison:
| Feature | Expanding Window | Rolling Window |
|---|---|---|
| Window size | Grows over time | Fixed size |
| First calculation | After `min_periods` | After `window_size` |
| Memory usage | Higher (keeps all data) | Lower (only keeps window data) |
| Use case | Cumulative metrics | Local trends |
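The difference is easy to see side by side; here is a small comparison on the series from earlier, using a rolling window of 3:

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9, 4, 8, 6, 10, 2, 5])

comparison = pd.DataFrame({
    'value': s,
    'expanding_mean': s.expanding().mean(),     # all observations so far
    'rolling_mean': s.rolling(window=3).mean()  # only the last 3
})
print(comparison)
```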
Summary
Pandas' expanding window functionality provides a powerful tool for calculating cumulative statistics and analyzing how metrics evolve over time. Key points to remember:
- Expanding windows start small and grow to include all previous observations
- Use `expanding()` followed by an aggregation function like `mean()`, `sum()`, or `std()`
- The `min_periods` parameter controls the minimum observations required
- Custom functions can be applied using the `apply()` method
- Expanding windows are great for cumulative metrics, especially in financial and time series analysis
Further Exercises
- Create a dataset of daily temperatures and calculate the expanding mean, min, and max to track temperature trends
- Calculate the expanding correlation between two financial assets over time
- Create a custom function to calculate the Sharpe ratio (return/risk) using expanding windows
- Compare expanding window results with rolling window results for the same dataset
- Use expanding windows to create a cumulative sum of precipitation data and visualize rainfall accumulation over time
Additional Resources
- Pandas Official Documentation on Window Functions
- Time Series Analysis with Pandas
- Practical Business Python: Pandas Window Functions
With expanding windows, you can gain insights into cumulative patterns and trends in your time series data, making them an essential tool for any data analysis workflow.