
Pandas Expanding Aggregation

When analyzing time series data, you'll often need to calculate cumulative statistics that incorporate all previous values up to the current point. This is where pandas' expanding window functionality comes in handy. Unlike rolling windows, which maintain a fixed size, expanding windows grow as they progress through the data.

What is an Expanding Window?

An expanding window starts with a minimum number of observations and then expands to include all previous observations for each calculation point. This is particularly useful when you want to:

  • Calculate cumulative statistics
  • Analyze how metrics evolve over time
  • Smooth data by considering all historical information

Basic Syntax

The basic syntax for expanding window operations in pandas is:

```python
df.expanding(min_periods=1).aggregate_function()
```

Where:

  • min_periods: The minimum number of observations required to have a value (default is 1)
  • aggregate_function: The function to apply (e.g., mean(), sum(), std(), etc.)
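
For the most common aggregations, expanding results line up with pandas' cumulative methods; a quick sanity check on a made-up series:

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9, 4])

# expanding().sum() is the cumulative sum (returned as float64)
print(s.expanding().sum().tolist())  # [3.0, 10.0, 15.0, 24.0, 28.0]
print(s.cumsum().tolist())           # [3, 10, 15, 24, 28]
```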

Simple Expanding Window Examples

Let's start with some basic examples to understand how expanding windows work:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a simple time series
dates = pd.date_range('20230101', periods=10)
df = pd.DataFrame({
    'value': [3, 7, 5, 9, 4, 8, 6, 10, 2, 5]
}, index=dates)

print("Original data:")
print(df)
```

Output:

```
Original data:
            value
2023-01-01      3
2023-01-02      7
2023-01-03      5
2023-01-04      9
2023-01-05      4
2023-01-06      8
2023-01-07      6
2023-01-08     10
2023-01-09      2
2023-01-10      5
```

Expanding Mean

Let's calculate the expanding (cumulative) mean of our data:

```python
df['exp_mean'] = df['value'].expanding().mean()
print("\nExpanding Mean:")
print(df)
```

Output:

```
Expanding Mean:
            value  exp_mean
2023-01-01      3  3.000000
2023-01-02      7  5.000000
2023-01-03      5  5.000000
2023-01-04      9  6.000000
2023-01-05      4  5.600000
2023-01-06      8  6.000000
2023-01-07      6  6.000000
2023-01-08     10  6.500000
2023-01-09      2  6.000000
2023-01-10      5  5.900000
```

In this example, each value in the exp_mean column is the average of all values in the value column up to and including the current row.
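The expanding mean can also be reproduced by hand as the cumulative sum divided by the number of observations so far; a small check on a toy series:

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9, 4])

exp_mean = s.expanding().mean()
# cumulative sum divided by the running observation count
manual = s.cumsum() / pd.Series(range(1, len(s) + 1))

print(exp_mean.tolist())  # [3.0, 5.0, 5.0, 6.0, 5.6]
```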

Expanding Sum and Standard Deviation

We can also calculate other statistics, such as cumulative sum and standard deviation:

```python
df['exp_sum'] = df['value'].expanding().sum()
df['exp_std'] = df['value'].expanding().std()

print("\nExpanding Sum and Standard Deviation:")
print(df)
```

Output:

```
Expanding Sum and Standard Deviation:
            value  exp_mean  exp_sum   exp_std
2023-01-01      3  3.000000      3.0       NaN
2023-01-02      7  5.000000     10.0  2.828427
2023-01-03      5  5.000000     15.0  2.000000
2023-01-04      9  6.000000     24.0  2.581989
2023-01-05      4  5.600000     28.0  2.408319
2023-01-06      8  6.000000     36.0  2.366432
2023-01-07      6  6.000000     42.0  2.160247
2023-01-08     10  6.500000     52.0  2.618615
2023-01-09      2  6.000000     54.0  2.828427
2023-01-10      5  5.900000     59.0  2.685144
```

Notice that the standard deviation for the first row is NaN because you need at least two observations to calculate a standard deviation.
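By default expanding().std() computes the sample standard deviation (ddof=1), which is exactly what makes the first row undefined; passing ddof=0 gives the population version, which is defined from the very first observation:

```python
import pandas as pd

s = pd.Series([3, 7, 5])

# Sample standard deviation (ddof=1): undefined for a single observation
print(s.expanding().std().round(6).tolist())        # [nan, 2.828427, 2.0]

# Population standard deviation (ddof=0): defined from the first row
print(s.expanding().std(ddof=0).round(6).tolist())  # [0.0, 2.0, 1.632993]
```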

Visualizing Expanding Window Results

Visualizing the original data along with the expanding statistics can help understand the effect:

```python
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['value'], label='Original Data', marker='o')
plt.plot(df.index, df['exp_mean'], label='Expanding Mean', linestyle='--')
plt.title('Original Data vs Expanding Mean')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```

Using Custom Functions with Expanding Windows

You can also apply custom functions to expanding windows using the apply() method:

```python
# Define a custom function to calculate the range (max - min)
def data_range(x):
    return x.max() - x.min()

# Apply the custom function
df['exp_range'] = df['value'].expanding().apply(data_range)

print("\nExpanding Range (Max - Min):")
print(df[['value', 'exp_range']])
```

Output:

```
Expanding Range (Max - Min):
            value  exp_range
2023-01-01      3        0.0
2023-01-02      7        4.0
2023-01-03      5        4.0
2023-01-04      9        6.0
2023-01-05      4        6.0
2023-01-06      8        6.0
2023-01-07      6        6.0
2023-01-08     10        7.0
2023-01-09      2        8.0
2023-01-10      5        8.0
```
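By default, apply() builds a pandas Series for every window, which gets slow on long data. If the function only needs the raw values, passing raw=True hands it a NumPy array instead; NumPy's np.ptp ("peak to peak") computes the same max minus min range:

```python
import pandas as pd
import numpy as np

s = pd.Series([3, 7, 5, 9, 4])

# raw=True passes each window as a NumPy array (cheaper than a Series);
# np.ptp computes max - min directly
exp_range = s.expanding().apply(np.ptp, raw=True)
print(exp_range.tolist())  # [0.0, 4.0, 4.0, 6.0, 6.0]
```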

Multiple Aggregations at Once

You can compute multiple expanding aggregations at once using the agg() method:

```python
# Calculate multiple statistics at once
multiple_stats = df['value'].expanding().agg(['mean', 'sum', 'std', 'min', 'max'])
print("\nMultiple Expanding Aggregations:")
print(multiple_stats)
```

Output:

```
Multiple Expanding Aggregations:
                mean   sum       std  min   max
2023-01-01  3.000000   3.0       NaN  3.0   3.0
2023-01-02  5.000000  10.0  2.828427  3.0   7.0
2023-01-03  5.000000  15.0  2.000000  3.0   7.0
2023-01-04  6.000000  24.0  2.581989  3.0   9.0
2023-01-05  5.600000  28.0  2.408319  3.0   9.0
2023-01-06  6.000000  36.0  2.366432  3.0   9.0
2023-01-07  6.000000  42.0  2.160247  3.0   9.0
2023-01-08  6.500000  52.0  2.618615  3.0  10.0
2023-01-09  6.000000  54.0  2.828427  2.0  10.0
2023-01-10  5.900000  59.0  2.685144  2.0  10.0
```
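On a DataFrame, agg() also accepts a dict mapping column names to functions, so different columns can receive different expanding statistics (a small sketch with made-up columns):

```python
import pandas as pd

df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Different expanding aggregation per column
stats = df2.expanding().agg({'a': 'mean', 'b': 'sum'})
print(stats)
# a: 1.0, 1.5, 2.0 ; b: 4.0, 9.0, 15.0
```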

Setting Minimum Observations with min_periods

The min_periods parameter controls the minimum number of observations needed for calculation:

```python
# Set min_periods to 3
df['exp_mean_min3'] = df['value'].expanding(min_periods=3).mean()

print("\nExpanding Mean with min_periods=3:")
print(df[['value', 'exp_mean', 'exp_mean_min3']])
```

Output:

```
Expanding Mean with min_periods=3:
            value  exp_mean  exp_mean_min3
2023-01-01      3  3.000000            NaN
2023-01-02      7  5.000000            NaN
2023-01-03      5  5.000000       5.000000
2023-01-04      9  6.000000       6.000000
2023-01-05      4  5.600000       5.600000
2023-01-06      8  6.000000       6.000000
2023-01-07      6  6.000000       6.000000
2023-01-08     10  6.500000       6.500000
2023-01-09      2  6.000000       6.000000
2023-01-10      5  5.900000       5.900000
```

Notice that the first two values of exp_mean_min3 are NaN because we required at least 3 observations.
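min_periods doesn't change the computed values; it only masks the rows that have too few observations. The same result can be produced by masking the full expanding mean with the expanding count:

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9, 4])

masked = s.expanding(min_periods=3).mean()

# Equivalent: full expanding mean, NaN wherever fewer than 3 observations
manual = s.expanding().mean().where(s.expanding().count() >= 3)

print(masked.tolist())  # [nan, nan, 5.0, 6.0, 5.6]
```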

Real-world Example: Expanding Financial Metrics

Let's look at a more practical example using financial data. We'll calculate expanding metrics like:

  • Cumulative Return
  • Expanding Average Daily Return
  • Expanding Maximum Drawdown
```python
# Create a dataset mimicking financial stock prices
dates = pd.date_range('20230101', periods=15, freq='B')  # business days
np.random.seed(42)
stock_prices = np.random.normal(0, 1, 15).cumsum() + 100  # random walk starting around 100

stock_df = pd.DataFrame({
    'price': stock_prices
}, index=dates)

# Calculate daily returns in percent
stock_df['daily_return'] = stock_df['price'].pct_change() * 100

# Calculate expanding metrics
stock_df['cum_return'] = (stock_df['price'] / stock_df['price'].iloc[0] - 1) * 100
stock_df['exp_avg_return'] = stock_df['daily_return'].expanding().mean()

# Function to calculate the maximum drawdown of a price series
def max_drawdown(series):
    if len(series) < 2:
        return 0
    running_max = series.cummax()
    drawdown = (series / running_max - 1) * 100
    return drawdown.min()

stock_df['exp_max_drawdown'] = stock_df['price'].expanding(min_periods=2).apply(max_drawdown)

print("\nStock Price Analysis with Expanding Metrics:")
print(stock_df)
```

Output:

```
Stock Price Analysis with Expanding Metrics:
              price  daily_return  cum_return  exp_avg_return  exp_max_drawdown
2023-01-02  100.496           NaN    0.496014             NaN               NaN
2023-01-03   99.251     -1.240204   -0.748800       -1.240204         -1.240204
2023-01-04  100.552      1.311828    0.551790        0.035812         -1.240204
2023-01-05  101.924      1.364081    1.924153        0.478568         -1.240204
2023-01-06  101.850     -0.072353    1.850444        0.340838         -1.240204
2023-01-09  101.969      0.117108    1.969007        0.296092         -1.240204
2023-01-10   99.651     -2.273563   -0.348945       -0.132184         -2.273563
2023-01-11   99.405     -0.246764   -0.594719       -0.148552         -2.490351
2023-01-12  100.260      0.859314    0.260455       -0.022569         -2.490351
2023-01-13  101.118      0.857029    1.118042        0.070727         -2.490351
2023-01-16  100.082     -1.025367    0.082117       -0.034889         -2.490351
2023-01-17  101.851      1.767635    1.851345        0.132804         -2.490351
2023-01-18  101.211     -0.627892    1.211305        0.065880         -2.490351
2023-01-19  101.199     -0.011338    1.199338        0.059894         -2.490351
2023-01-20  100.001     -1.183604    0.001032       -0.028935         -2.490351
```

Note that the first row's exp_max_drawdown is NaN because min_periods=2 means the drawdown function is never called on a single-observation window.

This example shows how expanding metrics can help track the cumulative performance of a financial asset over time.

When to Use Expanding Windows

Expanding windows are particularly useful in these scenarios:

  1. Cumulative Metrics: When you need to calculate values that should include all previous data points
  2. Growing Datasets: When your dataset continuously grows and you want to include all historical data
  3. Time Series Analysis: For tracking long-term trends and patterns
  4. Financial Analysis: For calculating cumulative returns, moving averages with growing periods, etc.

Expanding vs. Rolling Windows

While both expanding and rolling windows are used for time series analysis, they serve different purposes:

  • Expanding windows: Include all previous observations up to the current point
  • Rolling windows: Maintain a fixed-size window that moves through the data

Here's a quick comparison:

| Feature           | Expanding Window        | Rolling Window                |
|-------------------|-------------------------|-------------------------------|
| Window size       | Grows over time         | Fixed size                    |
| First calculation | After min_periods       | After window_size             |
| Memory usage      | Higher (keeps all data) | Lower (only keeps window data)|
| Use case          | Cumulative metrics      | Local trends                  |
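
The difference is easy to see side by side on a single toy series (window of 3 for the rolling case):

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9, 4, 8])

comparison = pd.DataFrame({
    'value': s,
    'expanding_mean': s.expanding(min_periods=3).mean(),  # all history so far
    'rolling_mean': s.rolling(window=3).mean(),           # last 3 values only
})
print(comparison)
```

Once the window fills, the rolling mean keeps reacting to recent values, while the expanding mean settles toward the overall average.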

Summary

Pandas' expanding window functionality provides a powerful tool for calculating cumulative statistics and analyzing how metrics evolve over time. Key points to remember:

  • Expanding windows start small and grow to include all previous observations
  • Use expanding() followed by an aggregation function like mean(), sum(), or std()
  • The min_periods parameter controls the minimum observations required
  • Custom functions can be applied using the apply() method
  • Expanding windows are great for cumulative metrics, especially in financial and time series analysis

Further Exercises

  1. Create a dataset of daily temperatures and calculate the expanding mean, min, and max to track temperature trends
  2. Calculate the expanding correlation between two financial assets over time
  3. Create a custom function to calculate the Sharpe ratio (return/risk) using expanding windows
  4. Compare expanding window results with rolling window results for the same dataset
  5. Use expanding windows to create a cumulative sum of precipitation data and visualize rainfall accumulation over time


With expanding windows, you can gain insights into cumulative patterns and trends in your time series data, making them an essential tool for any data analysis workflow.
