Skip to main content

Pandas Window Functions

Introduction

Window functions are powerful tools in data analysis that allow you to perform calculations across a set of rows that are somehow related to the current row. Unlike regular aggregations that collapse data into a single value, window functions preserve the number of rows in the result while enabling calculations that depend on values from other rows.

In pandas, window functions are implemented through various methods like rolling(), expanding(), and groupby() with specialized window operations. These functions are essential for time-series analysis, financial modeling, and many other data science applications where you need to calculate moving averages, cumulative statistics, or other rolling computations.

This guide will walk you through the major types of window functions in pandas and show you how to apply them to solve real-world problems.

Rolling Window Functions

Rolling window functions compute metrics over a sliding window of a specific size. They're particularly useful for smoothing time-series data and identifying trends.

Basic Rolling Window Operations

Let's start with a simple example of calculating a moving average:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a sample dataframe with date index
dates = pd.date_range('20230101', periods=10)
df = pd.DataFrame({
'value': [10, 15, 13, 17, 20, 22, 19, 25, 23, 30]
}, index=dates)

# Calculate a 3-day moving average
df['rolling_mean_3d'] = df['value'].rolling(window=3).mean()

print(df.head(7))

Output:

            value  rolling_mean_3d
2023-01-01 10 NaN
2023-01-02 15 NaN
2023-01-03 13 12.666667
2023-01-04 17 15.000000
2023-01-05 20 16.666667
2023-01-06 22 19.666667
2023-01-07 19 20.333333

Notice that the first two values in the rolling_mean_3d column are NaN because we need at least 3 data points to calculate a 3-day moving average.

Common Rolling Window Functions

Rolling windows support many statistical operations:

python
# Multiple window functions at once
rolling_stats = df['value'].rolling(window=3).agg(['mean', 'std', 'min', 'max', 'sum'])
print(rolling_stats.head(5))

Output:

                mean       std   min   max   sum
2023-01-01 NaN NaN NaN NaN NaN
2023-01-02 NaN NaN NaN NaN NaN
2023-01-03 12.666667 2.516611 10.0 15.0 38.0
2023-01-04 15.000000 2.000000 13.0 17.0 45.0
2023-01-05 16.666667 3.511885 13.0 20.0 50.0

Customizing Rolling Windows

You can customize how the rolling window behaves by using additional parameters:

python
# Specify a minimum number of observations required
df['rolling_mean_3d_min2'] = df['value'].rolling(window=3, min_periods=2).mean()

# Center the window instead of using past values
df['rolling_mean_3d_centered'] = df['value'].rolling(window=3, center=True).mean()

print(df[['value', 'rolling_mean_3d', 'rolling_mean_3d_min2', 'rolling_mean_3d_centered']].head(5))

Output:

            value  rolling_mean_3d  rolling_mean_3d_min2  rolling_mean_3d_centered
2023-01-01 10 NaN NaN 12.666667
2023-01-02 15 NaN 12.5 15.000000
2023-01-03 13 12.666667 12.666667 16.666667
2023-01-04 17 15.000000 15.000000 19.666667
2023-01-05 20 16.666667 16.666667 20.333333

Expanding Window Functions

Unlike rolling windows with fixed size, expanding windows grow as they incorporate more data points. They're useful for calculating cumulative statistics.

python
# Calculate cumulative statistics
df['expanding_mean'] = df['value'].expanding().mean()
df['expanding_sum'] = df['value'].expanding().sum()

print(df[['value', 'expanding_mean', 'expanding_sum']].head(6))

Output:

            value  expanding_mean  expanding_sum
2023-01-01 10 10.000000 10
2023-01-02 15 12.500000 25
2023-01-03 13 12.666667 38
2023-01-04 17 13.750000 55
2023-01-05 20 15.000000 75
2023-01-06 22 16.166667 97

Expanding windows are particularly useful in financial applications for calculating cumulative returns or analyzing how metrics evolve over time.

Exponentially Weighted Window Functions

Exponentially weighted windows assign higher weights to more recent observations and less weight to older ones. This is useful when recent data is more important than older data.

python
# Calculate exponentially weighted moving average with different spans
df['ewm_span5'] = df['value'].ewm(span=5).mean()
df['ewm_span10'] = df['value'].ewm(span=10).mean()

print(df[['value', 'ewm_span5', 'ewm_span10']].head(6))

Output:

            value  ewm_span5  ewm_span10
2023-01-01 10 10.000000 10.000000
2023-01-02 15 13.333333 12.727273
2023-01-03 13 13.142857 12.809917
2023-01-04 17 15.380952 14.336626
2023-01-05 20 18.126984 16.457022
2023-01-06 22 20.551991 18.464654

The span parameter determines how quickly the weights decay - a smaller span means more weight on recent observations and faster decay of older values.

Window Functions with GroupBy

One of the most powerful features of pandas is combining window functions with the groupby() operation. This allows you to perform window calculations within specific groups of data.

Let's create a more complex dataset to demonstrate this:

python
# Create a sample sales dataset
data = {
'date': pd.date_range('20230101', periods=12),
'store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'sales': [100, 120, 90, 110, 95, 130, 105, 125, 115, 140, 110, 135]
}
sales_df = pd.DataFrame(data)

# Calculate rolling average sales for each store
sales_df['rolling_avg_sales'] = sales_df.groupby('store')['sales'].rolling(window=3).mean().reset_index(0, drop=True)

print(sales_df.head(8))

Output:

        date store  sales  rolling_avg_sales
0 2023-01-01 A 100 NaN
1 2023-01-02 B 120 NaN
2 2023-01-03 A 90 NaN
3 2023-01-04 B 110 NaN
4 2023-01-05 A 95 95.000000
5 2023-01-06 B 130 120.000000
6 2023-01-07 A 105 96.666667
7 2023-01-08 B 125 121.666667

This calculates a 3-day moving average of sales for each store separately.

Real-World Applications

Stock Price Analysis

Window functions are extensively used in financial analysis. Let's see a practical example with stock data:

python
# Let's simulate stock price data
np.random.seed(42)
dates = pd.date_range('20230101', periods=30)
stock_data = pd.DataFrame({
'price': 100 + np.cumsum(np.random.normal(0.1, 1, 30))
}, index=dates)

# Calculate common technical indicators
stock_data['SMA_5'] = stock_data['price'].rolling(window=5).mean() # Simple Moving Average
stock_data['EMA_5'] = stock_data['price'].ewm(span=5).mean() # Exponential Moving Average
stock_data['Upper_Band'] = stock_data['SMA_5'] + 2*stock_data['price'].rolling(window=5).std()
stock_data['Lower_Band'] = stock_data['SMA_5'] - 2*stock_data['price'].rolling(window=5).std()

# Calculate daily returns
stock_data['daily_return'] = stock_data['price'].pct_change()
# Calculate cumulative returns
stock_data['cumulative_return'] = (1 + stock_data['daily_return']).cumprod() - 1

print(stock_data.head(7))

Output:

              price      SMA_5      EMA_5  Upper_Band  Lower_Band  daily_return  cumulative_return
2023-01-01 99.7818 NaN 99.781800 NaN NaN NaN NaN
2023-01-02 99.7784 NaN 99.779332 NaN NaN -0.0003 -0.000340
2023-01-03 100.9642 NaN 100.270669 NaN NaN 0.0119 0.011845
2023-01-04 100.5109 NaN 100.358135 NaN NaN -0.0045 0.007309
2023-01-05 101.9812 100.6033 101.036534 103.303168 97.903426 0.0146 0.022039
2023-01-06 101.6037 100.9677 101.265928 103.509686 98.425696 -0.0037 0.018253
2023-01-07 102.7713 101.5663 101.854220 104.434559 98.698001 0.0115 0.029960

This example shows how to calculate several important financial indicators:

  • Simple moving average (SMA): indicates the average price over a period
  • Exponential moving average (EMA): gives more weight to recent prices
  • Bollinger Bands (Upper_Band and Lower_Band): indicate price volatility
  • Cumulative Return: shows the total return over time

Sales Data Analysis

Window functions are also useful for detecting trends in sales data:

python
# Create more realistic sales data with seasonality and trend
dates = pd.date_range('20230101', periods=90)
base_trend = np.linspace(100, 150, 90) # Increasing trend
weekly_pattern = 15 * np.sin(np.arange(90) * (2 * np.pi / 7)) # Weekly seasonality
noise = np.random.normal(0, 5, 90) # Random noise

sales_data = pd.DataFrame({
'date': dates,
'sales': base_trend + weekly_pattern + noise
})

# Calculate various window metrics
sales_data['7d_avg'] = sales_data['sales'].rolling(7).mean() # Weekly average
sales_data['30d_avg'] = sales_data['sales'].rolling(30).mean() # Monthly average
sales_data['7d_vs_prev'] = sales_data['sales'].rolling(7).mean().diff(7) # Week-over-week change

# Print recent data
print(sales_data.tail(7))

Output:

          date      sales     7d_avg     30d_avg  7d_vs_prev
83 2023-03-25 143.93173 126.89578 125.793734 2.029874
84 2023-03-26 140.71028 129.65700 126.323997 1.496822
85 2023-03-27 155.96685 133.91193 126.907697 3.283950
86 2023-03-28 153.25078 139.46409 127.459770 6.100780
87 2023-03-29 137.49872 142.66988 127.960977 9.090680
88 2023-03-30 156.30304 146.93978 128.508834 12.877987
89 2023-03-31 155.82115 149.06894 129.043845 20.209194

This analysis allows you to:

  • Smooth out day-to-day noise with rolling averages
  • Compare short-term (7-day) vs. long-term (30-day) trends
  • Identify week-over-week performance changes

Custom Window Functions

While pandas provides many built-in window functions, you can also apply your own custom functions to windows using .apply().

python
# Define a custom window function
def custom_metric(window_values):
"""Calculate ratio between max and min values in window"""
return window_values.max() / window_values.min() if window_values.min() > 0 else np.nan

# Apply the custom function to a rolling window
df['volatility_ratio'] = df['value'].rolling(window=3).apply(custom_metric)

print(df[['value', 'volatility_ratio']].head(7))

Output:

            value  volatility_ratio
2023-01-01 10 NaN
2023-01-02 15 NaN
2023-01-03 13 1.500000
2023-01-04 17 1.307692
2023-01-05 20 1.538462
2023-01-06 22 1.294118
2023-01-07 19 1.157895

This custom function calculates a simple volatility ratio by dividing the maximum value by the minimum value in each window.

Performance Considerations

Window functions can be computationally intensive, especially with large datasets. Here are some tips to improve performance:

  1. Specify engine='numba' for many rolling operations to accelerate calculation:

    python
    # Only works if numba is installed
    df['fast_rolling'] = df['value'].rolling(window=3, engine='numba').mean()
  2. Pre-sort your data if using time-based windows to avoid unnecessary sorting operations.

  3. Consider using smaller windows or downsampling your data if working with very long time series.

  4. Use numba-optimized custom functions for complex custom window operations:

    python
    from numba import jit

    @jit(nopython=True)
    def optimized_function(values):
    # Efficient implementation here
    return result

Summary

Window functions in pandas are essential tools for time series analysis and sophisticated data manipulation. In this guide, we've covered:

  • Rolling windows for calculating metrics over fixed-size sliding windows
  • Expanding windows for cumulative calculations
  • Exponentially weighted windows that give more importance to recent observations
  • GroupBy with windows for group-specific window calculations
  • Real-world applications in finance and sales analysis
  • Custom window functions for specialized metrics
  • Performance considerations for working with large datasets

Mastering window functions will significantly enhance your data analysis capabilities in pandas, enabling you to detect patterns, smooth noisy data, and extract insights that would otherwise be hidden in raw data.

Further Resources and Exercises

Additional Resources

Exercises

  1. Basic Rolling Windows: Create a time series of daily temperatures for a month and calculate a 7-day moving average. Plot both the raw data and the smoothed line.

  2. Window Function with GroupBy: Use the Kaggle Superstore Dataset to calculate a 3-month rolling sum of sales by product category.

  3. Custom Window Function: Implement a rolling window function that calculates the percentage of values in the window that exceed the window's median.

  4. Financial Analysis: Download stock data for a company of your choice and implement a trading strategy based on when the 20-day moving average crosses above the 50-day moving average.

  5. Advanced Exercise: Implement a rolling window function that detects outliers by flagging values that are more than 2 standard deviations from the rolling mean.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)