Pandas Seasonal Decomposition
Time series data often contains multiple underlying patterns that contribute to the values we observe. To better understand these patterns and make more accurate predictions, we can decompose a time series into its fundamental components. In this tutorial, we'll explore seasonal decomposition of pandas time series using the statsmodels library, a technique for breaking time series data down into those components.
Introduction to Time Series Decomposition
Time series decomposition splits a time series into several component parts, typically:
- Trend - The long-term progression of the series (upward or downward)
- Seasonality - Regular patterns that repeat at fixed intervals
- Residual (or irregular) - Random fluctuations that cannot be attributed to trend or seasonality
Understanding these components helps data analysts and scientists better understand the data, remove unwanted components, and build more accurate forecasting models.
Prerequisites
Before we begin, make sure you have the following libraries installed:
# Install required packages if needed
# !pip install pandas numpy matplotlib statsmodels
Let's import the necessary libraries for our examples:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
Basic Seasonal Decomposition
The seasonal_decompose function from the statsmodels library makes it easy to decompose a time series into its components. Let's create a simple example with synthetic data:
# Create a date range
dates = pd.date_range(start='2020-01-01', periods=730, freq='D')
# Create a time series with trend and seasonality
trend = np.linspace(10, 30, 730) # Increasing trend
seasonality = 5 * np.sin(np.arange(730) * (2 * np.pi / 365)) # Yearly seasonality
noise = np.random.normal(0, 1, 730) # Random noise
# Combine components
ts_data = trend + seasonality + noise
# Create a pandas Series
time_series = pd.Series(ts_data, index=dates)
# Display the first few values
print(time_series.head())
Output (no random seed is set, so your values will differ):
2020-01-01 10.090039
2020-01-02 10.549292
2020-01-03 10.167876
2020-01-04 10.882820
2020-01-05 10.601168
Freq: D, dtype: float64
Let's visualize our time series:
plt.figure(figsize=(12, 6))
plt.plot(time_series)
plt.title('Synthetic Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Now, let's decompose this time series into its components:
decomposition = seasonal_decompose(time_series, model='additive', period=365)
# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 12))
decomposition.observed.plot(ax=ax1)
ax1.set_title('Observed')
decomposition.trend.plot(ax=ax2)
ax2.set_title('Trend')
decomposition.seasonal.plot(ax=ax3)
ax3.set_title('Seasonality')
decomposition.resid.plot(ax=ax4)
ax4.set_title('Residuals')
plt.tight_layout()
plt.show()
Understanding the Components
Let's examine what each component represents:
1. Trend Component
The trend component shows the long-term progression of your time series. It answers questions like:
- Is the data generally increasing or decreasing over time?
- Are there any long-term cycles or patterns?
print("First 5 values of the trend component:")
print(decomposition.trend.head())
print("\nLast 5 values of the trend component:")
print(decomposition.trend.tail())
Output:
First 5 values of the trend component:
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 NaN
2020-01-04 NaN
2020-01-05 NaN
Freq: D, dtype: float64
Last 5 values of the trend component:
2021-12-26 NaN
2021-12-27 NaN
2021-12-28 NaN
2021-12-29 NaN
2021-12-30 NaN
Freq: D, dtype: float64
Notice that the trend component has NaN values at the beginning and end. This is because the trend is calculated using a centered moving average over one full period, which needs half a period of data on each side. With period=365, the first 182 and the last 182 values are NaN, so head() and tail() show only NaN here. Use decomposition.trend.dropna() to inspect the estimated values.
2. Seasonal Component
The seasonal component captures recurring patterns at fixed intervals:
print("Seasonal component pattern (first 14 days):")
print(decomposition.seasonal.head(14))
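Output (illustrative; because the series contains random noise, the values you get will differ and be less smooth):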
Seasonal component pattern (first 14 days):
2020-01-01 0.000000
2020-01-02 0.086083
2020-01-03 0.172055
2020-01-04 0.257805
2020-01-05 0.343225
2020-01-06 0.428204
2020-01-07 0.512634
2020-01-08 0.596405
2020-01-09 0.679411
2020-01-10 0.761544
2020-01-11 0.842701
2020-01-12 0.922776
2020-01-13 1.001670
2020-01-14 1.079282
Freq: D, dtype: float64
The seasonal component repeats with the period specified (365 days in our example), showing how values fluctuate within each cycle.
3. Residual Component
Residuals represent what's left after removing trend and seasonality—the "unexplained" part of your data:
print("Residual component statistics:")
print(f"Mean: {decomposition.resid.mean()}")
print(f"Standard Deviation: {decomposition.resid.std()}")
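With this particular setup the residual statistics are not very informative. The data covers only two periods, and once the NaN edges of the trend are dropped just 366 detrended values remain, roughly one per day of the cycle. Each seasonal average is then based on a single observation, so the noise is absorbed into the seasonal component and the residuals come out almost constant, with a standard deviation far below the noise level of 1. Use several full cycles of data so that the seasonal averages can separate noise from seasonality.
In an ideal decomposition, residuals should look like random noise with no discernible pattern.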
Additive vs. Multiplicative Decomposition
There are two main models for time series decomposition:
- Additive: Y = Trend + Seasonality + Residual
  - Use when seasonal variations are consistent in magnitude over time
  - Our example above used an additive model
- Multiplicative: Y = Trend * Seasonality * Residual
  - Use when seasonal variations increase/decrease proportionally with the trend
Let's see how to use a multiplicative model:
# Create data with multiplicative seasonality
trend_mult = np.linspace(10, 50, 730)
seasonality_mult = 1 + 0.3 * np.sin(np.arange(730) * (2 * np.pi / 365))
noise_mult = np.random.normal(1, 0.05, 730)
# Combine components multiplicatively
ts_data_mult = trend_mult * seasonality_mult * noise_mult
# Create a pandas Series
time_series_mult = pd.Series(ts_data_mult, index=dates)
# Decompose using multiplicative model
decomposition_mult = seasonal_decompose(time_series_mult, model='multiplicative', period=365)
# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 12))
decomposition_mult.observed.plot(ax=ax1)
ax1.set_title('Observed (Multiplicative)')
decomposition_mult.trend.plot(ax=ax2)
ax2.set_title('Trend')
decomposition_mult.seasonal.plot(ax=ax3)
ax3.set_title('Seasonality')
decomposition_mult.resid.plot(ax=ax4)
ax4.set_title('Residuals')
plt.tight_layout()
plt.show()
Notice how in the multiplicative model:
- The seasonal component is expressed as factors (around 1.0)
- The amplitude of seasonal variations increases with the trend level
Real-World Example: Analyzing Weather Data
Let's apply seasonal decomposition to a more realistic scenario. We'll use synthetic monthly temperature data as a stand-in for real measurements:
# Generate synthetic monthly temperature data
# In real applications, you would load your own data
np.random.seed(42)
# 'MS' (month start) is used because the 'M' alias is deprecated in pandas 2.2+
dates = pd.date_range('2010-01-01', '2019-12-31', freq='MS')
temperatures = 20 + 10 * np.sin(np.arange(len(dates)) * (2 * np.pi / 12)) + np.linspace(0, 3, len(dates)) + np.random.normal(0, 2, len(dates))
temp_data = pd.Series(temperatures, index=dates)
# Plot the temperature data
plt.figure(figsize=(12, 6))
temp_data.plot()
plt.title('Monthly Temperature Data (2010-2019)')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()
# Decompose the temperature data
temp_decomposition = seasonal_decompose(temp_data, model='additive', period=12)
# Plot the decomposition
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 10))
temp_decomposition.observed.plot(ax=ax1)
ax1.set_title('Observed Temperature')
temp_decomposition.trend.plot(ax=ax2)
ax2.set_title('Temperature Trend')
temp_decomposition.seasonal.plot(ax=ax3)
ax3.set_title('Temperature Seasonality')
temp_decomposition.resid.plot(ax=ax4)
ax4.set_title('Temperature Residuals')
plt.tight_layout()
plt.show()
Analyzing the Results
From our temperature decomposition, we can observe:
- Trend Component: Shows the gradual warming of about 3 °C over the decade that was built into the synthetic data.
- Seasonal Component: Clearly shows the 12-month temperature cycle, with one peak and one trough each year.
- Residual Component: Captures the random noise. In real data it would also contain unusual weather events and other irregular fluctuations.
Handling Missing Values
Time series data often contains missing values, and seasonal_decompose raises an error if its input contains NaN. Let's see how to handle them:
# Create a copy of our time series with some missing values
ts_with_missing = time_series.copy()
# Randomly set 5% of the values to NaN
random_indices = np.random.choice(len(ts_with_missing), size=int(len(ts_with_missing) * 0.05), replace=False)
ts_with_missing.iloc[random_indices] = np.nan
print(f"Number of missing values: {ts_with_missing.isna().sum()}")
# Fill missing values: forward fill, then back fill in case the first value is missing
# (fillna(method='ffill') is deprecated in recent pandas versions)
ts_filled = ts_with_missing.ffill().bfill()
# Now we can decompose as before
decomposition_filled = seasonal_decompose(ts_filled, model='additive', period=365)
# Plot to verify the results
missing_mask = ts_with_missing.isna()
plt.figure(figsize=(12, 6))
plt.plot(time_series, label='Original Data')
plt.plot(ts_filled, 'g--', label='Filled Data')
# Mark the positions that were missing, using their filled-in values
plt.plot(ts_filled[missing_mask], 'r.', label='Filled-in Values', alpha=0.5)
plt.legend()
plt.title('Handling Missing Values in Time Series Data')
plt.show()
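Forward fill is only one option. For smoothly varying series, time-based interpolation often tracks the missing values more closely. The sketch below compares the two on an assumed noise-free series with evenly spaced gaps (the data and gap pattern are made up for illustration), using the true values to measure the error of each method.

```python
import numpy as np
import pandas as pd

# Assumed example data: a smooth series with a known "true" value at every date
n = 200
idx = pd.date_range("2020-01-01", periods=n, freq="D")
t = np.arange(n)
truth = pd.Series(10 + 0.05 * t + 5 * np.sin(2 * np.pi * t / 50), index=idx)

# Remove every 10th value, starting at position 5 (first and last values are kept)
gappy = truth.copy()
gappy.iloc[5::10] = np.nan
missing = gappy.isna()

# Two imputation strategies
forward_filled = gappy.ffill()
interpolated = gappy.interpolate(method="time")

# Mean absolute error at the positions that were missing
ffill_error = (forward_filled[missing] - truth[missing]).abs().mean()
interp_error = (interpolated[missing] - truth[missing]).abs().mean()
print(f"forward fill error:  {ffill_error:.4f}")
print(f"interpolation error: {interp_error:.4f}")
```

On noisy data the gap between the methods is smaller, so it is worth comparing them on your own series before settling on one.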
Practical Applications of Seasonal Decomposition
- Forecasting: Removing seasonality can improve forecasting models.
- Anomaly Detection: Unexpectedly large residuals can indicate unusual events.
- Seasonal Adjustment: Removing seasonality helps compare values across different time periods.
- Understanding Business Patterns: Identifying predictable cycles helps with planning.
Let's demonstrate an example of anomaly detection:
# Create a longer time series (10 years) with an anomaly. With only two years of data,
# each day of the year is observed just once after detrending, so a spike would be
# absorbed into the seasonal component instead of showing up in the residuals.
n_days = 3650
long_dates = pd.date_range(start='2010-01-01', periods=n_days, freq='D')
anomaly_ts = pd.Series(
    np.linspace(10, 30, n_days)
    + 5 * np.sin(np.arange(n_days) * (2 * np.pi / 365))
    + np.random.normal(0, 1, n_days),
    index=long_dates,
)
# Add an anomaly at a specific date
anomaly_date = pd.Timestamp('2015-07-15')
anomaly_ts[anomaly_date] += 15 # Add a spike
# Decompose the series
anomaly_decomp = seasonal_decompose(anomaly_ts, model='additive', period=365)
# Check if we can detect the anomaly in the residuals
plt.figure(figsize=(12, 6))
plt.plot(anomaly_decomp.resid)
plt.axhline(y=3*anomaly_decomp.resid.std(), color='r', linestyle='--', label='3σ Threshold')
plt.axhline(y=-3*anomaly_decomp.resid.std(), color='r', linestyle='--')
plt.scatter(anomaly_date, anomaly_decomp.resid[anomaly_date], color='red', s=100, label='Anomaly')
plt.title('Anomaly Detection using Residuals')
plt.legend()
plt.grid(True)
plt.show()
print(f"Residual value at anomaly date: {anomaly_decomp.resid[anomaly_date]}")
print(f"3-sigma threshold: {3*anomaly_decomp.resid.std()}")
Advanced Tips for Better Decomposition
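- Choosing the Right Period: Set period to the number of observations in one seasonal cycle (for example, 12 for monthly data with a yearly pattern, or 7 for daily data with a weekly pattern), and use data that covers several full cycles. A wrong period smears the seasonal pattern into the trend and residuals.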
- Handling Trend Changes: Consider using piecewise decomposition for data with trend shifts.
- Filtering: Pre-filtering data can sometimes improve decomposition results.
- Alternative Methods: For complex time series, consider STL decomposition (Seasonal and Trend decomposition using LOESS).
Example of using STL decomposition:
from statsmodels.tsa.seasonal import STL
stl = STL(time_series, period=365)
result = stl.fit()
# Plot the results
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 10))
ax1.plot(result.observed)
ax1.set_title('Original Series')
ax2.plot(result.trend)
ax2.set_title('Trend (STL)')
ax3.plot(result.seasonal)
ax3.set_title('Seasonal (STL)')
ax4.plot(result.resid)
ax4.set_title('Residual (STL)')
plt.tight_layout()
plt.show()
Summary
Seasonal decomposition is a powerful technique for understanding time series data by breaking it down into its fundamental components:
- Trend: The long-term progression of the data
- Seasonality: Regular, recurring patterns
- Residual: Random variations and noise
Using the seasonal_decompose function from statsmodels, we can easily perform this analysis in Python. The key benefits include:
- Better understanding of underlying patterns
- Improved forecasting by handling each component separately
- Anomaly detection using residual analysis
- Ability to remove seasonal effects for clearer trend analysis
Remember to choose between additive and multiplicative models based on whether seasonal variations are consistent (additive) or proportional to the trend level (multiplicative).
Exercises
- Download a dataset of monthly retail sales and decompose it using both additive and multiplicative models. Which model seems more appropriate and why?
- Create a synthetic time series with a complex seasonal pattern (e.g., both weekly and yearly cycles). Can seasonal decomposition still identify these patterns?
- Implement an anomaly detection system using seasonal decomposition that flags data points whose residuals exceed a certain threshold.
- Compare the results of seasonal decomposition and STL decomposition on a real-world dataset. What are the advantages and disadvantages of each method?
- Choose a time series with missing values. Compare different imputation methods (mean, median, forward-fill, etc.) and assess how they affect the decomposition results.