Pandas Stacking
In data analysis, you'll often need to reorganize your data to make it more suitable for specific analyses or visualizations. Pandas provides powerful operations called stack()
and unstack()
to transform data between wide and long formats. These functions are essential tools in the data reshaping toolkit.
Introduction to Stacking Operations
Stacking and unstacking are operations that pivot data between different representations:
- Stacking transforms columns into rows, making a "wider" DataFrame "longer"
- Unstacking transforms rows into columns, making a "longer" DataFrame "wider"
These operations are particularly useful when you need to:
- Prepare data for specific visualization libraries
- Perform calculations that require a particular data structure
- Normalize or denormalize your data
Let's dive into how these operations work with practical examples.
The stack()
Method
The stack()
method pivots a DataFrame's columns into index levels, creating a Series or DataFrame with a MultiIndex. In simpler terms, it converts columns into rows.
Basic Syntax
DataFrame.stack(level=-1, dropna=True)
Parameters:
level
: The level(s) to stack from the columnsdropna
: If True, drop rows that contain missing values
Basic Example
Let's see a simple example of stacking:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2']
})
print("Original DataFrame:")
print(df)
# Stack the DataFrame
stacked = df.stack()
print("\nStacked DataFrame:")
print(stacked)
Output:
Original DataFrame:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
Stacked DataFrame:
0 A A0
B B0
C C0
1 A A1
B B1
C C1
2 A A2
B B2
C C2
dtype: object
Notice how the column headers (A, B, C) have become part of the index in the stacked result, creating a MultiIndex.
The unstack()
Method
The unstack()
method does the opposite of stack()
. It pivots a level of the row index to become the column headers, creating a wider DataFrame.
Basic Syntax
Series.unstack(level=-1, fill_value=None)
DataFrame.unstack(level=-1, fill_value=None)
Parameters:
level
: The index level to unstackfill_value
: Value to use for missing values
Basic Example
Let's unstack our previous stacked DataFrame:
# Continuing from the previous example
unstacked = stacked.unstack()
print("Unstacked back to original:")
print(unstacked)
Output:
Unstacked back to original:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
The result is the same as our original DataFrame, demonstrating that stack()
and unstack()
are inverses of each other.
Working with MultiIndex DataFrames
Stacking and unstacking are particularly useful with MultiIndex (hierarchical) DataFrames.
Example with MultiIndex
# Create a DataFrame with MultiIndex
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'),
('B', 'one'), ('B', 'two')])
columns = ['x', 'y']
df_multi = pd.DataFrame(np.random.randn(4, 2), index=index, columns=columns)
print("MultiIndex DataFrame:")
print(df_multi)
# Stack the columns
stacked = df_multi.stack()
print("\nStacked MultiIndex DataFrame:")
print(stacked)
# Unstack one level
unstacked = stacked.unstack(level=1)
print("\nUnstacked at level 1:")
print(unstacked)
Output (values will vary due to random generation):
MultiIndex DataFrame:
x y
A one -0.173215 -0.861849
two -2.104569 1.024180
B one -0.509652 -1.039268
two -0.291694 -0.487393
Stacked MultiIndex DataFrame:
A one x -0.173215
y -0.861849
two x -2.104569
y 1.024180
B one x -0.509652
y -1.039268
two x -0.291694
y -0.487393
dtype: float64
Unstacked at level 1:
one two
A x -0.173215 -2.104569
y -0.861849 1.024180
B x -0.509652 -0.291694
y -1.039268 -0.487393
Specifying Levels for Stacking and Unstacking
When working with MultiIndex DataFrames, you can specify which level to stack or unstack:
# Continuing with our MultiIndex example
# Unstack the first level (index 0)
result = stacked.unstack(level=0)
print("Unstacked at level 0:")
print(result)
Output:
Unstacked at level 0:
A B
one x -0.173215 -0.509652
y -0.861849 -1.039268
two x -2.104569 -0.291694
y 1.024180 -0.487393
Handling Missing Values
When stacking or unstacking, you might encounter missing values. By default:
stack()
drops rows with NaN values ifdropna=True
unstack()
introduces NaN for missing combinations
# Create a DataFrame with some missing values
df_missing = pd.DataFrame({
'A': [1, 2, np.nan],
'B': [4, np.nan, 6]
})
print("DataFrame with missing values:")
print(df_missing)
# Stack with default dropna=True
stacked_drop = df_missing.stack()
print("\nStacked (dropna=True):")
print(stacked_drop)
# Stack with dropna=False
stacked_keep = df_missing.stack(dropna=False)
print("\nStacked (dropna=False):")
print(stacked_keep)
Output:
DataFrame with missing values:
A B
0 1.0 4.0
1 2.0 NaN
2 NaN 6.0
Stacked (dropna=True):
0 A 1.0
B 4.0
1 A 2.0
2 B 6.0
dtype: float64
Stacked (dropna=False):
0 A 1.0
B 4.0
1 A 2.0
B NaN
2 A NaN
B 6.0
dtype: float64
Practical Applications
Reshaping Time Series Data
Stacking and unstacking are particularly useful for time series data:
# Create a time series DataFrame
dates = pd.date_range('20230101', periods=3)
df_time = pd.DataFrame({
'Temperature': [20, 22, 21],
'Humidity': [45, 48, 50]
}, index=dates)
print("Time Series Data:")
print(df_time)
# Stack to get a long format
stacked_time = df_time.stack()
print("\nLong Format (stacked):")
print(stacked_time)
# You can reset the index for a tabular format
tabular_format = stacked_time.reset_index()
tabular_format.columns = ['Date', 'Measurement', 'Value']
print("\nTabular Long Format:")
print(tabular_format)
Output:
Time Series Data:
Temperature Humidity
2023-01-01 20 45
2023-01-02 22 48
2023-01-03 21 50
Long Format (stacked):
2023-01-01 Temperature 20
Humidity 45
2023-01-02 Temperature 22
Humidity 48
2023-01-03 Temperature 21
Humidity 50
dtype: int64
Tabular Long Format:
Date Measurement Value
0 2023-01-01 Temperature 20
1 2023-01-01 Humidity 45
2 2023-01-02 Temperature 22
3 2023-01-02 Humidity 48
4 2023-01-03 Temperature 21
5 2023-01-03 Humidity 50
Pivoting Survey Data
Imagine you have survey responses across different years that you want to reshape:
# Survey data
survey_data = pd.DataFrame({
'Year': [2020, 2020, 2021, 2021],
'Question': ['Q1', 'Q2', 'Q1', 'Q2'],
'Response': [4.2, 3.8, 4.5, 4.0]
})
print("Survey Data:")
print(survey_data)
# Reshape to have questions as columns and years as rows
pivoted = survey_data.pivot(index='Year', columns='Question', values='Response')
print("\nPivoted Survey Data:")
print(pivoted)
# Stack to return to long format
stacked_back = pivoted.stack()
print("\nStacked Back to Original Format:")
print(stacked_back)
Output:
Survey Data:
Year Question Response
0 2020 Q1 4.2
1 2020 Q2 3.8
2 2021 Q1 4.5
3 2021 Q2 4.0
Pivoted Survey Data:
Question Q1 Q2
Year
2020 4.2 3.8
2021 4.5 4.0
Stacked Back to Original Format:
Year Question
2020 Q1 4.2
Q2 3.8
2021 Q1 4.5
Q2 4.0
dtype: float64
Summary
Stacking and unstacking are powerful techniques in pandas for reshaping your data:
- Stack (
stack()
) converts columns to rows, making data longer - Unstack (
unstack()
) converts rows to columns, making data wider - You can specify which levels to stack/unstack in MultiIndex DataFrames
- These operations help prepare data for specific analyses or visualizations
Key points to remember:
- Stacking/unstacking operations create MultiIndex objects
- These functions are useful for transitioning between wide and long data formats
- You can control how missing values are handled with parameters like
dropna
andfill_value
Practice Exercises
-
Create a DataFrame with quarterly sales data for different products, then use stacking to transform it into a long format.
-
Using the DataFrame below, stack it and then unstack it at different levels:
pythondata = pd.DataFrame({
'Region': ['North', 'North', 'South', 'South'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 250],
'Profit': [20, 30, 40, 50]
}) -
Create a time series with multiple variables, stack it, and then format it for plotting with a visualization library.
Additional Resources
- Pandas Official Documentation on Reshaping
- Pandas Cookbook: Reshaping Data
- Data Reshaping in Python: pandas.melt() vs pandas.stack()
Happy data reshaping!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)