Skip to main content

Pandas Stacking

In data analysis, you'll often need to reorganize your data to make it more suitable for specific analyses or visualizations. Pandas provides powerful operations called stack() and unstack() to transform data between wide and long formats. These functions are essential tools in the data reshaping toolkit.

Introduction to Stacking Operations

Stacking and unstacking are operations that pivot data between different representations:

  • Stacking transforms columns into rows, making a "wider" DataFrame "longer"
  • Unstacking transforms rows into columns, making a "longer" DataFrame "wider"

These operations are particularly useful when you need to:

  • Prepare data for specific visualization libraries
  • Perform calculations that require a particular data structure
  • Normalize or denormalize your data

Let's dive into how these operations work with practical examples.

The stack() Method

The stack() method pivots a DataFrame's columns into index levels, creating a Series or DataFrame with a MultiIndex. In simpler terms, it converts columns into rows.

Basic Syntax

python
DataFrame.stack(level=-1, dropna=True)

Parameters:

  • level: The level(s) to stack from the columns
  • dropna: If True, drop rows that contain missing values

Basic Example

Let's see a simple example of stacking:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2']
})

print("Original DataFrame:")
print(df)

# Stack the DataFrame
stacked = df.stack()

print("\nStacked DataFrame:")
print(stacked)

Output:

Original DataFrame:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2

Stacked DataFrame:
0 A A0
B B0
C C0
1 A A1
B B1
C C1
2 A A2
B B2
C C2
dtype: object

Notice how the column headers (A, B, C) have become part of the index in the stacked result, creating a MultiIndex.

The unstack() Method

The unstack() method does the opposite of stack(). It pivots a level of the row index to become the column headers, creating a wider DataFrame.

Basic Syntax

python
Series.unstack(level=-1, fill_value=None)
DataFrame.unstack(level=-1, fill_value=None)

Parameters:

  • level: The index level to unstack
  • fill_value: Value to use for missing values

Basic Example

Let's unstack our previous stacked DataFrame:

python
# Continuing from the previous example
unstacked = stacked.unstack()

print("Unstacked back to original:")
print(unstacked)

Output:

Unstacked back to original:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2

The result is the same as our original DataFrame, demonstrating that stack() and unstack() are inverses of each other.

Working with MultiIndex DataFrames

Stacking and unstacking are particularly useful with MultiIndex (hierarchical) DataFrames.

Example with MultiIndex

python
# Create a DataFrame with MultiIndex
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'),
('B', 'one'), ('B', 'two')])
columns = ['x', 'y']

df_multi = pd.DataFrame(np.random.randn(4, 2), index=index, columns=columns)

print("MultiIndex DataFrame:")
print(df_multi)

# Stack the columns
stacked = df_multi.stack()

print("\nStacked MultiIndex DataFrame:")
print(stacked)

# Unstack one level
unstacked = stacked.unstack(level=1)

print("\nUnstacked at level 1:")
print(unstacked)

Output (values will vary due to random generation):

MultiIndex DataFrame:
x y
A one -0.173215 -0.861849
two -2.104569 1.024180
B one -0.509652 -1.039268
two -0.291694 -0.487393

Stacked MultiIndex DataFrame:
A one x -0.173215
y -0.861849
two x -2.104569
y 1.024180
B one x -0.509652
y -1.039268
two x -0.291694
y -0.487393
dtype: float64

Unstacked at level 1:
one two
A x -0.173215 -2.104569
y -0.861849 1.024180
B x -0.509652 -0.291694
y -1.039268 -0.487393

Specifying Levels for Stacking and Unstacking

When working with MultiIndex DataFrames, you can specify which level to stack or unstack:

python
# Continuing with our MultiIndex example
# Unstack the first level (index 0)
result = stacked.unstack(level=0)

print("Unstacked at level 0:")
print(result)

Output:

Unstacked at level 0:
A B
one x -0.173215 -0.509652
y -0.861849 -1.039268
two x -2.104569 -0.291694
y 1.024180 -0.487393

Handling Missing Values

When stacking or unstacking, you might encounter missing values. By default:

  • stack() drops rows with NaN values if dropna=True
  • unstack() introduces NaN for missing combinations
python
# Create a DataFrame with some missing values
df_missing = pd.DataFrame({
'A': [1, 2, np.nan],
'B': [4, np.nan, 6]
})

print("DataFrame with missing values:")
print(df_missing)

# Stack with default dropna=True
stacked_drop = df_missing.stack()

print("\nStacked (dropna=True):")
print(stacked_drop)

# Stack with dropna=False
stacked_keep = df_missing.stack(dropna=False)

print("\nStacked (dropna=False):")
print(stacked_keep)

Output:

DataFrame with missing values:
A B
0 1.0 4.0
1 2.0 NaN
2 NaN 6.0

Stacked (dropna=True):
0 A 1.0
B 4.0
1 A 2.0
2 B 6.0
dtype: float64

Stacked (dropna=False):
0 A 1.0
B 4.0
1 A 2.0
B NaN
2 A NaN
B 6.0
dtype: float64

Practical Applications

Reshaping Time Series Data

Stacking and unstacking are particularly useful for time series data:

python
# Create a time series DataFrame
dates = pd.date_range('20230101', periods=3)
df_time = pd.DataFrame({
'Temperature': [20, 22, 21],
'Humidity': [45, 48, 50]
}, index=dates)

print("Time Series Data:")
print(df_time)

# Stack to get a long format
stacked_time = df_time.stack()

print("\nLong Format (stacked):")
print(stacked_time)

# You can reset the index for a tabular format
tabular_format = stacked_time.reset_index()
tabular_format.columns = ['Date', 'Measurement', 'Value']

print("\nTabular Long Format:")
print(tabular_format)

Output:

Time Series Data:
Temperature Humidity
2023-01-01 20 45
2023-01-02 22 48
2023-01-03 21 50

Long Format (stacked):
2023-01-01 Temperature 20
Humidity 45
2023-01-02 Temperature 22
Humidity 48
2023-01-03 Temperature 21
Humidity 50
dtype: int64

Tabular Long Format:
Date Measurement Value
0 2023-01-01 Temperature 20
1 2023-01-01 Humidity 45
2 2023-01-02 Temperature 22
3 2023-01-02 Humidity 48
4 2023-01-03 Temperature 21
5 2023-01-03 Humidity 50

Pivoting Survey Data

Imagine you have survey responses across different years that you want to reshape:

python
# Survey data
survey_data = pd.DataFrame({
'Year': [2020, 2020, 2021, 2021],
'Question': ['Q1', 'Q2', 'Q1', 'Q2'],
'Response': [4.2, 3.8, 4.5, 4.0]
})

print("Survey Data:")
print(survey_data)

# Reshape to have questions as columns and years as rows
pivoted = survey_data.pivot(index='Year', columns='Question', values='Response')

print("\nPivoted Survey Data:")
print(pivoted)

# Stack to return to long format
stacked_back = pivoted.stack()

print("\nStacked Back to Original Format:")
print(stacked_back)

Output:

Survey Data:
Year Question Response
0 2020 Q1 4.2
1 2020 Q2 3.8
2 2021 Q1 4.5
3 2021 Q2 4.0

Pivoted Survey Data:
Question Q1 Q2
Year
2020 4.2 3.8
2021 4.5 4.0

Stacked Back to Original Format:
Year Question
2020 Q1 4.2
Q2 3.8
2021 Q1 4.5
Q2 4.0
dtype: float64

Summary

Stacking and unstacking are powerful techniques in pandas for reshaping your data:

  • Stack (stack()) converts columns to rows, making data longer
  • Unstack (unstack()) converts rows to columns, making data wider
  • You can specify which levels to stack/unstack in MultiIndex DataFrames
  • These operations help prepare data for specific analyses or visualizations

Key points to remember:

  • Stacking/unstacking operations create MultiIndex objects
  • These functions are useful for transitioning between wide and long data formats
  • You can control how missing values are handled with parameters like dropna and fill_value

Practice Exercises

  1. Create a DataFrame with quarterly sales data for different products, then use stacking to transform it into a long format.

  2. Using the DataFrame below, stack it and then unstack it at different levels:

    python
    data = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
    'Profit': [20, 30, 40, 50]
    })
  3. Create a time series with multiple variables, stack it, and then format it for plotting with a visualization library.

Additional Resources

Happy data reshaping!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)