Pandas Stacking

In data analysis, you'll often need to reorganize your data to make it more suitable for specific analyses or visualizations. Pandas provides powerful operations called stack() and unstack() to transform data between wide and long formats. These functions are essential tools in the data reshaping toolkit.

Introduction to Stacking Operations

Stacking and unstacking are operations that pivot data between different representations:

Stacking transforms columns into rows, making a "wider" DataFrame "longer"
Unstacking transforms rows into columns, making a "longer" DataFrame "wider"

These operations are particularly useful when you need to:

Prepare data for specific visualization libraries
Perform calculations that require a particular data structure
Normalize or denormalize your data

Let's dive into how these operations work with practical examples.

The `stack()` Method

The stack() method pivots a DataFrame's columns into index levels, creating a Series or DataFrame with a MultiIndex. In simpler terms, it converts columns into rows.

Basic Syntax

DataFrame.stack(level=-1, dropna=True)

Parameters:

level: The level(s) to stack from the columns
dropna: If True, drop rows that contain missing values

Basic Example

Let's see a simple example of stacking:

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2'],
    'C': ['C0', 'C1', 'C2']
})

print("Original DataFrame:")
print(df)

# Stack the DataFrame
stacked = df.stack()

print("\nStacked DataFrame:")
print(stacked)

Output:

Original DataFrame:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

Stacked DataFrame:
0  A    A0
   B    B0
   C    C0
1  A    A1
   B    B1
   C    C1
2  A    A2
   B    B2
   C    C2
dtype: object

Notice how the column headers (A, B, C) have become part of the index in the stacked result, creating a MultiIndex.

The `unstack()` Method

The unstack() method does the opposite of stack(). It pivots a level of the row index to become the column headers, creating a wider DataFrame.

Basic Syntax

Series.unstack(level=-1, fill_value=None)
DataFrame.unstack(level=-1, fill_value=None)

Parameters:

level: The index level to unstack
fill_value: Value to use for missing values

Basic Example

Let's unstack our previous stacked DataFrame:

# Continuing from the previous example
unstacked = stacked.unstack()

print("Unstacked back to original:")
print(unstacked)

Output:

Unstacked back to original:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

The result is the same as our original DataFrame, demonstrating that stack() and unstack() are inverses of each other.

Working with MultiIndex DataFrames

Stacking and unstacking are particularly useful with MultiIndex (hierarchical) DataFrames.

Example with MultiIndex

# Create a DataFrame with MultiIndex
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), 
                                   ('B', 'one'), ('B', 'two')])
columns = ['x', 'y']

df_multi = pd.DataFrame(np.random.randn(4, 2), index=index, columns=columns)

print("MultiIndex DataFrame:")
print(df_multi)

# Stack the columns
stacked = df_multi.stack()

print("\nStacked MultiIndex DataFrame:")
print(stacked)

# Unstack one level
unstacked = stacked.unstack(level=1)

print("\nUnstacked at level 1:")
print(unstacked)

Output (values will vary due to random generation):

MultiIndex DataFrame:
            x         y
A one -0.173215 -0.861849
  two -2.104569  1.024180
B one -0.509652 -1.039268
  two -0.291694 -0.487393

Stacked MultiIndex DataFrame:
A  one  x   -0.173215
        y   -0.861849
   two  x   -2.104569
        y    1.024180
B  one  x   -0.509652
        y   -1.039268
   two  x   -0.291694
        y   -0.487393
dtype: float64

Unstacked at level 1:
         one       two
A x -0.173215 -2.104569
  y -0.861849  1.024180
B x -0.509652 -0.291694
  y -1.039268 -0.487393

Specifying Levels for Stacking and Unstacking

When working with MultiIndex DataFrames, you can specify which level to stack or unstack:

# Continuing with our MultiIndex example
# Unstack the first level (index 0)
result = stacked.unstack(level=0)

print("Unstacked at level 0:")
print(result)

Output:

Unstacked at level 0:
            A         B
one x -0.173215 -0.509652
    y -0.861849 -1.039268
two x -2.104569 -0.291694
    y  1.024180 -0.487393

Handling Missing Values

When stacking or unstacking, you might encounter missing values. By default:

stack() drops rows with NaN values if dropna=True
unstack() introduces NaN for missing combinations

# Create a DataFrame with some missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

print("DataFrame with missing values:")
print(df_missing)

# Stack with default dropna=True
stacked_drop = df_missing.stack()

print("\nStacked (dropna=True):")
print(stacked_drop)

# Stack with dropna=False
stacked_keep = df_missing.stack(dropna=False)

print("\nStacked (dropna=False):")
print(stacked_keep)

Output:

DataFrame with missing values:
     A    B
0  1.0  4.0
1  2.0  NaN
2  NaN  6.0

Stacked (dropna=True):
0  A    1.0
   B    4.0
1  A    2.0
2  B    6.0
dtype: float64

Stacked (dropna=False):
0  A    1.0
   B    4.0
1  A    2.0
   B    NaN
2  A    NaN
   B    6.0
dtype: float64

Practical Applications

Reshaping Time Series Data

Stacking and unstacking are particularly useful for time series data:

# Create a time series DataFrame
dates = pd.date_range('20230101', periods=3)
df_time = pd.DataFrame({
    'Temperature': [20, 22, 21],
    'Humidity': [45, 48, 50]
}, index=dates)

print("Time Series Data:")
print(df_time)

# Stack to get a long format
stacked_time = df_time.stack()

print("\nLong Format (stacked):")
print(stacked_time)

# You can reset the index for a tabular format
tabular_format = stacked_time.reset_index()
tabular_format.columns = ['Date', 'Measurement', 'Value']

print("\nTabular Long Format:")
print(tabular_format)

Output:

Time Series Data:
            Temperature  Humidity
2023-01-01           20        45
2023-01-02           22        48
2023-01-03           21        50

Long Format (stacked):
2023-01-01  Temperature    20
            Humidity       45
2023-01-02  Temperature    22
            Humidity       48
2023-01-03  Temperature    21
            Humidity       50
dtype: int64

Tabular Long Format:
        Date Measurement  Value
0 2023-01-01 Temperature     20
1 2023-01-01    Humidity     45
2 2023-01-02 Temperature     22
3 2023-01-02    Humidity     48
4 2023-01-03 Temperature     21
5 2023-01-03    Humidity     50

Pivoting Survey Data

Imagine you have survey responses across different years that you want to reshape:

# Survey data
survey_data = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021],
    'Question': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Response': [4.2, 3.8, 4.5, 4.0]
})

print("Survey Data:")
print(survey_data)

# Reshape to have questions as columns and years as rows
pivoted = survey_data.pivot(index='Year', columns='Question', values='Response')

print("\nPivoted Survey Data:")
print(pivoted)

# Stack to return to long format
stacked_back = pivoted.stack()

print("\nStacked Back to Original Format:")
print(stacked_back)

Output:

Survey Data:
   Year Question  Response
0  2020       Q1       4.2
1  2020       Q2       3.8
2  2021       Q1       4.5
3  2021       Q2       4.0

Pivoted Survey Data:
Question    Q1   Q2
Year               
2020       4.2  3.8
2021       4.5  4.0

Stacked Back to Original Format:
Year  Question
2020  Q1         4.2
      Q2         3.8
2021  Q1         4.5
      Q2         4.0
dtype: float64

Summary

Stacking and unstacking are powerful techniques in pandas for reshaping your data:

Stack (stack()) converts columns to rows, making data longer
Unstack (unstack()) converts rows to columns, making data wider
You can specify which levels to stack/unstack in MultiIndex DataFrames
These operations help prepare data for specific analyses or visualizations

Key points to remember:

Stacking/unstacking operations create MultiIndex objects
These functions are useful for transitioning between wide and long data formats
You can control how missing values are handled with parameters like dropna and fill_value

Practice Exercises

Create a DataFrame with quarterly sales data for different products, then use stacking to transform it into a long format.

Using the DataFrame below, stack it and then unstack it at different levels:

data = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
    'Profit': [20, 30, 40, 50]
})

Create a time series with multiple variables, stack it, and then format it for plotting with a visualization library.

Additional Resources

Happy data reshaping!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Stacking Operations​

The stack() Method​

Basic Syntax​

Basic Example​

The unstack() Method​

Basic Syntax​

Basic Example​

Working with MultiIndex DataFrames​

Example with MultiIndex​

Specifying Levels for Stacking and Unstacking​

Handling Missing Values​

Practical Applications​

Reshaping Time Series Data​

Pivoting Survey Data​

Summary​

Practice Exercises​

Additional Resources​

Introduction to Stacking Operations

The `stack()` Method

Basic Syntax

Basic Example

The `unstack()` Method

Basic Syntax

Basic Example

Working with MultiIndex DataFrames

Example with MultiIndex

Specifying Levels for Stacking and Unstacking

Handling Missing Values

Practical Applications

Reshaping Time Series Data

Pivoting Survey Data

Summary

Practice Exercises

Additional Resources