Skip to main content

Pandas Unstacking

Unstacking is a powerful reshaping operation in Pandas that transforms data from a "long" format to a "wide" format. It's particularly useful when dealing with hierarchical indices (MultiIndex) and helps in creating more readable tabular views of your data.

What is Unstacking?

Unstacking is the process of pivoting a level of the row index to the column axis. Think of it as taking a specific index level and spreading its unique values across the columns. This operation converts data from a "stacked" format (where values are stacked vertically) to an "unstacked" format (where values are spread horizontally).

Basic Syntax

The basic syntax for unstacking a DataFrame or Series is:

python
df.unstack(level=-1)

Where:

  • level specifies which index level to unstack (default is the innermost level)

Understanding Unstacking with Examples

Let's start with some simple examples to understand how unstacking works.

Example 1: Basic Unstacking

First, let's create a multi-index Series:

python
import pandas as pd
import numpy as np

# Create a multi-index Series
index = pd.MultiIndex.from_tuples([
('A', 'x'), ('A', 'y'), ('A', 'z'),
('B', 'x'), ('B', 'y'), ('B', 'z')
], names=['first', 'second'])

s = pd.Series(np.random.randn(6), index=index)
print("Original Series:")
print(s)

Output:

Original Series:
first second
A x 0.469112
y -0.282863
z -1.509059
B x -1.135632
y 1.212112
z 0.119209
dtype: float64

Now, let's unstack this Series:

python
# Unstack the innermost level (level=-1 or level='second')
unstacked = s.unstack()
print("\nUnstacked Series (default level=-1):")
print(unstacked)

Output:

Unstacked Series (default level=-1):
second x y z
first
A 0.469112 -0.282863 -1.509059
B -1.135632 1.212112 0.119209

Notice how the second level of the index is now represented as columns, and each unique value in that level (x, y, and z) has become a column.

Example 2: Specifying the Level to Unstack

We can also specify which level to unstack:

python
# Create another multi-index Series
index = pd.MultiIndex.from_tuples([
('A', 'one', 'x'), ('A', 'one', 'y'),
('A', 'two', 'x'), ('A', 'two', 'y'),
('B', 'one', 'x'), ('B', 'one', 'y'),
('B', 'two', 'x'), ('B', 'two', 'y')
], names=['first', 'second', 'third'])

s2 = pd.Series(range(8), index=index)
print("Original Series with 3 levels:")
print(s2)

# Unstack the second level
unstacked_level1 = s2.unstack(level='second')
print("\nUnstacked at level 'second':")
print(unstacked_level1)

# Unstack the third level
unstacked_level2 = s2.unstack(level='third')
print("\nUnstacked at level 'third':")
print(unstacked_level2)

Output:

Original Series with 3 levels:
first second third
A one x 0
y 1
two x 2
y 3
B one x 4
y 5
two x 6
y 7
dtype: int64

Unstacked at level 'second':
third x y
first second
A one 0 1
two 2 3
B one 4 5
two 6 7

Unstacked at level 'third':
third x y
first second
A one 0 1
two 2 3
B one 4 5
two 6 7

Example 3: Unstacking DataFrames

Unstacking also works with DataFrames:

python
# Create a DataFrame with MultiIndex
index = pd.MultiIndex.from_product([
['A', 'B'], ['one', 'two']], names=['first', 'second'])
columns = pd.MultiIndex.from_product([
['a', 'b'], [1, 2]], names=['level1', 'level2'])

df = pd.DataFrame(np.random.randn(4, 4), index=index, columns=columns)
print("Original DataFrame:")
print(df)

# Unstack the second level of the index
unstacked = df.unstack('second')
print("\nUnstacked DataFrame (level='second'):")
print(unstacked)

Output:

Original DataFrame:
level1 a b
level2 1 2 1 2
first second
A one -0.173215 0.119209 -1.044236 0.861987
two -0.494929 1.071804 0.721555 -0.706771
B one -1.039575 0.271860 -1.087401 -0.172428
two -0.706893 -0.023514 0.567020 0.276232

Unstacked DataFrame (level='second'):
level1 a b
level2 1 2 1 2
second one two one two one two one two
first
A -0.173215 -0.494929 0.119209 1.071804 -1.044236 0.721555 0.861987 -0.706771
B -1.039575 -0.706893 0.271860 -0.023514 -1.087401 0.567020 -0.172428 0.276232

Handling Missing Data in Unstacked Results

Sometimes, when you unstack data, there might be missing combinations in your original data, which leads to NaN values in the result:

python
# Create a Series with some missing combinations
index = pd.MultiIndex.from_tuples([
('A', 'x'), ('A', 'y'),
('B', 'x'), # Note: ('B', 'y') is missing
], names=['first', 'second'])

s_missing = pd.Series([1, 2, 3], index=index)
print("Series with missing combination:")
print(s_missing)

# Unstack
unstacked_missing = s_missing.unstack()
print("\nUnstacked result with NaN:")
print(unstacked_missing)

# Fill NaN values
filled = s_missing.unstack().fillna(0)
print("\nUnstacked result with NaNs filled:")
print(filled)

Output:

Series with missing combination:
first second
A x 1
y 2
B x 3
dtype: int64

Unstacked result with NaN:
second x y
first
A 1.0 2.0
B 3.0 NaN

Unstacked result with NaNs filled:
second x y
first
A 1.0 2.0
B 3.0 0.0

Practical Applications

Example 1: Analyzing Sales Data

Let's say we have sales data with dimensions like region, product, and time:

python
# Create sample sales data
sales_data = pd.DataFrame({
'Region': ['North', 'North', 'North', 'South', 'South', 'South'] * 2,
'Quarter': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3'] * 2,
'Product': ['Widgets'] * 6 + ['Gadgets'] * 6,
'Sales': [100, 110, 105, 95, 105, 115, 80, 85, 90, 70, 75, 80]
})

# Set multi-index for the DataFrame
sales_indexed = sales_data.set_index(['Product', 'Region', 'Quarter'])
print("Sales data with hierarchical index:")
print(sales_indexed)

# Unstack to create a more readable view by product and region
product_region_sales = sales_indexed.unstack('Quarter')
print("\nSales by Product and Region across Quarters:")
print(product_region_sales)

Output:

Sales data with hierarchical index:
Sales
Product Region Quarter
Widgets North Q1 100
Q2 110
Q3 105
South Q1 95
Q2 105
Q3 115
Gadgets North Q1 80
Q2 85
Q3 90
South Q1 70
Q2 75
Q3 80

Sales by Product and Region across Quarters:
Sales
Quarter Q1 Q2 Q3
Product Region
Widgets North 100 110 105
South 95 105 115
Gadgets North 80 85 90
South 70 75 80

This makes it much easier to compare quarterly sales for each product and region.

Example 2: Pivot Table-like Functionality

Unstacking can be combined with groupby to create pivot table-like summaries:

python
# Create sample data
data = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', periods=10),
'Category': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
'Region': ['East', 'East', 'West', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
'Sales': [100, 150, 200, 50, 300, 250, 175, 225, 125, 75]
})

# Group by Category and Region, and sum the Sales
grouped = data.groupby(['Category', 'Region'])['Sales'].sum()
print("Grouped data:")
print(grouped)

# Unstack by Region to get a Category x Region view
category_region_sales = grouped.unstack('Region')
print("\nCategory by Region view (unstacked):")
print(category_region_sales)

Output:

Grouped data:
Category Region
A East 400
West 675
B East 275
West 125
Name: Sales, dtype: int64

Category by Region view (unstacked):
Region East West
Category
A 400 675
B 275 125

Common Issues and Solutions

Issue 1: "Index contains duplicate entries" Error

python
# Create data with duplicate index entries
duplicated_index = pd.DataFrame({
'group': ['A', 'A', 'A', 'B', 'B'],
'subgroup': ['X', 'Y', 'X', 'X', 'Y'], # Note: ('A', 'X') appears twice
'value': [1, 2, 3, 4, 5]
}).set_index(['group', 'subgroup'])

print("DataFrame with duplicated index:")
print(duplicated_index)

# This will raise an error
try:
duplicated_index.unstack()
except Exception as e:
print(f"\nError when unstacking: {e}")

# Solution: Aggregate the duplicates first
aggregated = duplicated_index.groupby(level=[0, 1]).sum()
print("\nAfter aggregating duplicates:")
print(aggregated)

# Now unstack works
print("\nUnstacked result:")
print(aggregated.unstack())

Output:

DataFrame with duplicated index:
value
group subgroup
A X 1
Y 2
X 3
B X 4
Y 5

Error when unstacking: Index contains duplicate entries, cannot reshape

After aggregating duplicates:
value
group subgroup
A X 4
Y 2
B X 4
Y 5

Unstacked result:
value
subgroup X Y
group
A 4 2
B 4 5

Issue 2: Stack and Unstack for Bidirectional Reshaping

Sometimes you need to go back and forth between stacked and unstacked forms:

python
# Start with a wide-format dataframe
wide_df = pd.DataFrame({
'State': ['California', 'Texas', 'Florida', 'New York'],
'2020': [39.5, 29.0, 21.5, 19.4],
'2021': [39.2, 29.5, 21.9, 19.3],
'2022': [39.0, 30.0, 22.2, 19.1]
}).set_index('State')

print("Wide format data (Population in millions):")
print(wide_df)

# Stack to convert to long format
long_df = wide_df.stack().reset_index()
long_df.columns = ['State', 'Year', 'Population']
print("\nLong format after stacking:")
print(long_df)

# Now unstack to go back to wide format
wide_again = long_df.set_index(['State', 'Year']).unstack('Year')
print("\nBack to wide format:")
print(wide_again)

Output:

Wide format data (Population in millions):
2020 2021 2022
State
California 39.5 39.2 39.0
Texas 29.0 29.5 30.0
Florida 21.5 21.9 22.2
New York 19.4 19.3 19.1

Long format after stacking:
State Year Population
0 California 2020 39.5
1 California 2021 39.2
2 California 2022 39.0
3 Texas 2020 29.0
4 Texas 2021 29.5
5 Texas 2022 30.0
6 Florida 2020 21.5
7 Florida 2021 21.9
8 Florida 2022 22.2
9 New York 2020 19.4
10 New York 2021 19.3
11 New York 2022 19.1

Back to wide format:
Population
Year 2020 2021 2022
State
California 39.5 39.2 39.0
Florida 21.5 21.9 22.2
New York 19.4 19.3 19.1
Texas 29.0 29.5 30.0

Summary

Unstacking is a key operation in Pandas' data reshaping toolkit that helps transform hierarchical data from long to wide format. Here's what we've covered:

  • The basic concept of unstacking and how it moves data from rows to columns
  • How to unstack different levels of a multi-index
  • Handling missing values when unstacking
  • Practical applications for analyzing and visualizing data
  • Common issues and their solutions

Mastering unstacking allows you to reshape your data in ways that make analysis and visualization more intuitive and effective.

Additional Resources and Exercises

Resources:

Exercises:

  1. Create a DataFrame with sales data by region, product, and month. Use unstacking to create a table that shows products as rows and months as columns for each region.

  2. Start with the following dataset:

    python
    data = pd.DataFrame({
    'Country': ['USA', 'USA', 'Canada', 'Canada', 'USA', 'Canada'],
    'City': ['New York', 'Los Angeles', 'Toronto', 'Vancouver', 'Chicago', 'Montreal'],
    'Category': ['Food', 'Electronics', 'Food', 'Electronics', 'Food', 'Electronics'],
    'Value': [100, 200, 150, 250, 120, 180]
    })

    Create a multi-index and unstack the data to compare values across cities within each country, broken down by category.

  3. Create a time series with hierarchical indices and practice reshaping it to find trends across different levels.

By practicing these exercises, you'll become more comfortable with unstacking and learn how to apply it effectively in your data analysis projects.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)