Pandas Unstacking

Unstacking is a powerful reshaping operation in Pandas that transforms data from a "long" format to a "wide" format. It's particularly useful when dealing with hierarchical indices (MultiIndex) and helps in creating more readable tabular views of your data.

What is Unstacking?

Unstacking is the process of pivoting a level of the row index to the column axis. Think of it as taking a specific index level and spreading its unique values across the columns. This operation converts data from a "stacked" format (where values are stacked vertically) to an "unstacked" format (where values are spread horizontally).

Basic Syntax

The basic syntax for unstacking a DataFrame or Series is:

df.unstack(level=-1)

Where:

level specifies which index level to unstack (default is the innermost level)

Understanding Unstacking with Examples

Let's start with some simple examples to understand how unstacking works.

Example 1: Basic Unstacking

First, let's create a multi-index Series:

import pandas as pd
import numpy as np

# Create a multi-index Series
index = pd.MultiIndex.from_tuples([
    ('A', 'x'), ('A', 'y'), ('A', 'z'),
    ('B', 'x'), ('B', 'y'), ('B', 'z')
], names=['first', 'second'])

s = pd.Series(np.random.randn(6), index=index)
print("Original Series:")
print(s)

Output:

Original Series:
first  second
A      x         0.469112
       y        -0.282863
       z        -1.509059
B      x        -1.135632
       y         1.212112
       z         0.119209
dtype: float64

Now, let's unstack this Series:

# Unstack the innermost level (level=-1 or level='second')
unstacked = s.unstack()
print("\nUnstacked Series (default level=-1):")
print(unstacked)

Output:

Unstacked Series (default level=-1):
second         x         y         z
first                              
A       0.469112 -0.282863 -1.509059
B      -1.135632  1.212112  0.119209

Notice how the second level of the index is now represented as columns, and each unique value in that level (x, y, and z) has become a column.

Example 2: Specifying the Level to Unstack

We can also specify which level to unstack:

# Create another multi-index Series
index = pd.MultiIndex.from_tuples([
    ('A', 'one', 'x'), ('A', 'one', 'y'), 
    ('A', 'two', 'x'), ('A', 'two', 'y'),
    ('B', 'one', 'x'), ('B', 'one', 'y'), 
    ('B', 'two', 'x'), ('B', 'two', 'y')
], names=['first', 'second', 'third'])

s2 = pd.Series(range(8), index=index)
print("Original Series with 3 levels:")
print(s2)

# Unstack the second level
unstacked_level1 = s2.unstack(level='second')
print("\nUnstacked at level 'second':")
print(unstacked_level1)

# Unstack the third level
unstacked_level2 = s2.unstack(level='third')
print("\nUnstacked at level 'third':")
print(unstacked_level2)

Output:

Original Series with 3 levels:
first  second  third
A      one     x        0
               y        1
       two     x        2
               y        3
B      one     x        4
               y        5
       two     x        6
               y        7
dtype: int64

Unstacked at level 'second':
third     x    y
first second      
A     one     0  1
      two     2  3
B     one     4  5
      two     6  7

Unstacked at level 'third':
third       x  y
first second    
A     one     0  1
      two     2  3
B     one     4  5
      two     6  7

Example 3: Unstacking DataFrames

Unstacking also works with DataFrames:

# Create a DataFrame with MultiIndex
index = pd.MultiIndex.from_product([
    ['A', 'B'], ['one', 'two']], names=['first', 'second'])
columns = pd.MultiIndex.from_product([
    ['a', 'b'], [1, 2]], names=['level1', 'level2'])

df = pd.DataFrame(np.random.randn(4, 4), index=index, columns=columns)
print("Original DataFrame:")
print(df)

# Unstack the second level of the index
unstacked = df.unstack('second')
print("\nUnstacked DataFrame (level='second'):")
print(unstacked)

Output:

Original DataFrame:
level1           a                   b          
level2           1         2         1         2
first second                                    
A     one    -0.173215  0.119209 -1.044236  0.861987
      two    -0.494929  1.071804  0.721555 -0.706771
B     one    -1.039575  0.271860 -1.087401 -0.172428
      two    -0.706893 -0.023514  0.567020  0.276232

Unstacked DataFrame (level='second'):
level1           a                             b                        
level2           1                 2           1                 2      
second         one       two      one       two      one       two      one       two
first                                                                             
A      -0.173215 -0.494929  0.119209  1.071804 -1.044236  0.721555  0.861987 -0.706771
B      -1.039575 -0.706893  0.271860 -0.023514 -1.087401  0.567020 -0.172428  0.276232

Handling Missing Data in Unstacked Results

Sometimes, when you unstack data, there might be missing combinations in your original data, which leads to NaN values in the result:

# Create a Series with some missing combinations
index = pd.MultiIndex.from_tuples([
    ('A', 'x'), ('A', 'y'), 
    ('B', 'x'),  # Note: ('B', 'y') is missing
], names=['first', 'second'])

s_missing = pd.Series([1, 2, 3], index=index)
print("Series with missing combination:")
print(s_missing)

# Unstack
unstacked_missing = s_missing.unstack()
print("\nUnstacked result with NaN:")
print(unstacked_missing)

# Fill NaN values
filled = s_missing.unstack().fillna(0)
print("\nUnstacked result with NaNs filled:")
print(filled)

Output:

Series with missing combination:
first  second
A      x         1
       y         2
B      x         3
dtype: int64

Unstacked result with NaN:
second    x    y
first            
A       1.0  2.0
B       3.0  NaN

Unstacked result with NaNs filled:
second    x    y
first            
A       1.0  2.0
B       3.0  0.0

Practical Applications

Example 1: Analyzing Sales Data

Let's say we have sales data with dimensions like region, product, and time:

# Create sample sales data
sales_data = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South', 'South'] * 2,
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3'] * 2,
    'Product': ['Widgets'] * 6 + ['Gadgets'] * 6,
    'Sales': [100, 110, 105, 95, 105, 115, 80, 85, 90, 70, 75, 80]
})

# Set multi-index for the DataFrame
sales_indexed = sales_data.set_index(['Product', 'Region', 'Quarter'])
print("Sales data with hierarchical index:")
print(sales_indexed)

# Unstack to create a more readable view by product and region
product_region_sales = sales_indexed.unstack('Quarter')
print("\nSales by Product and Region across Quarters:")
print(product_region_sales)

Output:

Sales data with hierarchical index:
                         Sales
Product Region Quarter        
Widgets North  Q1         100
              Q2         110
              Q3         105
       South  Q1          95
              Q2         105
              Q3         115
Gadgets North  Q1          80
              Q2          85
              Q3          90
       South  Q1          70
              Q2          75
              Q3          80

Sales by Product and Region across Quarters:
                Sales           
Quarter           Q1    Q2    Q3
Product Region                  
Widgets North     100   110   105
       South       95   105   115
Gadgets North      80    85    90
       South       70    75    80

This makes it much easier to compare quarterly sales for each product and region.

Example 2: Pivot Table-like Functionality

Unstacking can be combined with groupby to create pivot table-like summaries:

# Create sample data
data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Category': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
    'Region': ['East', 'East', 'West', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 150, 200, 50, 300, 250, 175, 225, 125, 75]
})

# Group by Category and Region, and sum the Sales
grouped = data.groupby(['Category', 'Region'])['Sales'].sum()
print("Grouped data:")
print(grouped)

# Unstack by Region to get a Category x Region view
category_region_sales = grouped.unstack('Region')
print("\nCategory by Region view (unstacked):")
print(category_region_sales)

Output:

Grouped data:
Category  Region
A         East      400
          West      675
B         East      275
          West      125
Name: Sales, dtype: int64

Category by Region view (unstacked):
Region    East  West
Category            
A          400   675
B          275   125

Common Issues and Solutions

Issue 1: "Index contains duplicate entries" Error

# Create data with duplicate index entries
duplicated_index = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B'],
    'subgroup': ['X', 'Y', 'X', 'X', 'Y'],  # Note: ('A', 'X') appears twice
    'value': [1, 2, 3, 4, 5]
}).set_index(['group', 'subgroup'])

print("DataFrame with duplicated index:")
print(duplicated_index)

# This will raise an error
try:
    duplicated_index.unstack()
except Exception as e:
    print(f"\nError when unstacking: {e}")

# Solution: Aggregate the duplicates first
aggregated = duplicated_index.groupby(level=[0, 1]).sum()
print("\nAfter aggregating duplicates:")
print(aggregated)

# Now unstack works
print("\nUnstacked result:")
print(aggregated.unstack())

Output:

DataFrame with duplicated index:
              value
group subgroup      
A     X            1
      Y            2
      X            3
B     X            4
      Y            5

Error when unstacking: Index contains duplicate entries, cannot reshape

After aggregating duplicates:
              value
group subgroup      
A     X            4
      Y            2
B     X            4
      Y            5

Unstacked result:
       value      
subgroup     X     Y
group               
A            4     2
B            4     5

Issue 2: Stack and Unstack for Bidirectional Reshaping

Sometimes you need to go back and forth between stacked and unstacked forms:

# Start with a wide-format dataframe
wide_df = pd.DataFrame({
    'State': ['California', 'Texas', 'Florida', 'New York'],
    '2020': [39.5, 29.0, 21.5, 19.4],
    '2021': [39.2, 29.5, 21.9, 19.3],
    '2022': [39.0, 30.0, 22.2, 19.1]
}).set_index('State')

print("Wide format data (Population in millions):")
print(wide_df)

# Stack to convert to long format
long_df = wide_df.stack().reset_index()
long_df.columns = ['State', 'Year', 'Population']
print("\nLong format after stacking:")
print(long_df)

# Now unstack to go back to wide format
wide_again = long_df.set_index(['State', 'Year']).unstack('Year')
print("\nBack to wide format:")
print(wide_again)

Output:

Wide format data (Population in millions):
           2020   2021   2022
State                        
California  39.5   39.2   39.0
Texas       29.0   29.5   30.0
Florida     21.5   21.9   22.2
New York    19.4   19.3   19.1

Long format after stacking:
        State  Year  Population
0  California  2020        39.5
1  California  2021        39.2
2  California  2022        39.0
3       Texas  2020        29.0
4       Texas  2021        29.5
5       Texas  2022        30.0
6     Florida  2020        21.5
7     Florida  2021        21.9
8     Florida  2022        22.2
9    New York  2020        19.4
10   New York  2021        19.3
11   New York  2022        19.1

Back to wide format:
        Population           
Year          2020   2021   2022
State                          
California    39.5   39.2   39.0
Florida       21.5   21.9   22.2
New York      19.4   19.3   19.1
Texas         29.0   29.5   30.0

Summary

Unstacking is a key operation in Pandas' data reshaping toolkit that helps transform hierarchical data from long to wide format. Here's what we've covered:

The basic concept of unstacking and how it moves data from rows to columns
How to unstack different levels of a multi-index
Handling missing values when unstacking
Practical applications for analyzing and visualizing data
Common issues and their solutions

Mastering unstacking allows you to reshape your data in ways that make analysis and visualization more intuitive and effective.

Additional Resources and Exercises

Resources:

Exercises:

Create a DataFrame with sales data by region, product, and month. Use unstacking to create a table that shows products as rows and months as columns for each region.

Start with the following dataset:

data = pd.DataFrame({
    'Country': ['USA', 'USA', 'Canada', 'Canada', 'USA', 'Canada'],
    'City': ['New York', 'Los Angeles', 'Toronto', 'Vancouver', 'Chicago', 'Montreal'],
    'Category': ['Food', 'Electronics', 'Food', 'Electronics', 'Food', 'Electronics'],
    'Value': [100, 200, 150, 250, 120, 180]
})

Create a multi-index and unstack the data to compare values across cities within each country, broken down by category.

Create a time series with hierarchical indices and practice reshaping it to find trends across different levels.

By practicing these exercises, you'll become more comfortable with unstacking and learn how to apply it effectively in your data analysis projects.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

What is Unstacking?​

Basic Syntax​

Understanding Unstacking with Examples​

Example 1: Basic Unstacking​

Example 2: Specifying the Level to Unstack​

Example 3: Unstacking DataFrames​

Handling Missing Data in Unstacked Results​

Practical Applications​

Example 1: Analyzing Sales Data​

Example 2: Pivot Table-like Functionality​

Common Issues and Solutions​

Issue 1: "Index contains duplicate entries" Error​

Issue 2: Stack and Unstack for Bidirectional Reshaping​

Summary​

Additional Resources and Exercises​

Resources:​

Exercises:​

What is Unstacking?

Basic Syntax

Understanding Unstacking with Examples

Example 1: Basic Unstacking

Example 2: Specifying the Level to Unstack

Example 3: Unstacking DataFrames

Handling Missing Data in Unstacked Results

Practical Applications

Example 1: Analyzing Sales Data

Example 2: Pivot Table-like Functionality

Common Issues and Solutions

Issue 1: "Index contains duplicate entries" Error

Issue 2: Stack and Unstack for Bidirectional Reshaping

Summary

Additional Resources and Exercises

Resources:

Exercises: