Pandas Reshape Methods

Data rarely comes in the exact format we need for analysis. Pandas provides powerful reshape methods that allow you to transform your data from one format to another. In this tutorial, we'll explore various techniques to reshape your data efficiently.

Introduction to Data Reshaping

Data reshaping involves rearranging data between wide format (where each variable has its own column) and long format (where variables are stored in rows). These operations are essential for:

Data cleaning and preparation
Statistical modeling
Visualization
Feature engineering

Let's dive into the key reshaping methods in Pandas!

Prerequisites

To follow this tutorial, you need:

Basic knowledge of Python
Familiarity with Pandas DataFrames and Series
Pandas installed in your environment

import pandas as pd
import numpy as np

Pivot: From Long to Wide Format

The pivot() method transforms long data into wide format by using unique values from one column as new column headers.

Basic Pivot Example

# Create a sample DataFrame in long format
data = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'city': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'temperature': [32, 65, 30, 68]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
         date        city  temperature
0  2023-01-01    New York           32
1  2023-01-01  Los Angeles          65
2  2023-01-02    New York           30
3  2023-01-02  Los Angeles          68

Now let's pivot this data to create a new DataFrame where cities become columns:

# Pivot the data: days as rows, cities as columns
pivoted = df.pivot(index='date', columns='city', values='temperature')
print("\nPivoted DataFrame:")
print(pivoted)

Output:

Pivoted DataFrame:
city        Los Angeles  New York
date                             
2023-01-01           65        32
2023-01-02           68        30

Pivot with Multiple Values

You can pivot multiple value columns at once by omitting the values parameter:

# Add another measurement column
df['humidity'] = [80, 50, 85, 55]
print("\nDataFrame with multiple measurements:")
print(df)

# Pivot with multiple values
pivoted_multi = df.pivot(index='date', columns='city')
print("\nPivoted with multiple values:")
print(pivoted_multi)

Output:

DataFrame with multiple measurements:
         date        city  temperature  humidity
0  2023-01-01    New York           32        80
1  2023-01-01  Los Angeles          65        50
2  2023-01-02    New York           30        85
3  2023-01-02  Los Angeles          68        55

Pivoted with multiple values:
           temperature           humidity          
city      Los Angeles New York Los Angeles New York
date                                               
2023-01-01          65       32          50       80
2023-01-02          68       30          55       85

Handling Duplicate Entries

If there are duplicate entries in your index and columns, pivot() will raise an error. To handle duplicates, use pivot_table() which can aggregate duplicate values:

# Create data with duplicates
data_dupe = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
    'city': ['New York', 'New York', 'Los Angeles', 'Los Angeles'],
    'temperature': [32, 34, 65, 68]
}

df_dupe = pd.DataFrame(data_dupe)
print("DataFrame with duplicates:")
print(df_dupe)

# Use pivot_table with aggregation
pivoted_table = df_dupe.pivot_table(index='date', columns='city', values='temperature', aggfunc='mean')
print("\nPivoted table with aggregation:")
print(pivoted_table)

Output:

DataFrame with duplicates:
         date        city  temperature
0  2023-01-01    New York           32
1  2023-01-01    New York           34
2  2023-01-01  Los Angeles          65
3  2023-01-02  Los Angeles          68

Pivoted table with aggregation:
city        Los Angeles  New York
date                             
2023-01-01         65.0      33.0
2023-01-02         68.0       NaN

Melt: From Wide to Long Format

The melt() function does the opposite of pivot() - it transforms wide-format data into long format.

Basic Melt Example

# Create a sample wide-format DataFrame
wide_df = pd.DataFrame({
    'name': ['John', 'Mary', 'Bob'],
    'math': [90, 85, 92],
    'science': [88, 95, 79],
    'history': [76, 82, 89]
})
print("Original wide DataFrame:")
print(wide_df)

# Melt the DataFrame
melted = wide_df.melt(id_vars=['name'], var_name='subject', value_name='score')
print("\nMelted DataFrame:")
print(melted)

Output:

Original wide DataFrame:
   name  math  science  history
John    90       88       76
Mary    85       95       82
 Bob    92       79       89

Melted DataFrame:
   name  subject  score
John     math     90
Mary     math     85
 Bob     math     92
John  science     88
Mary  science     95
 Bob  science     79
John  history     76
Mary  history     82
 Bob  history     89

Selective Melting

You can selectively melt only specific columns:

# Melt only math and science columns
melted_selective = wide_df.melt(
    id_vars=['name'], 
    value_vars=['math', 'science'],
    var_name='subject', 
    value_name='score'
)
print("\nSelectively melted DataFrame:")
print(melted_selective)

Output:

Selectively melted DataFrame:
   name subject  score
John    math     90
Mary    math     85
 Bob    math     92
John science     88
Mary science     95
 Bob science     79

Stack and Unstack: Working with MultiIndex

The stack() and unstack() methods reshape data by moving the innermost column index to become the innermost row index (stacking) or vice versa (unstacking).

Stacking Example

# Create a multi-column DataFrame
multi_df = pd.DataFrame({
    ('A', 'one'): [1, 2, 3],
    ('A', 'two'): [4, 5, 6],
    ('B', 'one'): [7, 8, 9],
    ('B', 'two'): [10, 11, 12]
})
multi_df.columns = pd.MultiIndex.from_tuples(multi_df.columns)
print("Multi-column DataFrame:")
print(multi_df)

# Stack the DataFrame
stacked = multi_df.stack()
print("\nStacked DataFrame:")
print(stacked)

Output:

Multi-column DataFrame:
   A      B    
  one two one two
0   1   4   7  10
1   2   5   8  11
2   3   6   9  12

Stacked DataFrame:
     A    B
0 one  1    7
  two  4   10
1 one  2    8
  two  5   11
2 one  3    9
  two  6   12

Unstacking Example

You can also unstack a stacked DataFrame:

# Unstack the first level
unstacked = stacked.unstack(0)
print("\nUnstacked DataFrame (level 0):")
print(unstacked)

# Unstack the first level from original stacked
unstacked_level1 = stacked.unstack(1)
print("\nUnstacked DataFrame (level 1):")
print(unstacked_level1)

Output:

Unstacked DataFrame (level 0):
       0  1  2
A one  1  2  3
  two  4  5  6
B one  7  8  9
  two 10 11 12

Unstacked DataFrame (level 1):
   one  two
0 A   1    4
  B   7   10
1 A   2    5
  B   8   11
2 A   3    6
  B   9   12

Real-World Application: Data Analysis for Monthly Sales

Let's explore a practical example using reshape methods for sales data analysis:

# Create a sample sales dataset
data = {
    'date': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-15', '2023-03-05', '2023-03-20'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'sales': [200, 150, 300, 250, 220, 180]
}

sales_df = pd.DataFrame(data)
# Convert date to datetime
sales_df['date'] = pd.to_datetime(sales_df['date'])
# Extract month for easier aggregation
sales_df['month'] = sales_df['date'].dt.strftime('%Y-%m')

print("Original sales data:")
print(sales_df)

# Reshape the data to analyze monthly sales by product
monthly_product_sales = sales_df.pivot_table(
    index='month', 
    columns='product', 
    values='sales', 
    aggfunc='sum'
)

print("\nMonthly sales by product:")
print(monthly_product_sales)

# Reshape to analyze regional performance by product
regional_product_sales = sales_df.pivot_table(
    index='region', 
    columns='product', 
    values='sales', 
    aggfunc='sum'
)

print("\nRegional sales by product:")
print(regional_product_sales)

# Convert wide format to long format for visualization
monthly_sales_long = monthly_product_sales.reset_index().melt(
    id_vars=['month'], 
    var_name='product', 
    value_name='total_sales'
)

print("\nMonthly sales in long format (ready for visualization):")
print(monthly_sales_long)

Output:

Original sales data:
        date product region  sales    month
0 2023-01-15       A   East    200  2023-01
1 2023-01-20       B   East    150  2023-01
2 2023-02-10       A   West    300  2023-02
3 2023-02-15       B   West    250  2023-02
4 2023-03-05       A   East    220  2023-03
5 2023-03-20       B   West    180  2023-03

Monthly sales by product:
product    A    B
month            
2023-01  200  150
2023-02  300  250
2023-03  220  180

Regional sales by product:
product    A    B
region          
East    420  150
West    300  430

Monthly sales in long format (ready for visualization):
     month product  total_sales
0  2023-01       A          200
1  2023-02       A          300
2  2023-03       A          220
3  2023-01       B          150
4  2023-02       B          250
5  2023-03       B          180

Summary

In this tutorial, we explored Pandas' powerful methods for reshaping data:

Pivot: Transforms data from long format to wide format
Pivot_table: Similar to pivot but handles duplicate entries through aggregation
Melt: Transforms data from wide format to long format
Stack/Unstack: Reshapes data by manipulating levels in MultiIndex DataFrames

Choosing the right reshape method depends on your data structure and analysis needs. These techniques are essential for data cleaning, transformation, and preparation for visualization or modeling.

Exercises

To solidify your understanding, try these exercises:

Create a DataFrame with daily weather data (date, city, temperature, humidity) and convert it to a wide format with cities as columns.
Start with a DataFrame of student grades by subject (student, math, science, history) and convert it to long format.
Create a MultiIndex DataFrame with sales data (year, quarter, product, revenue) and use stack/unstack to explore different views of the data.
Take a real dataset of your choice and identify a reshaping operation that would make analysis easier.

Additional Resources

Happy data reshaping!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Data Reshaping​

Prerequisites​

Pivot: From Long to Wide Format​

Basic Pivot Example​

Pivot with Multiple Values​

Handling Duplicate Entries​

Melt: From Wide to Long Format​

Basic Melt Example​

Selective Melting​

Stack and Unstack: Working with MultiIndex​

Stacking Example​

Unstacking Example​

Real-World Application: Data Analysis for Monthly Sales​

Summary​

Exercises​

Additional Resources​

Introduction to Data Reshaping

Prerequisites

Pivot: From Long to Wide Format

Basic Pivot Example

Pivot with Multiple Values

Handling Duplicate Entries

Melt: From Wide to Long Format

Basic Melt Example

Selective Melting

Stack and Unstack: Working with MultiIndex

Stacking Example

Unstacking Example

Real-World Application: Data Analysis for Monthly Sales

Summary

Exercises

Additional Resources