Pandas Long to Wide

Introduction

When working with data in Python using Pandas, you'll often encounter datasets in different formats. Two common formats are "long" (or "tidy") format and "wide" format. Converting data from long to wide format is a frequent operation in data preprocessing and analysis.

In this tutorial, we'll explore:

What long and wide formats are
Why you might need to convert between them
Methods to transform data from long to wide format using Pandas
Practical examples and use cases

Understanding Long and Wide Data Formats

Long (Tidy) Format

Each variable forms a column
Each observation forms a row
Each value has its own cell
Often more normalized and flexible for analysis

Wide Format

Variables are spread across multiple columns
Each row represents an entity with multiple measurements
More compact and sometimes easier to read visually
Often preferred for reporting and certain visualizations

Methods to Convert Long to Wide Format in Pandas

Pandas provides several powerful functions to reshape data from long to wide format:

pivot()
pivot_table()
unstack()

Let's explore each with examples.

Using `pivot()`

The pivot() method is the most straightforward way to reshape data when you have perfectly clean data with unique identifiers.

Basic Syntax

df.pivot(index='index_column', columns='column_to_become_headers', values='values_column')

Example: Student Scores by Subject

import pandas as pd
import numpy as np

# Create a sample DataFrame in long format
data = {
    'student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
    'subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English'],
    'score': [85, 92, 78, 91, 87, 85, 76, 88, 91]
}

long_df = pd.DataFrame(data)
print("Long Format DataFrame:")
print(long_df)

Output:

Long Format DataFrame:
    student  subject  score
   Alice     Math     85
   Alice  Science     92
   Alice  English     78
     Bob     Math     91
     Bob  Science     87
     Bob  English     85
 Charlie     Math     76
 Charlie  Science     88
 Charlie  English     91

Now let's convert this to wide format using pivot():

# Convert to wide format
wide_df = long_df.pivot(index='student', columns='subject', values='score')
print("\nWide Format DataFrame:")
print(wide_df)

Output:

Wide Format DataFrame:
subject   English  Math  Science
student                         
Alice          78    85       92
Bob            85    91       87
Charlie        91    76       88

Notice how the subjects became column headers, with each student as a row. This makes it easy to compare scores across subjects for each student.

Using `pivot_table()`

When your data contains duplicates that would create conflicts in a simple pivot operation, pivot_table() comes to the rescue by allowing you to specify an aggregation function.

Basic Syntax

df.pivot_table(index='index_column', columns='column_to_become_headers', 
               values='values_column', aggfunc='aggregation_function')

Example: Average Sales by Region and Product

# Create a sample DataFrame with duplicate combinations
sales_data = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
    'sales': [100, 120, 90, 95, 110, 85]
}

sales_df = pd.DataFrame(sales_data)
print("Sales Data (Long Format):")
print(sales_df)

# Convert to wide format using pivot_table
wide_sales = sales_df.pivot_table(
    index='region', 
    columns='product', 
    values='sales',
    aggfunc='mean'  # Calculate the mean for any duplicate combinations
)

print("\nAverage Sales by Region and Product (Wide Format):")
print(wide_sales)

Output:

Sales Data (Long Format):
         date region product  sales
0  2023-01-01  North  Widget    100
1  2023-01-01  North  Gadget    120
2  2023-01-02  South  Widget     90
3  2023-01-02  South  Gadget     95
4  2023-01-03  North  Widget    110
5  2023-01-03  South  Widget     85

Average Sales by Region and Product (Wide Format):
product   Gadget  Widget
region                  
North     120.0   105.0
South      95.0    87.5

You can use different aggregation functions with pivot_table():

'mean': Average (default)
'sum': Total
'count': Count of values
'min': Minimum value
'max': Maximum value
Or provide a custom function

Using `unstack()`

The unstack() method works on MultiIndex DataFrames and is particularly useful when working with grouped data.

Basic Syntax

# After setting a MultiIndex
df.unstack(level='level_name')

Example: Temperature Readings by City and Month

# Create a sample DataFrame
temp_data = {
    'city': ['New York', 'New York', 'New York', 'Chicago', 'Chicago', 'Chicago'],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'temperature': [32, 35, 45, 21, 25, 38]
}

temp_df = pd.DataFrame(temp_data)
print("Temperature Data (Long Format):")
print(temp_df)

# Set MultiIndex first
indexed_temp = temp_df.set_index(['city', 'month'])['temperature']
print("\nIndexed Temperature Data:")
print(indexed_temp)

# Use unstack to convert to wide format
wide_temp = indexed_temp.unstack(level='month')
print("\nTemperature Data (Wide Format):")
print(wide_temp)

Output:

Temperature Data (Long Format):
       city month  temperature
0  New York   Jan           32
1  New York   Feb           35
2  New York   Mar           45
3   Chicago   Jan           21
4   Chicago   Feb           25
5   Chicago   Mar           38

Indexed Temperature Data:
city      month
New York  Jan      32
          Feb      35
          Mar      45
Chicago   Jan      21
          Feb      25
          Mar      38
Name: temperature, dtype: int64

Temperature Data (Wide Format):
month      Jan  Feb  Mar
city                    
Chicago     21   25   38
New York    32   35   45

Practical Application: Time Series Analysis

A common use case for reshaping data is time series analysis, where you might want to compare metrics over time.

Example: Daily Online Store Metrics

# Create sample time series data
dates = pd.date_range('2023-01-01', periods=9, freq='D')
metrics = ['visitors', 'sales', 'conversion_rate'] * 3
values = [120, 15, 0.125, 145, 20, 0.138, 135, 18, 0.133,
          150, 25, 0.167, 180, 30, 0.167, 190, 32, 0.168,
          165, 28, 0.17, 140, 22, 0.157, 130, 19, 0.146]

ts_data = pd.DataFrame({
    'date': dates.repeat(3),
    'metric': metrics,
    'value': values
})

print("Time Series Data (Long Format):")
print(ts_data.head(9))

# Convert to wide format for analysis
ts_wide = ts_data.pivot(index='date', columns='metric', values='value')

print("\nTime Series Data (Wide Format):")
print(ts_wide.head())

# Now we can easily calculate correlations
print("\nCorrelation between metrics:")
print(ts_wide.corr())

# Or create a visualization (you would need matplotlib for this)

Output:

Time Series Data (Long Format):
        date           metric   value
0 2023-01-01        visitors  120.00
1 2023-01-01           sales   15.00
2 2023-01-01  conversion_rate    0.13
3 2023-01-02        visitors  145.00
4 2023-01-02           sales   20.00
5 2023-01-02  conversion_rate    0.14
6 2023-01-03        visitors  135.00
7 2023-01-03           sales   18.00
8 2023-01-03  conversion_rate    0.13

Time Series Data (Wide Format):
metric      conversion_rate  sales  visitors
date                                        
2023-01-01           0.125   15.0     120.0
2023-01-02           0.138   20.0     145.0
2023-01-03           0.133   18.0     135.0
2023-01-04           0.167   25.0     150.0
2023-01-05           0.167   30.0     180.0

Correlation between metrics:
metric           conversion_rate     sales  visitors
metric                                             
conversion_rate         1.000000  0.950708  0.888509
sales                   0.950708  1.000000  0.974943
visitors                0.888509  0.974943  1.000000

Handling Missing Values

When reshaping data, you may encounter missing values. Here's how to handle them:

# Create a sample DataFrame with missing values
missing_data = {
    'store': ['A', 'A', 'A', 'B', 'B', 'C'],
    'product': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple'],
    'sales': [100, 120, 90, 80, 70, 110]
}

missing_df = pd.DataFrame(missing_data)
print("Original Data with Missing Combinations:")
print(missing_df)

# Convert to wide format with fill_value
wide_missing = missing_df.pivot_table(
    index='store', 
    columns='product', 
    values='sales',
    fill_value=0  # Replace NaN with zeros
)

print("\nWide Format with Filled Values:")
print(wide_missing)

Output:

Original Data with Missing Combinations:
  store product  sales
0     A   apple    100
1     A  banana    120
2     A  orange     90
3     B   apple     80
4     B  banana     70
5     C   apple    110

Wide Format with Filled Values:
product  apple  banana  orange
store                         
A         100     120      90
B          80      70       0
C         110       0       0

When to Use Long vs. Wide Format

Use Long Format When:

Performing statistical analysis with tools that expect tidy data
Working with variable numbers of categories
You need to filter, group, or aggregate data easily

Use Wide Format When:

Creating certain types of visualizations (e.g., heatmaps)
Building reports or dashboards for human readability
Performing calculations across related columns
Working with time series data where each time point should be a column

Summary

In this tutorial, we learned how to convert Pandas DataFrames from long (tidy) format to wide format using three main methods:

pivot(): Best for clean data without duplicates
pivot_table(): Handles duplicate entries by applying aggregation functions
unstack(): Works with MultiIndex DataFrames and is useful after grouping operations

Reshaping data is a fundamental skill in data wrangling that allows you to structure your data in the most appropriate way for specific analyses or visualizations.

Exercises

To practice your reshaping skills, try these exercises:

Download a dataset about weather measurements (temperature, humidity, etc.) from different cities over time. Convert it from long to wide format to compare cities side by side.
Take the wide format dataset from Exercise 1 and convert it back to long format.
Create a DataFrame with sales data by product, store, and date. Use pivot_table() to find the average sales by product across all stores and dates.
Challenge: Find a dataset with multiple measurements per category, and reshape it from long to wide while simultaneously calculating multiple aggregations (mean, max, and count).

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Long and Wide Data Formats​

Long (Tidy) Format​

Wide Format​

Methods to Convert Long to Wide Format in Pandas​

Using pivot()​

Basic Syntax​

Example: Student Scores by Subject​

Using pivot_table()​

Basic Syntax​

Example: Average Sales by Region and Product​

Using unstack()​

Basic Syntax​

Example: Temperature Readings by City and Month​

Practical Application: Time Series Analysis​

Example: Daily Online Store Metrics​

Handling Missing Values​

When to Use Long vs. Wide Format​

Use Long Format When:​

Use Wide Format When:​

Summary​

Exercises​

Additional Resources​

Introduction

Understanding Long and Wide Data Formats

Long (Tidy) Format

Wide Format

Methods to Convert Long to Wide Format in Pandas

Using `pivot()`

Basic Syntax

Example: Student Scores by Subject

Using `pivot_table()`

Basic Syntax

Example: Average Sales by Region and Product

Using `unstack()`

Basic Syntax

Example: Temperature Readings by City and Month

Practical Application: Time Series Analysis

Example: Daily Online Store Metrics

Handling Missing Values

When to Use Long vs. Wide Format

Use Long Format When:

Use Wide Format When:

Summary

Exercises

Additional Resources