Pandas Long to Wide
Introduction
When working with data in Python using Pandas, you'll often encounter datasets in different formats. Two common formats are "long" (or "tidy") format and "wide" format. Converting data from long to wide format is a frequent operation in data preprocessing and analysis.
In this tutorial, we'll explore:
- What long and wide formats are
- Why you might need to convert between them
- Methods to transform data from long to wide format using Pandas
- Practical examples and use cases
Understanding Long and Wide Data Formats
Long (Tidy) Format
- Each variable forms a column
- Each observation forms a row
- Each value has its own cell
- Often more normalized and flexible for analysis
Wide Format
- Variables are spread across multiple columns
- Each row represents an entity with multiple measurements
- More compact and sometimes easier to read visually
- Often preferred for reporting and certain visualizations
Methods to Convert Long to Wide Format in Pandas
Pandas provides several powerful functions to reshape data from long to wide format:
pivot()
pivot_table()
unstack()
Let's explore each with examples.
Using pivot()
The pivot()
method is the most straightforward way to reshape data when you have perfectly clean data with unique identifiers.
Basic Syntax
df.pivot(index='index_column', columns='column_to_become_headers', values='values_column')
Example: Student Scores by Subject
import pandas as pd
import numpy as np
# Create a sample DataFrame in long format
data = {
'student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
'subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English'],
'score': [85, 92, 78, 91, 87, 85, 76, 88, 91]
}
long_df = pd.DataFrame(data)
print("Long Format DataFrame:")
print(long_df)
Output:
Long Format DataFrame:
student subject score
0 Alice Math 85
1 Alice Science 92
2 Alice English 78
3 Bob Math 91
4 Bob Science 87
5 Bob English 85
6 Charlie Math 76
7 Charlie Science 88
8 Charlie English 91
Now let's convert this to wide format using pivot()
:
# Convert to wide format
wide_df = long_df.pivot(index='student', columns='subject', values='score')
print("\nWide Format DataFrame:")
print(wide_df)
Output:
Wide Format DataFrame:
subject English Math Science
student
Alice 78 85 92
Bob 85 91 87
Charlie 91 76 88
Notice how the subjects became column headers, with each student as a row. This makes it easy to compare scores across subjects for each student.
Using pivot_table()
When your data contains duplicates that would create conflicts in a simple pivot operation, pivot_table()
comes to the rescue by allowing you to specify an aggregation function.
Basic Syntax
df.pivot_table(index='index_column', columns='column_to_become_headers',
values='values_column', aggfunc='aggregation_function')
Example: Average Sales by Region and Product
# Create a sample DataFrame with duplicate combinations
sales_data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
'region': ['North', 'North', 'South', 'South', 'North', 'South'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
'sales': [100, 120, 90, 95, 110, 85]
}
sales_df = pd.DataFrame(sales_data)
print("Sales Data (Long Format):")
print(sales_df)
# Convert to wide format using pivot_table
wide_sales = sales_df.pivot_table(
index='region',
columns='product',
values='sales',
aggfunc='mean' # Calculate the mean for any duplicate combinations
)
print("\nAverage Sales by Region and Product (Wide Format):")
print(wide_sales)
Output:
Sales Data (Long Format):
date region product sales
0 2023-01-01 North Widget 100
1 2023-01-01 North Gadget 120
2 2023-01-02 South Widget 90
3 2023-01-02 South Gadget 95
4 2023-01-03 North Widget 110
5 2023-01-03 South Widget 85
Average Sales by Region and Product (Wide Format):
product Gadget Widget
region
North 120.0 105.0
South 95.0 87.5
You can use different aggregation functions with pivot_table()
:
'mean'
: Average (default)'sum'
: Total'count'
: Count of values'min'
: Minimum value'max'
: Maximum value- Or provide a custom function
Using unstack()
The unstack()
method works on MultiIndex DataFrames and is particularly useful when working with grouped data.
Basic Syntax
# After setting a MultiIndex
df.unstack(level='level_name')
Example: Temperature Readings by City and Month
# Create a sample DataFrame
temp_data = {
'city': ['New York', 'New York', 'New York', 'Chicago', 'Chicago', 'Chicago'],
'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'temperature': [32, 35, 45, 21, 25, 38]
}
temp_df = pd.DataFrame(temp_data)
print("Temperature Data (Long Format):")
print(temp_df)
# Set MultiIndex first
indexed_temp = temp_df.set_index(['city', 'month'])['temperature']
print("\nIndexed Temperature Data:")
print(indexed_temp)
# Use unstack to convert to wide format
wide_temp = indexed_temp.unstack(level='month')
print("\nTemperature Data (Wide Format):")
print(wide_temp)
Output:
Temperature Data (Long Format):
city month temperature
0 New York Jan 32
1 New York Feb 35
2 New York Mar 45
3 Chicago Jan 21
4 Chicago Feb 25
5 Chicago Mar 38
Indexed Temperature Data:
city month
New York Jan 32
Feb 35
Mar 45
Chicago Jan 21
Feb 25
Mar 38
Name: temperature, dtype: int64
Temperature Data (Wide Format):
month Jan Feb Mar
city
Chicago 21 25 38
New York 32 35 45
Practical Application: Time Series Analysis
A common use case for reshaping data is time series analysis, where you might want to compare metrics over time.
Example: Daily Online Store Metrics
# Create sample time series data
dates = pd.date_range('2023-01-01', periods=9, freq='D')
metrics = ['visitors', 'sales', 'conversion_rate'] * 3
values = [120, 15, 0.125, 145, 20, 0.138, 135, 18, 0.133,
150, 25, 0.167, 180, 30, 0.167, 190, 32, 0.168,
165, 28, 0.17, 140, 22, 0.157, 130, 19, 0.146]
ts_data = pd.DataFrame({
'date': dates.repeat(3),
'metric': metrics,
'value': values
})
print("Time Series Data (Long Format):")
print(ts_data.head(9))
# Convert to wide format for analysis
ts_wide = ts_data.pivot(index='date', columns='metric', values='value')
print("\nTime Series Data (Wide Format):")
print(ts_wide.head())
# Now we can easily calculate correlations
print("\nCorrelation between metrics:")
print(ts_wide.corr())
# Or create a visualization (you would need matplotlib for this)
Output:
Time Series Data (Long Format):
date metric value
0 2023-01-01 visitors 120.00
1 2023-01-01 sales 15.00
2 2023-01-01 conversion_rate 0.13
3 2023-01-02 visitors 145.00
4 2023-01-02 sales 20.00
5 2023-01-02 conversion_rate 0.14
6 2023-01-03 visitors 135.00
7 2023-01-03 sales 18.00
8 2023-01-03 conversion_rate 0.13
Time Series Data (Wide Format):
metric conversion_rate sales visitors
date
2023-01-01 0.125 15.0 120.0
2023-01-02 0.138 20.0 145.0
2023-01-03 0.133 18.0 135.0
2023-01-04 0.167 25.0 150.0
2023-01-05 0.167 30.0 180.0
Correlation between metrics:
metric conversion_rate sales visitors
metric
conversion_rate 1.000000 0.950708 0.888509
sales 0.950708 1.000000 0.974943
visitors 0.888509 0.974943 1.000000
Handling Missing Values
When reshaping data, you may encounter missing values. Here's how to handle them:
# Create a sample DataFrame with missing values
missing_data = {
'store': ['A', 'A', 'A', 'B', 'B', 'C'],
'product': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple'],
'sales': [100, 120, 90, 80, 70, 110]
}
missing_df = pd.DataFrame(missing_data)
print("Original Data with Missing Combinations:")
print(missing_df)
# Convert to wide format with fill_value
wide_missing = missing_df.pivot_table(
index='store',
columns='product',
values='sales',
fill_value=0 # Replace NaN with zeros
)
print("\nWide Format with Filled Values:")
print(wide_missing)
Output:
Original Data with Missing Combinations:
store product sales
0 A apple 100
1 A banana 120
2 A orange 90
3 B apple 80
4 B banana 70
5 C apple 110
Wide Format with Filled Values:
product apple banana orange
store
A 100 120 90
B 80 70 0
C 110 0 0
When to Use Long vs. Wide Format
Use Long Format When:
- Performing statistical analysis with tools that expect tidy data
- Working with variable numbers of categories
- You need to filter, group, or aggregate data easily
Use Wide Format When:
- Creating certain types of visualizations (e.g., heatmaps)
- Building reports or dashboards for human readability
- Performing calculations across related columns
- Working with time series data where each time point should be a column
Summary
In this tutorial, we learned how to convert Pandas DataFrames from long (tidy) format to wide format using three main methods:
pivot()
: Best for clean data without duplicatespivot_table()
: Handles duplicate entries by applying aggregation functionsunstack()
: Works with MultiIndex DataFrames and is useful after grouping operations
Reshaping data is a fundamental skill in data wrangling that allows you to structure your data in the most appropriate way for specific analyses or visualizations.
Exercises
To practice your reshaping skills, try these exercises:
-
Download a dataset about weather measurements (temperature, humidity, etc.) from different cities over time. Convert it from long to wide format to compare cities side by side.
-
Take the wide format dataset from Exercise 1 and convert it back to long format.
-
Create a DataFrame with sales data by product, store, and date. Use
pivot_table()
to find the average sales by product across all stores and dates. -
Challenge: Find a dataset with multiple measurements per category, and reshape it from long to wide while simultaneously calculating multiple aggregations (mean, max, and count).
Additional Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)