Pandas Wide to Long
Data often comes in different shapes and formats. Two common formats you'll encounter are "wide" and "long" formats. Converting data from wide to long format is a frequent task in data preparation and analysis. In this tutorial, we'll explore how to transform wide-format data to long-format using Pandas.
What are Wide and Long Formats?
Before we dive into the transformation process, let's understand the difference between wide and long formats:
- Wide format: Each entity has a single row with multiple columns representing different variables.
- Long format: Each entity has multiple rows, with one row for each variable-value combination.
Long format is often preferred for data analysis, visualization with libraries like Seaborn, and statistical modeling.
Methods to Convert from Wide to Long
Pandas provides two primary functions for wide-to-long transformations:
pd.melt()
: A flexible function for unpivoting datapd.wide_to_long()
: A specialized function for certain reshaping patterns
Let's explore both with examples!
Using pd.melt()
The melt()
function is versatile and can handle many wide-to-long transformations.
Basic Syntax
pd.melt(df,
id_vars=[columns to keep],
value_vars=[columns to unpivot],
var_name='name for variable column',
value_name='name for value column')
Example 1: Basic Melting
Let's start with a simple example:
import pandas as pd
import numpy as np
# Create a wide-format dataframe
wide_df = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4'],
'math_score': [85, 78, 92, 88],
'science_score': [90, 82, 85, 94],
'history_score': [75, 85, 80, 88]
})
print("Wide format data:")
print(wide_df)
# Melt the dataframe to convert to long format
long_df = pd.melt(
wide_df,
id_vars=['student_id'],
value_vars=['math_score', 'science_score', 'history_score'],
var_name='subject',
value_name='score'
)
print("\nLong format data:")
print(long_df)
Output:
Wide format data:
student_id math_score science_score history_score
0 S1 85 90 75
1 S2 78 82 85
2 S3 92 85 80
3 S4 88 94 88
Long format data:
student_id subject score
0 S1 math_score 85
1 S2 math_score 78
2 S3 math_score 92
3 S4 math_score 88
4 S1 science_score 90
5 S2 science_score 82
6 S3 science_score 85
7 S4 science_score 94
8 S1 history_score 75
9 S2 history_score 85
10 S3 history_score 80
11 S4 history_score 88
Example 2: Cleaning Column Names After Melting
Sometimes we need to clean up the variable names after melting:
# Create a wide-format dataframe with more structured column names
wide_df = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4'],
'score_math_2022': [85, 78, 92, 88],
'score_science_2022': [90, 82, 85, 94],
'score_math_2023': [88, 80, 95, 90],
'score_science_2023': [92, 85, 88, 96]
})
print("Wide format data:")
print(wide_df)
# Melt the dataframe
long_df = pd.melt(
wide_df,
id_vars=['student_id'],
value_vars=['score_math_2022', 'score_science_2022',
'score_math_2023', 'score_science_2023'],
var_name='measurement',
value_name='score'
)
# Extract subject and year from the measurement column
long_df[['prefix', 'subject', 'year']] = long_df['measurement'].str.split('_', expand=True)
long_df = long_df.drop(['prefix', 'measurement'], axis=1)
print("\nLong format data with extracted components:")
print(long_df)
Output:
Wide format data:
student_id score_math_2022 score_science_2022 score_math_2023 score_science_2023
0 S1 85 90 88 92
1 S2 78 82 80 85
2 S3 92 85 95 88
3 S4 88 94 90 96
Long format data with extracted components:
student_id score subject year
0 S1 85 math 2022
1 S2 78 math 2022
2 S3 92 math 2022
3 S4 88 math 2022
4 S1 90 science 2022
5 S2 82 science 2022
6 S3 85 science 2022
7 S4 94 science 2022
8 S1 88 math 2023
9 S2 80 math 2023
10 S3 95 math 2023
11 S4 90 math 2023
12 S1 92 science 2023
13 S2 85 science 2023
14 S3 88 science 2023
15 S4 96 science 2023
Using pd.wide_to_long()
The wide_to_long()
function is designed for a specific pattern of wide data where column names have a common format.
Basic Syntax
pd.wide_to_long(df,
stubnames=['common prefix'],
i=['id columns'],
j='name for j column')
Example: Wide to Long with Structured Column Names
import pandas as pd
# Create a wide-format dataframe with structured column names
wide_df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['John', 'Jane', 'Mike'],
'height_2020': [175, 165, 180],
'weight_2020': [70, 55, 80],
'height_2021': [176, 166, 181],
'weight_2021': [72, 56, 78],
'height_2022': [177, 167, 182],
'weight_2022': [73, 57, 77]
})
print("Wide format data:")
print(wide_df)
# Using wide_to_long
long_df = pd.wide_to_long(
wide_df,
stubnames=['height', 'weight'],
i=['id', 'name'],
j='year'
)
print("\nLong format data:")
print(long_df)
# Reset index if needed
long_df_reset = long_df.reset_index()
print("\nLong format data with reset index:")
print(long_df_reset)
Output:
Wide format data:
id name height_2020 weight_2020 height_2021 weight_2021 height_2022 weight_2022
0 1 John 175 70 176 72 177 73
1 2 Jane 165 55 166 56 167 57
2 3 Mike 180 80 181 78 182 77
Long format data:
height weight
id name year
1 John 2020 175 70
2 Jane 2020 165 55
3 Mike 2020 180 80
1 John 2021 176 72
2 Jane 2021 166 56
3 Mike 2021 181 78
1 John 2022 177 73
2 Jane 2022 167 57
3 Mike 2022 182 77
Long format data with reset index:
id name year height weight
0 1 John 2020 175 70
1 2 Jane 2020 165 55
2 3 Mike 2020 180 80
3 1 John 2021 176 72
4 2 Jane 2021 166 56
5 3 Mike 2021 181 78
6 1 John 2022 177 73
7 2 Jane 2022 167 57
8 3 Mike 2022 182 77
Real-world Application: Analyzing Health Metrics Over Time
Let's see a practical example where reshaping from wide to long is useful:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample patient data in wide format
patients_wide = pd.DataFrame({
'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
'gender': ['M', 'F', 'M', 'F', 'M'],
'age': [45, 52, 35, 62, 41],
'bp_jan': [130, 142, 125, 155, 135],
'bp_feb': [128, 138, 127, 150, 132],
'bp_mar': [129, 135, 124, 148, 130],
'cholesterol_jan': [220, 245, 200, 255, 210],
'cholesterol_feb': [215, 235, 205, 245, 208],
'cholesterol_mar': [210, 230, 198, 240, 205]
})
print("Original patient data (wide format):")
print(patients_wide)
# First, melt blood pressure data
bp_long = pd.melt(
patients_wide,
id_vars=['patient_id', 'gender', 'age'],
value_vars=['bp_jan', 'bp_feb', 'bp_mar'],
var_name='month',
value_name='blood_pressure'
)
# Clean up month column
bp_long['month'] = bp_long['month'].str.replace('bp_', '')
# Next, melt cholesterol data
chol_long = pd.melt(
patients_wide,
id_vars=['patient_id', 'gender', 'age'],
value_vars=['cholesterol_jan', 'cholesterol_feb', 'cholesterol_mar'],
var_name='month',
value_name='cholesterol'
)
# Clean up month column
chol_long['month'] = chol_long['month'].str.replace('cholesterol_', '')
# Merge the melted dataframes
metrics_long = bp_long.merge(
chol_long,
on=['patient_id', 'gender', 'age', 'month']
)
print("\nReshaped patient data (long format):")
print(metrics_long)
# Now we can easily perform analysis by month or gender
plt.figure(figsize=(12, 5))
# Create first subplot for blood pressure trends
plt.subplot(1, 2, 1)
sns.lineplot(data=metrics_long, x='month', y='blood_pressure', hue='gender', marker='o')
plt.title('Blood Pressure Trends by Gender')
plt.ylabel('Blood Pressure (mmHg)')
# Create second subplot for cholesterol trends
plt.subplot(1, 2, 2)
sns.lineplot(data=metrics_long, x='month', y='cholesterol', hue='gender', marker='o')
plt.title('Cholesterol Trends by Gender')
plt.ylabel('Cholesterol (mg/dL)')
plt.tight_layout()
# plt.show() # Uncomment to display the plot
The resultant plots would show both blood pressure and cholesterol trends over the three months, separated by gender. This visualization wouldn't be as straightforward with the original wide-format data.
When to Use Wide vs. Long Format
Use wide format when:
- You need a compact representation of data
- You're creating summary tables for reporting
- Each variable has its own significance and is rarely analyzed together
Use long format when:
- You're using visualization libraries like Seaborn or ggplot
- You need to perform statistical analysis across variables
- You're working with time-series data
- You need to apply operations uniformly across variables
Summary
In this tutorial, we explored two powerful Pandas functions for reshaping data from wide to long format:
pd.melt()
: A flexible function that unpivots columns into rowspd.wide_to_long()
: A specialized function for structured column names
Converting data between wide and long formats is a fundamental skill in data preparation. The choice of format depends on your specific analysis needs, but long format is generally more versatile for statistical analysis and visualization.
Exercises for Practice
- Take a dataset of student scores across multiple subjects and years in wide format and convert it to long format.
- Create a wide-format dataset of weather data (temperature, humidity, etc.) for different cities and transform it to long format for easier visualization.
- Use the
wide_to_long()
function to reshape a dataset with multiple measurements taken at different time points. - Convert a wide-format time series of stock prices for multiple companies to long format.
Additional Resources
- Pandas Documentation: melt
- Pandas Documentation: wide_to_long
- Tidy Data Paper by Hadley Wickham
- Data Reshaping in Python
Understanding how to reshape data efficiently will significantly enhance your data wrangling capabilities and make subsequent analysis much more straightforward!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)