Pandas Wide to Long

Data often comes in different shapes and formats. Two common formats you'll encounter are "wide" and "long" formats. Converting data from wide to long format is a frequent task in data preparation and analysis. In this tutorial, we'll explore how to transform wide-format data to long-format using Pandas.

What are Wide and Long Formats?

Before we dive into the transformation process, let's understand the difference between wide and long formats:

Wide format: Each entity has a single row with multiple columns representing different variables.
Long format: Each entity has multiple rows, with one row for each variable-value combination.

Long format is often preferred for data analysis, visualization with libraries like Seaborn, and statistical modeling.

Methods to Convert from Wide to Long

Pandas provides two primary functions for wide-to-long transformations:

pd.melt(): A flexible function for unpivoting data
pd.wide_to_long(): A specialized function for certain reshaping patterns

Let's explore both with examples!

Using `pd.melt()`

The melt() function is versatile and can handle many wide-to-long transformations.

Basic Syntax

pd.melt(df, 
        id_vars=[columns to keep], 
        value_vars=[columns to unpivot],
        var_name='name for variable column', 
        value_name='name for value column')

Example 1: Basic Melting

Let's start with a simple example:

import pandas as pd
import numpy as np

# Create a wide-format dataframe
wide_df = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4'],
    'math_score': [85, 78, 92, 88],
    'science_score': [90, 82, 85, 94],
    'history_score': [75, 85, 80, 88]
})

print("Wide format data:")
print(wide_df)

# Melt the dataframe to convert to long format
long_df = pd.melt(
    wide_df, 
    id_vars=['student_id'],
    value_vars=['math_score', 'science_score', 'history_score'],
    var_name='subject',
    value_name='score'
)

print("\nLong format data:")
print(long_df)

Output:

Wide format data:
  student_id  math_score  science_score  history_score
       S1          85             90             75
       S2          78             82             85
       S3          92             85             80
       S4          88             94             88

Long format data:
   student_id        subject  score
        S1     math_score     85
        S2     math_score     78
        S3     math_score     92
        S4     math_score     88
        S1  science_score     90
        S2  science_score     82
        S3  science_score     85
        S4  science_score     94
        S1  history_score     75
        S2  history_score     85
       S3  history_score     80
       S4  history_score     88

Example 2: Cleaning Column Names After Melting

Sometimes we need to clean up the variable names after melting:

# Create a wide-format dataframe with more structured column names
wide_df = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4'],
    'score_math_2022': [85, 78, 92, 88],
    'score_science_2022': [90, 82, 85, 94],
    'score_math_2023': [88, 80, 95, 90],
    'score_science_2023': [92, 85, 88, 96]
})

print("Wide format data:")
print(wide_df)

# Melt the dataframe
long_df = pd.melt(
    wide_df, 
    id_vars=['student_id'],
    value_vars=['score_math_2022', 'score_science_2022', 
                'score_math_2023', 'score_science_2023'],
    var_name='measurement',
    value_name='score'
)

# Extract subject and year from the measurement column
long_df[['prefix', 'subject', 'year']] = long_df['measurement'].str.split('_', expand=True)
long_df = long_df.drop(['prefix', 'measurement'], axis=1)

print("\nLong format data with extracted components:")
print(long_df)

Output:

Wide format data:
  student_id  score_math_2022  score_science_2022  score_math_2023  score_science_2023
       S1               85                  90               88                  92
       S2               78                  82               80                  85
       S3               92                  85               95                  88
       S4               88                  94               90                  96

Long format data with extracted components:
   student_id  score subject  year
        S1     85    math  2022
        S2     78    math  2022
        S3     92    math  2022
        S4     88    math  2022
        S1     90 science  2022
        S2     82 science  2022
        S3     85 science  2022
        S4     94 science  2022
        S1     88    math  2023
        S2     80    math  2023
       S3     95    math  2023
       S4     90    math  2023
       S1     92 science  2023
       S2     85 science  2023
       S3     88 science  2023
       S4     96 science  2023

Using `pd.wide_to_long()`

The wide_to_long() function is designed for a specific pattern of wide data where column names have a common format.

Basic Syntax

pd.wide_to_long(df, 
                stubnames=['common prefix'], 
                i=['id columns'], 
                j='name for j column')

Example: Wide to Long with Structured Column Names

import pandas as pd

# Create a wide-format dataframe with structured column names
wide_df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['John', 'Jane', 'Mike'],
    'height_2020': [175, 165, 180],
    'weight_2020': [70, 55, 80],
    'height_2021': [176, 166, 181],
    'weight_2021': [72, 56, 78],
    'height_2022': [177, 167, 182],
    'weight_2022': [73, 57, 77]
})

print("Wide format data:")
print(wide_df)

# Using wide_to_long
long_df = pd.wide_to_long(
    wide_df, 
    stubnames=['height', 'weight'], 
    i=['id', 'name'], 
    j='year'
)

print("\nLong format data:")
print(long_df)

# Reset index if needed
long_df_reset = long_df.reset_index()
print("\nLong format data with reset index:")
print(long_df_reset)

Output:

Wide format data:
   id  name  height_2020  weight_2020  height_2021  weight_2021  height_2022  weight_2022
 1  John          175           70          176           72          177           73
 2  Jane          165           55          166           56          167           57
 3  Mike          180           80          181           78          182           77

Long format data:
                height  weight
id name year                  
John 2020     175      70
Jane 2020     165      55
Mike 2020     180      80
John 2021     176      72
Jane 2021     166      56
Mike 2021     181      78
John 2022     177      73
Jane 2022     167      57
Mike 2022     182      77

Long format data with reset index:
   id  name  year  height  weight
 1  John  2020     175      70
 2  Jane  2020     165      55
 3  Mike  2020     180      80
 1  John  2021     176      72
 2  Jane  2021     166      56
 3  Mike  2021     181      78
 1  John  2022     177      73
 2  Jane  2022     167      57
 3  Mike  2022     182      77

Real-world Application: Analyzing Health Metrics Over Time

Let's see a practical example where reshaping from wide to long is useful:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample patient data in wide format
patients_wide = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'age': [45, 52, 35, 62, 41],
    'bp_jan': [130, 142, 125, 155, 135],
    'bp_feb': [128, 138, 127, 150, 132],
    'bp_mar': [129, 135, 124, 148, 130],
    'cholesterol_jan': [220, 245, 200, 255, 210],
    'cholesterol_feb': [215, 235, 205, 245, 208],
    'cholesterol_mar': [210, 230, 198, 240, 205]
})

print("Original patient data (wide format):")
print(patients_wide)

# First, melt blood pressure data
bp_long = pd.melt(
    patients_wide,
    id_vars=['patient_id', 'gender', 'age'],
    value_vars=['bp_jan', 'bp_feb', 'bp_mar'],
    var_name='month',
    value_name='blood_pressure'
)

# Clean up month column
bp_long['month'] = bp_long['month'].str.replace('bp_', '')

# Next, melt cholesterol data
chol_long = pd.melt(
    patients_wide,
    id_vars=['patient_id', 'gender', 'age'],
    value_vars=['cholesterol_jan', 'cholesterol_feb', 'cholesterol_mar'],
    var_name='month',
    value_name='cholesterol'
)

# Clean up month column
chol_long['month'] = chol_long['month'].str.replace('cholesterol_', '')

# Merge the melted dataframes
metrics_long = bp_long.merge(
    chol_long, 
    on=['patient_id', 'gender', 'age', 'month']
)

print("\nReshaped patient data (long format):")
print(metrics_long)

# Now we can easily perform analysis by month or gender
plt.figure(figsize=(12, 5))

# Create first subplot for blood pressure trends
plt.subplot(1, 2, 1)
sns.lineplot(data=metrics_long, x='month', y='blood_pressure', hue='gender', marker='o')
plt.title('Blood Pressure Trends by Gender')
plt.ylabel('Blood Pressure (mmHg)')

# Create second subplot for cholesterol trends
plt.subplot(1, 2, 2)
sns.lineplot(data=metrics_long, x='month', y='cholesterol', hue='gender', marker='o')
plt.title('Cholesterol Trends by Gender')
plt.ylabel('Cholesterol (mg/dL)')

plt.tight_layout()
# plt.show()  # Uncomment to display the plot

The resultant plots would show both blood pressure and cholesterol trends over the three months, separated by gender. This visualization wouldn't be as straightforward with the original wide-format data.

When to Use Wide vs. Long Format

Use wide format when:

You need a compact representation of data
You're creating summary tables for reporting
Each variable has its own significance and is rarely analyzed together

Use long format when:

You're using visualization libraries like Seaborn or ggplot
You need to perform statistical analysis across variables
You're working with time-series data
You need to apply operations uniformly across variables

Summary

In this tutorial, we explored two powerful Pandas functions for reshaping data from wide to long format:

pd.melt(): A flexible function that unpivots columns into rows
pd.wide_to_long(): A specialized function for structured column names

Converting data between wide and long formats is a fundamental skill in data preparation. The choice of format depends on your specific analysis needs, but long format is generally more versatile for statistical analysis and visualization.

Exercises for Practice

Take a dataset of student scores across multiple subjects and years in wide format and convert it to long format.
Create a wide-format dataset of weather data (temperature, humidity, etc.) for different cities and transform it to long format for easier visualization.
Use the wide_to_long() function to reshape a dataset with multiple measurements taken at different time points.
Convert a wide-format time series of stock prices for multiple companies to long format.

Additional Resources

Understanding how to reshape data efficiently will significantly enhance your data wrangling capabilities and make subsequent analysis much more straightforward!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

What are Wide and Long Formats?​

Methods to Convert from Wide to Long​

Using pd.melt()​

Basic Syntax​

Example 1: Basic Melting​

Example 2: Cleaning Column Names After Melting​

Using pd.wide_to_long()​

Basic Syntax​

Example: Wide to Long with Structured Column Names​

Real-world Application: Analyzing Health Metrics Over Time​

When to Use Wide vs. Long Format​

Summary​

Exercises for Practice​

Additional Resources​

What are Wide and Long Formats?

Methods to Convert from Wide to Long

Using `pd.melt()`

Basic Syntax

Example 1: Basic Melting

Example 2: Cleaning Column Names After Melting

Using `pd.wide_to_long()`

Basic Syntax

Example: Wide to Long with Structured Column Names

Real-world Application: Analyzing Health Metrics Over Time

When to Use Wide vs. Long Format

Summary

Exercises for Practice

Additional Resources