Pandas Reshape Methods
Data rarely comes in the exact format we need for analysis. Pandas provides powerful reshape methods that allow you to transform your data from one format to another. In this tutorial, we'll explore various techniques to reshape your data efficiently.
Introduction to Data Reshaping
Data reshaping involves rearranging data between wide format (where each variable has its own column) and long format (where variables are stored in rows). These operations are essential for:
- Data cleaning and preparation
- Statistical modeling
- Visualization
- Feature engineering
Let's dive into the key reshaping methods in Pandas!
Prerequisites
To follow this tutorial, you need:
- Basic knowledge of Python
- Familiarity with Pandas DataFrames and Series
- Pandas installed in your environment
import pandas as pd
import numpy as np
Pivot: From Long to Wide Format
The pivot()
method transforms long data into wide format by using unique values from one column as new column headers.
Basic Pivot Example
# Create a sample DataFrame in long format
data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'city': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'temperature': [32, 65, 30, 68]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
date city temperature
0 2023-01-01 New York 32
1 2023-01-01 Los Angeles 65
2 2023-01-02 New York 30
3 2023-01-02 Los Angeles 68
Now let's pivot this data to create a new DataFrame where cities become columns:
# Pivot the data: days as rows, cities as columns
pivoted = df.pivot(index='date', columns='city', values='temperature')
print("\nPivoted DataFrame:")
print(pivoted)
Output:
Pivoted DataFrame:
city Los Angeles New York
date
2023-01-01 65 32
2023-01-02 68 30
Pivot with Multiple Values
You can pivot multiple value columns at once by omitting the values
parameter:
# Add another measurement column
df['humidity'] = [80, 50, 85, 55]
print("\nDataFrame with multiple measurements:")
print(df)
# Pivot with multiple values
pivoted_multi = df.pivot(index='date', columns='city')
print("\nPivoted with multiple values:")
print(pivoted_multi)
Output:
DataFrame with multiple measurements:
date city temperature humidity
0 2023-01-01 New York 32 80
1 2023-01-01 Los Angeles 65 50
2 2023-01-02 New York 30 85
3 2023-01-02 Los Angeles 68 55
Pivoted with multiple values:
temperature humidity
city Los Angeles New York Los Angeles New York
date
2023-01-01 65 32 50 80
2023-01-02 68 30 55 85
Handling Duplicate Entries
If there are duplicate entries in your index and columns, pivot()
will raise an error. To handle duplicates, use pivot_table()
which can aggregate duplicate values:
# Create data with duplicates
data_dupe = {
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
'city': ['New York', 'New York', 'Los Angeles', 'Los Angeles'],
'temperature': [32, 34, 65, 68]
}
df_dupe = pd.DataFrame(data_dupe)
print("DataFrame with duplicates:")
print(df_dupe)
# Use pivot_table with aggregation
pivoted_table = df_dupe.pivot_table(index='date', columns='city', values='temperature', aggfunc='mean')
print("\nPivoted table with aggregation:")
print(pivoted_table)
Output:
DataFrame with duplicates:
date city temperature
0 2023-01-01 New York 32
1 2023-01-01 New York 34
2 2023-01-01 Los Angeles 65
3 2023-01-02 Los Angeles 68
Pivoted table with aggregation:
city Los Angeles New York
date
2023-01-01 65.0 33.0
2023-01-02 68.0 NaN
Melt: From Wide to Long Format
The melt()
function does the opposite of pivot()
- it transforms wide-format data into long format.
Basic Melt Example
# Create a sample wide-format DataFrame
wide_df = pd.DataFrame({
'name': ['John', 'Mary', 'Bob'],
'math': [90, 85, 92],
'science': [88, 95, 79],
'history': [76, 82, 89]
})
print("Original wide DataFrame:")
print(wide_df)
# Melt the DataFrame
melted = wide_df.melt(id_vars=['name'], var_name='subject', value_name='score')
print("\nMelted DataFrame:")
print(melted)
Output:
Original wide DataFrame:
name math science history
0 John 90 88 76
1 Mary 85 95 82
2 Bob 92 79 89
Melted DataFrame:
name subject score
0 John math 90
1 Mary math 85
2 Bob math 92
3 John science 88
4 Mary science 95
5 Bob science 79
6 John history 76
7 Mary history 82
8 Bob history 89
Selective Melting
You can selectively melt only specific columns:
# Melt only math and science columns
melted_selective = wide_df.melt(
id_vars=['name'],
value_vars=['math', 'science'],
var_name='subject',
value_name='score'
)
print("\nSelectively melted DataFrame:")
print(melted_selective)
Output:
Selectively melted DataFrame:
name subject score
0 John math 90
1 Mary math 85
2 Bob math 92
3 John science 88
4 Mary science 95
5 Bob science 79
Stack and Unstack: Working with MultiIndex
The stack()
and unstack()
methods reshape data by moving the innermost column index to become the innermost row index (stacking) or vice versa (unstacking).
Stacking Example
# Create a multi-column DataFrame
multi_df = pd.DataFrame({
('A', 'one'): [1, 2, 3],
('A', 'two'): [4, 5, 6],
('B', 'one'): [7, 8, 9],
('B', 'two'): [10, 11, 12]
})
multi_df.columns = pd.MultiIndex.from_tuples(multi_df.columns)
print("Multi-column DataFrame:")
print(multi_df)
# Stack the DataFrame
stacked = multi_df.stack()
print("\nStacked DataFrame:")
print(stacked)
Output:
Multi-column DataFrame:
A B
one two one two
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
Stacked DataFrame:
A B
0 one 1 7
two 4 10
1 one 2 8
two 5 11
2 one 3 9
two 6 12
Unstacking Example
You can also unstack a stacked DataFrame:
# Unstack the first level
unstacked = stacked.unstack(0)
print("\nUnstacked DataFrame (level 0):")
print(unstacked)
# Unstack the first level from original stacked
unstacked_level1 = stacked.unstack(1)
print("\nUnstacked DataFrame (level 1):")
print(unstacked_level1)
Output:
Unstacked DataFrame (level 0):
0 1 2
A one 1 2 3
two 4 5 6
B one 7 8 9
two 10 11 12
Unstacked DataFrame (level 1):
one two
0 A 1 4
B 7 10
1 A 2 5
B 8 11
2 A 3 6
B 9 12
Real-World Application: Data Analysis for Monthly Sales
Let's explore a practical example using reshape methods for sales data analysis:
# Create a sample sales dataset
data = {
'date': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-15', '2023-03-05', '2023-03-20'],
'product': ['A', 'B', 'A', 'B', 'A', 'B'],
'region': ['East', 'East', 'West', 'West', 'East', 'West'],
'sales': [200, 150, 300, 250, 220, 180]
}
sales_df = pd.DataFrame(data)
# Convert date to datetime
sales_df['date'] = pd.to_datetime(sales_df['date'])
# Extract month for easier aggregation
sales_df['month'] = sales_df['date'].dt.strftime('%Y-%m')
print("Original sales data:")
print(sales_df)
# Reshape the data to analyze monthly sales by product
monthly_product_sales = sales_df.pivot_table(
index='month',
columns='product',
values='sales',
aggfunc='sum'
)
print("\nMonthly sales by product:")
print(monthly_product_sales)
# Reshape to analyze regional performance by product
regional_product_sales = sales_df.pivot_table(
index='region',
columns='product',
values='sales',
aggfunc='sum'
)
print("\nRegional sales by product:")
print(regional_product_sales)
# Convert wide format to long format for visualization
monthly_sales_long = monthly_product_sales.reset_index().melt(
id_vars=['month'],
var_name='product',
value_name='total_sales'
)
print("\nMonthly sales in long format (ready for visualization):")
print(monthly_sales_long)
Output:
Original sales data:
date product region sales month
0 2023-01-15 A East 200 2023-01
1 2023-01-20 B East 150 2023-01
2 2023-02-10 A West 300 2023-02
3 2023-02-15 B West 250 2023-02
4 2023-03-05 A East 220 2023-03
5 2023-03-20 B West 180 2023-03
Monthly sales by product:
product A B
month
2023-01 200 150
2023-02 300 250
2023-03 220 180
Regional sales by product:
product A B
region
East 420 150
West 300 430
Monthly sales in long format (ready for visualization):
month product total_sales
0 2023-01 A 200
1 2023-02 A 300
2 2023-03 A 220
3 2023-01 B 150
4 2023-02 B 250
5 2023-03 B 180
Summary
In this tutorial, we explored Pandas' powerful methods for reshaping data:
- Pivot: Transforms data from long format to wide format
- Pivot_table: Similar to pivot but handles duplicate entries through aggregation
- Melt: Transforms data from wide format to long format
- Stack/Unstack: Reshapes data by manipulating levels in MultiIndex DataFrames
Choosing the right reshape method depends on your data structure and analysis needs. These techniques are essential for data cleaning, transformation, and preparation for visualization or modeling.
Exercises
To solidify your understanding, try these exercises:
-
Create a DataFrame with daily weather data (date, city, temperature, humidity) and convert it to a wide format with cities as columns.
-
Start with a DataFrame of student grades by subject (student, math, science, history) and convert it to long format.
-
Create a MultiIndex DataFrame with sales data (year, quarter, product, revenue) and use stack/unstack to explore different views of the data.
-
Take a real dataset of your choice and identify a reshaping operation that would make analysis easier.
Additional Resources
- Pandas Documentation on Reshaping
- Pandas Pivot Table Documentation
- Pandas Melt Documentation
- Data Wrangling with Pandas Cheat Sheet
Happy data reshaping!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)