Pandas Pivot
When working with data in pandas, you'll often need to reorganize or restructure it to make it more suitable for analysis or visualization. The pivot()
function is a powerful tool that allows you to reshape your data by specifying which columns should become the new index, columns, and values.
What is Pivot?
The pivot operation allows you to transform your data from a "long" format to a "wide" format. It's similar to creating pivot tables in spreadsheet applications like Excel. This operation is particularly useful when you want to:
- Reorganize your data for better visualization
- Prepare your data for specific types of analysis
- Convert stacked data into a more readable tabular format
Basic Syntax
The basic syntax of the pivot()
function is:
DataFrame.pivot(index=None, columns=None, values=None)
Where:
index
: Column to use for the resulting table's indexcolumns
: Column to use for the resulting table's columnsvalues
: Column(s) to use for populating the values in the resulting table
Simple Example
Let's start with a basic example to understand how pivoting works:
import pandas as pd
import numpy as np
# Create a simple DataFrame
data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product': ['A', 'B', 'A', 'B'],
'sales': [100, 150, 120, 180]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
date product sales
0 2023-01-01 A 100
1 2023-01-01 B 150
2 2023-01-02 A 120
3 2023-01-02 B 180
Now, let's pivot this data to have dates as rows, products as columns, and sales as values:
pivoted_df = df.pivot(index='date', columns='product', values='sales')
print("\nPivoted DataFrame:")
print(pivoted_df)
Output:
Pivoted DataFrame:
product A B
date
2023-01-01 100.0 150.0
2023-01-02 120.0 180.0
The pivot operation has restructured our data so that:
- The unique values from the
date
column became the index - The unique values from the
product
column became the column headers - The corresponding values from the
sales
column filled the table
Handling Duplicate Values
One limitation of the pivot()
function is that it requires unique combinations of index and column values. If there are duplicate combinations, you'll get a ValueError
.
Let's see what happens if we have duplicate data:
# Create a DataFrame with duplicate combinations
data_with_duplicates = {
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
'product': ['A', 'A', 'B', 'B'],
'sales': [100, 110, 150, 180]
}
df_duplicates = pd.DataFrame(data_with_duplicates)
print("DataFrame with duplicates:")
print(df_duplicates)
Output:
DataFrame with duplicates:
date product sales
0 2023-01-01 A 100
1 2023-01-01 A 110
2 2023-01-01 B 150
3 2023-01-02 B 180
If we try to pivot this data:
try:
df_duplicates.pivot(index='date', columns='product', values='sales')
except ValueError as e:
print(f"\nError: {e}")
Output:
Error: Index contains duplicate entries, cannot reshape
To handle duplicates, you can use the pivot_table()
function instead, which can aggregate duplicate values:
pivot_table_df = df_duplicates.pivot_table(index='date', columns='product', values='sales', aggfunc='mean')
print("\nPivot Table (with mean aggregation):")
print(pivot_table_df)
Output:
Pivot Table (with mean aggregation):
product A B
date
2023-01-01 105.0 150.0
2023-01-02 NaN 180.0
Multiple Value Columns
You can also pivot multiple value columns at once:
# Create DataFrame with multiple value columns
data_multi = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product': ['A', 'B', 'A', 'B'],
'sales': [100, 150, 120, 180],
'units': [10, 15, 12, 18]
}
df_multi = pd.DataFrame(data_multi)
print("DataFrame with multiple value columns:")
print(df_multi)
Output:
DataFrame with multiple value columns:
date product sales units
0 2023-01-01 A 100 10
1 2023-01-01 B 150 15
2 2023-01-02 A 120 12
3 2023-01-02 B 180 18
# To pivot with multiple value columns, we need to use pivot_table
multi_pivot = df_multi.pivot_table(index='date', columns='product')
print("\nPivoted DataFrame with multiple value columns:")
print(multi_pivot)
Output:
Pivoted DataFrame with multiple value columns:
sales units
product A B A B
date
2023-01-01 100.0 150.0 10.0 15.0
2023-01-02 120.0 180.0 12.0 18.0
Notice how this creates a hierarchical column structure with sales
and units
as the top level, and the product types as the second level.
Real-World Example: Temperature Analysis
Let's look at a more practical example. Suppose you have temperature data for different cities over several days, and you want to analyze and visualize it effectively.
# Temperature data for different cities
temp_data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-01',
'2023-01-02', '2023-01-02', '2023-01-02',
'2023-01-03', '2023-01-03', '2023-01-03'],
'city': ['New York', 'Chicago', 'San Francisco',
'New York', 'Chicago', 'San Francisco',
'New York', 'Chicago', 'San Francisco'],
'temperature': [32, 28, 60, 30, 25, 59, 31, 27, 58]
}
temp_df = pd.DataFrame(temp_data)
print("Temperature Data:")
print(temp_df)
Output:
Temperature Data:
date city temperature
0 2023-01-01 New York 32
1 2023-01-01 Chicago 28
2 2023-01-01 San Francisco 60
3 2023-01-02 New York 30
4 2023-01-02 Chicago 25
5 2023-01-02 San Francisco 59
6 2023-01-03 New York 31
7 2023-01-03 Chicago 27
8 2023-01-03 San Francisco 58
Let's pivot this to get a clearer view of temperature trends by city:
temp_pivoted = temp_df.pivot(index='date', columns='city', values='temperature')
print("\nPivoted Temperature Data:")
print(temp_pivoted)
Output:
Pivoted Temperature Data:
city Chicago New York San Francisco
date
2023-01-01 28 32 60
2023-01-02 25 30 59
2023-01-03 27 31 58
Now you can easily compare temperatures across cities and days. You might visualize this data with:
import matplotlib.pyplot as plt
# Plot temperature trends
temp_pivoted.plot(kind='line', marker='o', figsize=(10, 6))
plt.title('Temperature Trends by City')
plt.ylabel('Temperature (°F)')
plt.grid(True)
plt.tight_layout()
Reshaping Back: From Wide to Long Format
After pivoting data from long to wide format, you might want to convert it back. You can use the melt()
function for this purpose, which is essentially the reverse operation of pivot:
# Starting with our pivoted temperature data
print("Pivoted (wide) format:")
print(temp_pivoted)
# Melt back to long format
melted_df = temp_pivoted.reset_index().melt(id_vars='date',
value_name='temperature',
var_name='city')
print("\nMelted (long) format:")
print(melted_df)
Output:
Pivoted (wide) format:
city Chicago New York San Francisco
date
2023-01-01 28 32 60
2023-01-02 25 30 59
2023-01-03 27 31 58
Melted (long) format:
date city temperature
0 2023-01-01 Chicago 28
1 2023-01-02 Chicago 25
2 2023-01-03 Chicago 27
3 2023-01-01 New York 32
4 2023-01-02 New York 30
5 2023-01-03 New York 31
6 2023-01-01 San Francisco 60
7 2023-01-02 San Francisco 59
8 2023-01-03 San Francisco 58
Practical Use Case: Sales Analysis
Let's explore a more complex example analyzing sales data:
# Create sales data
sales_data = {
'date': pd.date_range(start='2023-01-01', periods=12, freq='D').repeat(3),
'product': ['Laptop', 'Phone', 'Tablet'] * 12,
'region': ['North', 'South', 'East', 'West'] * 9,
'sales': np.random.randint(5, 50, size=36),
'units': np.random.randint(1, 10, size=36)
}
sales_df = pd.DataFrame(sales_data)
print("Sales Data Sample:")
print(sales_df.head(10))
We can pivot this data to analyze sales by date and product:
# Pivot to analyze daily sales by product
daily_product_sales = sales_df.pivot_table(
index='date',
columns='product',
values='sales',
aggfunc='sum'
)
print("\nDaily Product Sales:")
print(daily_product_sales.head())
# Pivot to analyze regional sales by product
regional_product_sales = sales_df.pivot_table(
index='region',
columns='product',
values=['sales', 'units'],
aggfunc='sum'
)
print("\nRegional Product Sales:")
print(regional_product_sales)
This type of reshaping makes it much easier to spot trends, compare performance across categories, and create meaningful visualizations.
Summary
The pivot()
function in pandas is a powerful tool for reshaping data from long to wide format. It allows you to:
- Reorganize your data for better readability and analysis
- Transform data for visualization
- Create cross-tabulations similar to Excel pivot tables
Key points to remember:
- Basic syntax:
DataFrame.pivot(index, columns, values)
- Requires unique combinations of index and column values
- Use
pivot_table()
when you have duplicate entries that need aggregation - Can handle multiple value columns, creating hierarchical column structures
- Pairs well with
melt()
for converting back from wide to long format
Exercises
- Practice pivoting a dataset with temperature data for different cities over time.
- Create a dataset with sales information and use pivot to analyze sales by product and region.
- Use
pivot_table()
with different aggregation functions (sum, mean, max) on a dataset with duplicate entries. - Pivot a dataset with multiple value columns and analyze the resulting hierarchical column structure.
- Use pivot to prepare data for a visualization showing trends over time.
Additional Resources
- Pandas Documentation on pivot
- Pandas Documentation on pivot_table
- Reshaping and pivot tables in pandas
Happy pivoting!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)