Skip to main content

Pandas Pivot

When working with data in pandas, you'll often need to reorganize or restructure it to make it more suitable for analysis or visualization. The pivot() function is a powerful tool that allows you to reshape your data by specifying which columns should become the new index, columns, and values.

What is Pivot?

The pivot operation allows you to transform your data from a "long" format to a "wide" format. It's similar to creating pivot tables in spreadsheet applications like Excel. This operation is particularly useful when you want to:

  • Reorganize your data for better visualization
  • Prepare your data for specific types of analysis
  • Convert stacked data into a more readable tabular format

Basic Syntax

The basic syntax of the pivot() function is:

python
DataFrame.pivot(index=None, columns=None, values=None)

Where:

  • index: Column to use for the resulting table's index
  • columns: Column to use for the resulting table's columns
  • values: Column(s) to use for populating the values in the resulting table

Simple Example

Let's start with a basic example to understand how pivoting works:

python
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product': ['A', 'B', 'A', 'B'],
'sales': [100, 150, 120, 180]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
date product sales
0 2023-01-01 A 100
1 2023-01-01 B 150
2 2023-01-02 A 120
3 2023-01-02 B 180

Now, let's pivot this data to have dates as rows, products as columns, and sales as values:

python
pivoted_df = df.pivot(index='date', columns='product', values='sales')
print("\nPivoted DataFrame:")
print(pivoted_df)

Output:

Pivoted DataFrame:
product A B
date
2023-01-01 100.0 150.0
2023-01-02 120.0 180.0

The pivot operation has restructured our data so that:

  • The unique values from the date column became the index
  • The unique values from the product column became the column headers
  • The corresponding values from the sales column filled the table

Handling Duplicate Values

One limitation of the pivot() function is that it requires unique combinations of index and column values. If there are duplicate combinations, you'll get a ValueError.

Let's see what happens if we have duplicate data:

python
# Create a DataFrame with duplicate combinations
data_with_duplicates = {
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
'product': ['A', 'A', 'B', 'B'],
'sales': [100, 110, 150, 180]
}

df_duplicates = pd.DataFrame(data_with_duplicates)
print("DataFrame with duplicates:")
print(df_duplicates)

Output:

DataFrame with duplicates:
date product sales
0 2023-01-01 A 100
1 2023-01-01 A 110
2 2023-01-01 B 150
3 2023-01-02 B 180

If we try to pivot this data:

python
try:
df_duplicates.pivot(index='date', columns='product', values='sales')
except ValueError as e:
print(f"\nError: {e}")

Output:

Error: Index contains duplicate entries, cannot reshape

To handle duplicates, you can use the pivot_table() function instead, which can aggregate duplicate values:

python
pivot_table_df = df_duplicates.pivot_table(index='date', columns='product', values='sales', aggfunc='mean')
print("\nPivot Table (with mean aggregation):")
print(pivot_table_df)

Output:

Pivot Table (with mean aggregation):
product A B
date
2023-01-01 105.0 150.0
2023-01-02 NaN 180.0

Multiple Value Columns

You can also pivot multiple value columns at once:

python
# Create DataFrame with multiple value columns
data_multi = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product': ['A', 'B', 'A', 'B'],
'sales': [100, 150, 120, 180],
'units': [10, 15, 12, 18]
}

df_multi = pd.DataFrame(data_multi)
print("DataFrame with multiple value columns:")
print(df_multi)

Output:

DataFrame with multiple value columns:
date product sales units
0 2023-01-01 A 100 10
1 2023-01-01 B 150 15
2 2023-01-02 A 120 12
3 2023-01-02 B 180 18
python
# To pivot with multiple value columns, we need to use pivot_table
multi_pivot = df_multi.pivot_table(index='date', columns='product')
print("\nPivoted DataFrame with multiple value columns:")
print(multi_pivot)

Output:

Pivoted DataFrame with multiple value columns:
sales units
product A B A B
date
2023-01-01 100.0 150.0 10.0 15.0
2023-01-02 120.0 180.0 12.0 18.0

Notice how this creates a hierarchical column structure with sales and units as the top level, and the product types as the second level.

Real-World Example: Temperature Analysis

Let's look at a more practical example. Suppose you have temperature data for different cities over several days, and you want to analyze and visualize it effectively.

python
# Temperature data for different cities
temp_data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-01',
'2023-01-02', '2023-01-02', '2023-01-02',
'2023-01-03', '2023-01-03', '2023-01-03'],
'city': ['New York', 'Chicago', 'San Francisco',
'New York', 'Chicago', 'San Francisco',
'New York', 'Chicago', 'San Francisco'],
'temperature': [32, 28, 60, 30, 25, 59, 31, 27, 58]
}

temp_df = pd.DataFrame(temp_data)
print("Temperature Data:")
print(temp_df)

Output:

Temperature Data:
date city temperature
0 2023-01-01 New York 32
1 2023-01-01 Chicago 28
2 2023-01-01 San Francisco 60
3 2023-01-02 New York 30
4 2023-01-02 Chicago 25
5 2023-01-02 San Francisco 59
6 2023-01-03 New York 31
7 2023-01-03 Chicago 27
8 2023-01-03 San Francisco 58

Let's pivot this to get a clearer view of temperature trends by city:

python
temp_pivoted = temp_df.pivot(index='date', columns='city', values='temperature')
print("\nPivoted Temperature Data:")
print(temp_pivoted)

Output:

Pivoted Temperature Data:
city Chicago New York San Francisco
date
2023-01-01 28 32 60
2023-01-02 25 30 59
2023-01-03 27 31 58

Now you can easily compare temperatures across cities and days. You might visualize this data with:

python
import matplotlib.pyplot as plt

# Plot temperature trends
temp_pivoted.plot(kind='line', marker='o', figsize=(10, 6))
plt.title('Temperature Trends by City')
plt.ylabel('Temperature (°F)')
plt.grid(True)
plt.tight_layout()

Reshaping Back: From Wide to Long Format

After pivoting data from long to wide format, you might want to convert it back. You can use the melt() function for this purpose, which is essentially the reverse operation of pivot:

python
# Starting with our pivoted temperature data
print("Pivoted (wide) format:")
print(temp_pivoted)

# Melt back to long format
melted_df = temp_pivoted.reset_index().melt(id_vars='date',
value_name='temperature',
var_name='city')
print("\nMelted (long) format:")
print(melted_df)

Output:

Pivoted (wide) format:
city Chicago New York San Francisco
date
2023-01-01 28 32 60
2023-01-02 25 30 59
2023-01-03 27 31 58

Melted (long) format:
date city temperature
0 2023-01-01 Chicago 28
1 2023-01-02 Chicago 25
2 2023-01-03 Chicago 27
3 2023-01-01 New York 32
4 2023-01-02 New York 30
5 2023-01-03 New York 31
6 2023-01-01 San Francisco 60
7 2023-01-02 San Francisco 59
8 2023-01-03 San Francisco 58

Practical Use Case: Sales Analysis

Let's explore a more complex example analyzing sales data:

python
# Create sales data
sales_data = {
'date': pd.date_range(start='2023-01-01', periods=12, freq='D').repeat(3),
'product': ['Laptop', 'Phone', 'Tablet'] * 12,
'region': ['North', 'South', 'East', 'West'] * 9,
'sales': np.random.randint(5, 50, size=36),
'units': np.random.randint(1, 10, size=36)
}

sales_df = pd.DataFrame(sales_data)
print("Sales Data Sample:")
print(sales_df.head(10))

We can pivot this data to analyze sales by date and product:

python
# Pivot to analyze daily sales by product
daily_product_sales = sales_df.pivot_table(
index='date',
columns='product',
values='sales',
aggfunc='sum'
)

print("\nDaily Product Sales:")
print(daily_product_sales.head())

# Pivot to analyze regional sales by product
regional_product_sales = sales_df.pivot_table(
index='region',
columns='product',
values=['sales', 'units'],
aggfunc='sum'
)

print("\nRegional Product Sales:")
print(regional_product_sales)

This type of reshaping makes it much easier to spot trends, compare performance across categories, and create meaningful visualizations.

Summary

The pivot() function in pandas is a powerful tool for reshaping data from long to wide format. It allows you to:

  • Reorganize your data for better readability and analysis
  • Transform data for visualization
  • Create cross-tabulations similar to Excel pivot tables

Key points to remember:

  • Basic syntax: DataFrame.pivot(index, columns, values)
  • Requires unique combinations of index and column values
  • Use pivot_table() when you have duplicate entries that need aggregation
  • Can handle multiple value columns, creating hierarchical column structures
  • Pairs well with melt() for converting back from wide to long format

Exercises

  1. Practice pivoting a dataset with temperature data for different cities over time.
  2. Create a dataset with sales information and use pivot to analyze sales by product and region.
  3. Use pivot_table() with different aggregation functions (sum, mean, max) on a dataset with duplicate entries.
  4. Pivot a dataset with multiple value columns and analyze the resulting hierarchical column structure.
  5. Use pivot to prepare data for a visualization showing trends over time.

Additional Resources

Happy pivoting!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)