Skip to main content

Pandas Hierarchical Groupby

When working with complex datasets in pandas, you often need to analyze data across multiple dimensions or categories simultaneously. This is where hierarchical groupby operations become invaluable. In this tutorial, we'll explore how to group data across multiple levels and perform aggregate operations efficiently.

Introduction to Hierarchical Groupby

Hierarchical groupby (also called multi-level groupby) extends pandas' standard groupby functionality by allowing you to group data across multiple columns or levels. This creates a hierarchical structure of groups and subgroups, enabling more nuanced analysis of your data.

Think of hierarchical groupby as creating a tree-like structure:

  • The first groupby column forms the top-level branches
  • Each additional column creates nested subdivisions
  • Aggregate functions can be applied at any level of the hierarchy

Basic Hierarchical Groupby Syntax

The syntax for a hierarchical groupby is straightforward—you simply pass a list of columns to the groupby() function:

python
df.groupby(['column1', 'column2', 'column3'])

Let's see this in action with a simple example:

python
import pandas as pd
import numpy as np

# Create sample data
data = {
'region': ['East', 'East', 'West', 'West', 'East', 'West'],
'product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple', 'Apple'],
'sales': [50, 40, 35, 30, 45, 25],
'quantity': [10, 8, 7, 6, 9, 5]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
region product sales quantity
0 East Apple 50 10
1 East Banana 40 8
2 West Apple 35 7
3 West Banana 30 6
4 East Apple 45 9
5 West Apple 25 5

Now, let's group by both 'region' and 'product':

python
# Create a hierarchical groupby
grouped = df.groupby(['region', 'product'])

# Calculate sum for each group
result = grouped.sum()
print("\nHierarchical GroupBy Sum:")
print(result)

Output:

Hierarchical GroupBy Sum:
sales quantity
region product
East Apple 95 19
Banana 40 8
West Apple 60 12
Banana 30 6

Understanding the Hierarchical Index

Notice that the result has a multi-level index (also called a hierarchical index). The first level is 'region' and the second level is 'product'. This structure reflects our grouping hierarchy.

You can access specific groups using the hierarchical indexing:

python
# Access a specific group
print("\nData for East region, Apple products:")
print(result.loc['East', 'Apple'])

Output:

Data for East region, Apple products:
sales 95
quantity 19
Name: (East, Apple), dtype: int64

Aggregate Functions with Hierarchical Groupby

Just like regular groupby operations, you can apply various aggregate functions to your hierarchically grouped data:

python
# Multiple aggregations
agg_result = grouped.agg({
'sales': ['sum', 'mean', 'max'],
'quantity': ['sum', 'mean']
})

print("\nMultiple Aggregations:")
print(agg_result)

Output:

Multiple Aggregations:
sales quantity
sum mean max sum mean
region product
East Apple 95 47.5 50 19 9.5
Banana 40 40.0 40 8 8.0
West Apple 60 30.0 35 12 6.0
Banana 30 30.0 30 6 6.0

Unstack for Better Visualization

Sometimes, it's easier to analyze hierarchical data when it's presented in a matrix form. The unstack() method transforms a level of the hierarchical index into columns:

python
# Unstack the product level
unstacked = grouped.sum().unstack('product')
print("\nUnstacked Result:")
print(unstacked)

Output:

Unstacked Result:
sales quantity
product Apple Banana Apple Banana
region
East 95 40 19 8
West 60 30 12 6

This format makes it easier to compare products within each region at a glance.

Real-World Example: Sales Data Analysis

Let's apply hierarchical groupby to a more realistic sales dataset:

python
# Create a more comprehensive sales dataset
sales_data = {
'date': pd.date_range('2023-01-01', periods=20),
'region': np.random.choice(['North', 'South', 'East', 'West'], 20),
'category': np.random.choice(['Electronics', 'Clothing', 'Food'], 20),
'product': np.random.choice(['Product A', 'Product B', 'Product C'], 20),
'sales': np.random.randint(100, 1000, 20),
'units': np.random.randint(1, 50, 20)
}

sales_df = pd.DataFrame(sales_data)
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['month'] = sales_df['date'].dt.month_name()

print("Sales Dataset (Sample):")
print(sales_df.head())

Now, let's analyze monthly sales by region and product category:

python
# Group by month, region, and category
monthly_sales = sales_df.groupby(['month', 'region', 'category']).agg({
'sales': ['sum', 'mean'],
'units': 'sum'
})

print("\nMonthly Sales Analysis:")
print(monthly_sales)

# Let's focus on January sales, unstacked by region
jan_sales = monthly_sales.loc['January'].unstack('region')
print("\nJanuary Sales by Region:")
print(jan_sales)

Filtering Groups in Hierarchical Groupby

We can also filter groups based on aggregate conditions:

python
# Filter to find regions and categories with total sales > 1000
high_sales_groups = sales_df.groupby(['region', 'category']).filter(lambda x: x['sales'].sum() > 1000)

print("\nHigh Sales Groups:")
print(high_sales_groups.groupby(['region', 'category']).sum())

Using GroupBy.transform() with Hierarchical Groups

The transform() method can apply functions that return data with the same shape as the input, which is useful for operations like normalization within groups:

python
# Add percentage of region total
sales_df['region_pct'] = sales_df.groupby(['region', 'category'])['sales'].transform(
lambda x: x / x.sum() * 100
)

print("\nSales with Percentage of Region-Category Total:")
print(sales_df[['region', 'category', 'sales', 'region_pct']].head())

Practical Tips for Hierarchical Groupby

Here are some useful tips when working with hierarchical groupby:

1. Managing the Level Order

You can change the order of levels in a hierarchical index:

python
# Swap index levels
result_swapped = result.swaplevel('region', 'product')
print("\nSwapped Levels:")
print(result_swapped)

2. Using get_group()

For direct access to a specific group:

python
# Get a specific group directly
east_apple_group = grouped.get_group(('East', 'Apple'))
print("\nEast Apple Group:")
print(east_apple_group)

3. Sorting the Index

For better readability:

python
# Sort index for clearer presentation
sorted_result = result.sort_index()
print("\nSorted Result:")
print(sorted_result)

Summary

Hierarchical groupby in pandas provides a powerful way to analyze multi-dimensional data:

  • Group data by multiple columns using df.groupby([col1, col2, ...])
  • The result has a multi-level index reflecting your grouping structure
  • Access specific groups using hierarchical indexing
  • Transform hierarchical results using unstack() for better visualization
  • Apply multiple aggregations to different columns
  • Filter, transform, and manipulate hierarchically grouped data

This technique is particularly valuable when analyzing complex datasets with multiple categories, dimensions, or time periods.

Exercises

  1. Create a dataset with information about students, courses, and grades. Use hierarchical groupby to find the average grade by department and course.

  2. Using a sales dataset, group by year, quarter, and product category to analyze seasonal trends.

  3. Practice using unstack() at different levels to reshape hierarchical groupby results.

  4. Experiment with different aggregation functions on a hierarchically grouped dataset.

Additional Resources

By mastering hierarchical groupby operations, you'll greatly enhance your ability to extract insights from complex, multi-dimensional datasets using pandas.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)