Pandas Hierarchical Groupby

When working with complex datasets in pandas, you often need to analyze data across multiple dimensions or categories simultaneously. This is where hierarchical groupby operations become invaluable. In this tutorial, we'll explore how to group data across multiple levels and perform aggregate operations efficiently.

Introduction to Hierarchical Groupby

Hierarchical groupby (also called multi-level groupby) extends pandas' standard groupby functionality by allowing you to group data across multiple columns or levels. This creates a hierarchical structure of groups and subgroups, enabling more nuanced analysis of your data.

Think of hierarchical groupby as creating a tree-like structure:

The first groupby column forms the top-level branches
Each additional column creates nested subdivisions
Aggregate functions can be applied at any level of the hierarchy

Basic Hierarchical Groupby Syntax

The syntax for a hierarchical groupby is straightforward—you simply pass a list of columns to the groupby() function:

python
df.groupby(['column1', 'column2', 'column3'])

Let's see this in action with a simple example:

python
import pandas as pd
import numpy as np

# Create sample data
data = {
    'region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple', 'Apple'],
    'sales': [50, 40, 35, 30, 45, 25],
    'quantity': [10, 8, 7, 6, 9, 5]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
  region product  sales  quantity
 East   Apple     50        10
 East  Banana     40         8
 West   Apple     35         7
 West  Banana     30         6
 East   Apple     45         9
 West   Apple     25         5

Now, let's group by both 'region' and 'product':

python
# Create a hierarchical groupby
grouped = df.groupby(['region', 'product'])

# Calculate sum for each group
result = grouped.sum()
print("\nHierarchical GroupBy Sum:")
print(result)

Output:

Hierarchical GroupBy Sum:
               sales  quantity
region product                
East   Apple      95        19
       Banana     40         8
West   Apple      60        12
       Banana     30         6

Understanding the Hierarchical Index

Notice that the result has a multi-level index (also called a hierarchical index). The first level is 'region' and the second level is 'product'. This structure reflects our grouping hierarchy.

You can access specific groups using the hierarchical indexing:

python
# Access a specific group
print("\nData for East region, Apple products:")
print(result.loc['East', 'Apple'])

Output:

Data for East region, Apple products:
sales       95
quantity    19
Name: (East, Apple), dtype: int64

Aggregate Functions with Hierarchical Groupby

Just like regular groupby operations, you can apply various aggregate functions to your hierarchically grouped data:

python
# Multiple aggregations
agg_result = grouped.agg({
    'sales': ['sum', 'mean', 'max'],
    'quantity': ['sum', 'mean']
})

print("\nMultiple Aggregations:")
print(agg_result)

Output:

Multiple Aggregations:
               sales            quantity      
                 sum  mean max      sum  mean
region product                               
East   Apple      95  47.5  50       19   9.5
       Banana     40  40.0  40        8   8.0
West   Apple      60  30.0  35       12   6.0
       Banana     30  30.0  30        6   6.0

Unstack for Better Visualization

Sometimes, it's easier to analyze hierarchical data when it's presented in a matrix form. The unstack() method transforms a level of the hierarchical index into columns:

python
# Unstack the product level
unstacked = grouped.sum().unstack('product')
print("\nUnstacked Result:")
print(unstacked)

Output:

Unstacked Result:
       sales          quantity       
product Apple Banana  Apple Banana
region                             
East       95     40      19      8
West       60     30      12      6

This format makes it easier to compare products within each region at a glance.

Real-World Example: Sales Data Analysis

Let's apply hierarchical groupby to a more realistic sales dataset:

python
# Create a more comprehensive sales dataset
sales_data = {
    'date': pd.date_range('2023-01-01', periods=20),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 20),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food'], 20),
    'product': np.random.choice(['Product A', 'Product B', 'Product C'], 20),
    'sales': np.random.randint(100, 1000, 20),
    'units': np.random.randint(1, 50, 20)
}

sales_df = pd.DataFrame(sales_data)
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['month'] = sales_df['date'].dt.month_name()

print("Sales Dataset (Sample):")
print(sales_df.head())

Now, let's analyze monthly sales by region and product category:

python
# Group by month, region, and category
monthly_sales = sales_df.groupby(['month', 'region', 'category']).agg({
    'sales': ['sum', 'mean'],
    'units': 'sum'
})

print("\nMonthly Sales Analysis:")
print(monthly_sales)

# Let's focus on January sales, unstacked by region
jan_sales = monthly_sales.loc['January'].unstack('region')
print("\nJanuary Sales by Region:")
print(jan_sales)

Filtering Groups in Hierarchical Groupby

We can also filter groups based on aggregate conditions:

python
# Filter to find regions and categories with total sales > 1000
high_sales_groups = sales_df.groupby(['region', 'category']).filter(lambda x: x['sales'].sum() > 1000)

print("\nHigh Sales Groups:")
print(high_sales_groups.groupby(['region', 'category']).sum())

Using GroupBy.transform() with Hierarchical Groups

The transform() method can apply functions that return data with the same shape as the input, which is useful for operations like normalization within groups:

python
# Add percentage of region total
sales_df['region_pct'] = sales_df.groupby(['region', 'category'])['sales'].transform(
    lambda x: x / x.sum() * 100
)

print("\nSales with Percentage of Region-Category Total:")
print(sales_df[['region', 'category', 'sales', 'region_pct']].head())

Practical Tips for Hierarchical Groupby

Here are some useful tips when working with hierarchical groupby:

1. Managing the Level Order

You can change the order of levels in a hierarchical index:

python
# Swap index levels
result_swapped = result.swaplevel('region', 'product')
print("\nSwapped Levels:")
print(result_swapped)

2. Using get_group()

For direct access to a specific group:

python
# Get a specific group directly
east_apple_group = grouped.get_group(('East', 'Apple'))
print("\nEast Apple Group:")
print(east_apple_group)

3. Sorting the Index

For better readability:

python
# Sort index for clearer presentation
sorted_result = result.sort_index()
print("\nSorted Result:")
print(sorted_result)

Summary

Hierarchical groupby in pandas provides a powerful way to analyze multi-dimensional data:

Group data by multiple columns using df.groupby([col1, col2, ...])
The result has a multi-level index reflecting your grouping structure
Access specific groups using hierarchical indexing
Transform hierarchical results using unstack() for better visualization
Apply multiple aggregations to different columns
Filter, transform, and manipulate hierarchically grouped data

This technique is particularly valuable when analyzing complex datasets with multiple categories, dimensions, or time periods.

Exercises

Create a dataset with information about students, courses, and grades. Use hierarchical groupby to find the average grade by department and course.
Using a sales dataset, group by year, quarter, and product category to analyze seasonal trends.
Practice using unstack() at different levels to reshape hierarchical groupby results.
Experiment with different aggregation functions on a hierarchically grouped dataset.

Additional Resources

By mastering hierarchical groupby operations, you'll greatly enhance your ability to extract insights from complex, multi-dimensional datasets using pandas.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Hierarchical Groupby​

Basic Hierarchical Groupby Syntax​

Understanding the Hierarchical Index​

Aggregate Functions with Hierarchical Groupby​

Unstack for Better Visualization​

Real-World Example: Sales Data Analysis​

Filtering Groups in Hierarchical Groupby​

Using GroupBy.transform() with Hierarchical Groups​

Practical Tips for Hierarchical Groupby​

1. Managing the Level Order​

2. Using get_group()​

3. Sorting the Index​

Summary​

Exercises​

Additional Resources​

Introduction to Hierarchical Groupby

Basic Hierarchical Groupby Syntax

Understanding the Hierarchical Index

Aggregate Functions with Hierarchical Groupby

Unstack for Better Visualization

Real-World Example: Sales Data Analysis

Filtering Groups in Hierarchical Groupby

Using GroupBy.transform() with Hierarchical Groups

Practical Tips for Hierarchical Groupby

1. Managing the Level Order

2. Using get_group()

3. Sorting the Index

Summary

Exercises

Additional Resources