Pandas Custom Aggregations
Introduction
While pandas provides many built-in aggregation functions like mean(), sum(), and count(), there are often situations where you need to perform specialized calculations on your data. This is where custom aggregations come in: they allow you to define your own functions to aggregate data exactly how you need it.
Custom aggregations extend pandas' powerful data manipulation capabilities by enabling you to:
- Apply domain-specific logic to your aggregations
- Combine multiple operations in a single aggregation step
- Implement calculations not available in the standard pandas library
- Create reusable aggregation functions for your specific needs
In this tutorial, we'll explore how to create and apply custom aggregation functions in pandas using various techniques.
Basic Custom Aggregation
Using Named Functions
Let's start with a simple example. We'll create a custom function to calculate the range (difference between maximum and minimum values) of a dataset.
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
Category Value
0 A 10
1 A 15
2 A 20
3 B 25
4 B 30
5 B 35
6 C 40
7 C 45
8 C 50
Now let's define and apply a custom aggregation function:
# Define a custom aggregation function
def value_range(series):
return series.max() - series.min()
# Apply the custom aggregation
result = df.groupby('Category')['Value'].agg(value_range)
print("\nCustom Aggregation (Range):")
print(result)
Output:
Custom Aggregation (Range):
Category
A 10
B 10
C 10
Name: Value, dtype: int64
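The same custom function is not limited to groupby. Series.agg accepts any callable too, so you can apply it to a whole column at once — a small sketch reusing the DataFrame above:

```python
import pandas as pd

def value_range(series):
    return series.max() - series.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50]
})

# Series.agg applies the callable to the entire column, ignoring groups
overall = df['Value'].agg(value_range)
print(overall)  # 50 - 10 = 40
```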
Using Lambda Functions
For simple operations, you can use lambda functions directly:
# Using lambda function for the same calculation
result_lambda = df.groupby('Category')['Value'].agg(lambda x: x.max() - x.min())
print("\nCustom Aggregation using lambda function:")
print(result_lambda)
Output:
Custom Aggregation using lambda function:
Category
A 10
B 10
C 10
Name: Value, dtype: int64
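One caveat with lambdas: because they have no meaningful `__name__`, pandas auto-labels their result columns (with version-dependent labels such as `<lambda>` or `<lambda_0>`), which gets confusing when you pass several at once. A quick illustration (column names `g` and `v` are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'g': ['A', 'A', 'B'], 'v': [1, 3, 5]})

# Two unnamed lambdas: pandas generates placeholder column labels for both
out = df.groupby('g')['v'].agg([lambda x: x.max(), lambda x: x.min()])
print(out.columns.tolist())
```

Naming your aggregations, as shown in the next section, avoids these placeholder labels.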
Multiple Custom Aggregations
One of the powerful features of pandas is the ability to apply multiple aggregations at once, including custom ones:
# Define several custom aggregations
def range_percent(series):
"""Calculate range as a percentage of the minimum value"""
return ((series.max() - series.min()) / series.min()) * 100
# Apply multiple aggregations including custom ones
result_multiple = df.groupby('Category')['Value'].agg([
'mean', # Built-in function
'std', # Built-in function
('range', value_range), # Custom function with name
('range_pct', range_percent) # Another custom function with name
])
print("\nMultiple Aggregations:")
print(result_multiple)
Output:
Multiple Aggregations:
          mean  std  range  range_pct
Category
A         15.0  5.0     10      100.0
B         30.0  5.0     10       40.0
C         45.0  5.0     10       25.0
Custom Aggregations on Multiple Columns
You can apply different custom aggregations to different columns:
# Create a more complex DataFrame
complex_data = {
'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value1': [10, 15, 20, 25, 30, 35, 40, 45, 50],
'Value2': [5, 10, 15, 20, 25, 30, 35, 40, 45]
}
df_complex = pd.DataFrame(complex_data)
print("\nComplex DataFrame:")
print(df_complex)
# Define different aggregations for different columns
agg_dict = {
'Value1': ['mean', value_range, ('custom_range_pct', range_percent)],
'Value2': ['sum', 'median', lambda x: x.max() / x.min()]
}
result_complex = df_complex.groupby('Category').agg(agg_dict)
print("\nCustom aggregations on multiple columns:")
print(result_complex)
Output:
Complex DataFrame:
Category Value1 Value2
0 A 10 5
1 A 15 10
2 A 20 15
3 B 25 20
4 B 30 25
5 B 35 30
6 C 40 35
7 C 45 40
8 C 50 45
Custom aggregations on multiple columns:
Value1 Value2
              mean  value_range  custom_range_pct  sum  median  <lambda>
Category
A             15.0           10             100.0   30    10.0  3.000000
B             30.0           10              40.0   75    25.0  1.500000
C             45.0           10              25.0  120    40.0  1.285714
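A dict of aggregations like this produces a two-level column index, which can be awkward to work with downstream. One common trick is to flatten it by joining the levels — a minimal sketch (the frame here is a shortened version of the example above):

```python
import pandas as pd

def value_range(s):
    return s.max() - s.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 25, 35],
    'Value2': [5, 15, 20, 30],
})

result = df.groupby('Category').agg({
    'Value1': ['mean', value_range],
    'Value2': ['sum'],
})

# Join the two column levels into single underscore-separated names
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
# ['Value1_mean', 'Value1_value_range', 'Value2_sum']
```

Note that the custom function contributes its `__name__` ('value_range') to the flattened label.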
Named Aggregation with .agg()
In pandas versions 0.25 and later, there's a cleaner syntax for named aggregations:
# Modern named aggregation syntax
result_named = df.groupby('Category')['Value'].agg(
mean_val='mean',
std_val='std',
range_val=value_range,
range_pct=range_percent
)
print("\nNamed Aggregations:")
print(result_named)
Output:
Named Aggregations:
          mean_val  std_val  range_val  range_pct
Category
A             15.0      5.0         10      100.0
B             30.0      5.0         10       40.0
C             45.0      5.0         10       25.0
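Named aggregation also works across multiple columns at once when you aggregate the grouped frame itself, using pd.NamedAgg to say which column each result draws from. A hedged sketch (result names `v1_mean`, `v1_range`, `v2_total` are illustrative):

```python
import pandas as pd

def value_range(s):
    return s.max() - s.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 25, 35],
    'Value2': [5, 15, 20, 30],
})

# Each keyword becomes an output column; NamedAgg picks the input column
result = df.groupby('Category').agg(
    v1_mean=pd.NamedAgg(column='Value1', aggfunc='mean'),
    v1_range=pd.NamedAgg(column='Value1', aggfunc=value_range),
    v2_total=pd.NamedAgg(column='Value2', aggfunc='sum'),
)
print(result)
```

Plain `(column, aggfunc)` tuples work in place of pd.NamedAgg as well.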
Real-World Example: Sales Data Analysis
Let's look at a practical example using a sales dataset:
# Create a sample sales dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
sales_data = {
'Date': dates,
'Store': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Product': np.random.choice(['Widget', 'Gadget', 'Tool'], 100),
'Revenue': np.random.randint(1000, 5000, 100),
'Units': np.random.randint(50, 250, 100),
'Returns': np.random.randint(0, 10, 100)
}
sales_df = pd.DataFrame(sales_data)
print("Sample Sales Data:")
print(sales_df.head())
# Custom aggregation functions for business metrics
def net_units(series):
    """Total units sold. Note: agg passes a single column at a time,
    so this function cannot see the Returns column; cross-column
    metrics like true net units require apply() instead (shown below)."""
    return series.sum()
def return_rate(data):
"""Calculate return rate as percentage of sales"""
return (data['Returns'].sum() / data['Units'].sum()) * 100
def revenue_per_unit(data):
"""Calculate revenue per unit sold"""
return data['Revenue'].sum() / data['Units'].sum()
# Apply custom business metrics by store
store_metrics = sales_df.groupby('Store').agg({
'Revenue': ['sum', 'mean'],
'Units': ['sum', net_units],
'Returns': ['sum', 'mean']
})
print("\nStore Metrics:")
print(store_metrics)
# Calculate advanced metrics using apply
advanced_metrics = sales_df.groupby('Store').apply(
lambda x: pd.Series({
'Return_Rate': return_rate(x),
'Revenue_Per_Unit': revenue_per_unit(x),
'Profit_Margin': np.random.uniform(0.15, 0.35) # Placeholder for demonstration
})
)
print("\nAdvanced Business Metrics:")
print(advanced_metrics)
Example output (results will vary due to random generation):
Sample Sales Data:
Date Store Product Revenue Units Returns
0 2023-01-01 North Widget 3951 128 4
1 2023-01-02 West Widget 2254 71 0
2 2023-01-03 East Tool 1778 165 2
3 2023-01-04 West Gadget 3957 69 8
4 2023-01-05 East Gadget 1053 189 5
Store Metrics:
Revenue Units Returns
sum mean sum net_units sum mean
Store
East 65194 2607.760 3643 3643 122 4.880000
North 55764 2323.500 2480 2480 94 3.916667
South 58627 2567.304 2302 2302 93 4.043478
West 77417 2982.962 3575 3575 121 4.653846
Advanced Business Metrics:
Return_Rate Revenue_Per_Unit Profit_Margin
Store
East 3.349162 17.895965 0.248852
North 3.790323 22.485484 0.216469
South 4.039531 25.468289 0.317022
West 3.384615 21.655944 0.295400
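Because agg hands each function one column at a time, a true net-units figure (Units minus Returns) needs both columns together. Besides the apply() approach above, one hedged alternative is to do the row-wise arithmetic first and group the resulting Series — sketched here on a small made-up frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'Store': ['North', 'North', 'South', 'South'],
    'Units': [100, 120, 90, 110],
    'Returns': [5, 3, 2, 4],
})

# Subtract row-wise first, then group the derived Series by Store
net = (sales['Units'] - sales['Returns']).groupby(sales['Store']).sum()
print(net)  # net['North'] == 212, net['South'] == 194
```

This avoids apply() entirely and stays vectorized.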
Advanced Custom Aggregation with transform() and apply()
For more complex scenarios, you might need the transform() or apply() methods:
# Create a sample DataFrame for demoing transform
df_transform = pd.DataFrame({
'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
})
# Using transform to create percent of group total
df_transform['Percent_of_Group'] = df_transform.groupby('Group')['Value'].transform(
lambda x: (x / x.sum()) * 100
)
print("\nUsing transform for custom calculations:")
print(df_transform)
# Using apply for more complex operations
def group_metrics(group):
"""Calculate multiple metrics for each group"""
metrics = {
'min_value': group['Value'].min(),
'max_value': group['Value'].max(),
'range': group['Value'].max() - group['Value'].min(),
'mean': group['Value'].mean(),
'count': len(group)
}
return pd.Series(metrics)
complex_metrics = df_transform.groupby('Group').apply(group_metrics)
print("\nComplex group metrics using apply:")
print(complex_metrics)
Output:
Using transform for custom calculations:
Group Value Percent_of_Group
0 A 10 16.666667
1 A 20 33.333333
2 A 30 50.000000
3 B 40 26.666667
4 B 50 33.333333
5 B 60 40.000000
6 C 70 29.166667
7 C 80 33.333333
8 C 90 37.500000
Complex group metrics using apply:
       min_value  max_value  range  mean  count
Group
A           10.0       30.0   20.0  20.0    3.0
B           40.0       60.0   20.0  50.0    3.0
C           70.0       90.0   20.0  80.0    3.0
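Because transform() returns a result aligned to the original index, it can also serve directly as a row filter — for example, keeping only the rows above their own group's mean. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Value': [10, 30, 40, 60],
})

# The transformed mean shares df's index, so the comparison aligns row-by-row
above_mean = df[df['Value'] > df.groupby('Group')['Value'].transform('mean')]
print(above_mean['Value'].tolist())  # [30, 60]
```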
Summary
Custom aggregations in pandas give you the flexibility to perform specialized calculations on your data that go beyond the built-in aggregation functions. Here's what we've learned:
- You can create custom aggregation functions using named functions or lambda expressions
- Custom aggregations can be applied alongside built-in pandas aggregation functions
- Different aggregations can be applied to different columns in the same operation
- The modern named aggregation syntax provides a cleaner way to label your custom aggregations
- For complex operations, transform() and apply() provide additional flexibility
- Custom aggregations are essential for calculating business metrics and domain-specific insights
By mastering custom aggregations, you'll be able to extract exactly the insights you need from your data, tailored to your specific use case.
Exercises
To practice what you've learned, try these exercises:
- Create a custom aggregation function that calculates the coefficient of variation (standard deviation divided by mean) for each group
- Write a function to find the most common value (mode) in each group
- Create a custom function that returns the top 3 values in each group as a list
- Calculate the percentage difference between the first and last values in each group
- Create an aggregation that computes the z-score of each value within its group