Pandas Custom Aggregations
Introduction
While pandas provides many built-in aggregation functions like mean(), sum(), and count(), there are often situations where you need to perform specialized calculations on your data. This is where custom aggregations come in: they allow you to define your own functions to aggregate data exactly how you need it.
Custom aggregations extend pandas' powerful data manipulation capabilities by enabling you to:
- Apply domain-specific logic to your aggregations
- Combine multiple operations in a single aggregation step
- Implement calculations not available in the standard pandas library
- Create reusable aggregation functions for your specific needs
In this tutorial, we'll explore how to create and apply custom aggregation functions in pandas using various techniques.
Basic Custom Aggregation
Using Named Functions
Let's start with a simple example. We'll create a custom function to calculate the range (difference between maximum and minimum values) of a dataset.
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
Category Value
0 A 10
1 A 15
2 A 20
3 B 25
4 B 30
5 B 35
6 C 40
7 C 45
8 C 50
Now let's define and apply a custom aggregation function:
# Define a custom aggregation function
def value_range(series):
return series.max() - series.min()
# Apply the custom aggregation
result = df.groupby('Category')['Value'].agg(value_range)
print("\nCustom Aggregation (Range):")
print(result)
Output:
Custom Aggregation (Range):
Category
A 10
B 10
C 10
Name: Value, dtype: int64
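The same custom function is not limited to groupby. Series.agg accepts any callable too, so you can apply it to a whole column at once — a small sketch reusing the DataFrame above:

```python
import pandas as pd

def value_range(series):
    return series.max() - series.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50]
})

# Series.agg applies the callable to the entire column, ignoring groups
overall = df['Value'].agg(value_range)
print(overall)  # 50 - 10 = 40
```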
Using Lambda Functions
For simple operations, you can use lambda functions directly:
# Using lambda function for the same calculation
result_lambda = df.groupby('Category')['Value'].agg(lambda x: x.max() - x.min())
print("\nCustom Aggregation using lambda function:")
print(result_lambda)
Output:
Custom Aggregation using lambda function:
Category
A 10
B 10
C 10
Name: Value, dtype: int64
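One caveat with lambdas: because they have no meaningful `__name__`, pandas auto-labels their result columns (with version-dependent labels such as `<lambda>` or `<lambda_0>`), which gets confusing when you pass several at once. A quick illustration (column names `g` and `v` are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'g': ['A', 'A', 'B'], 'v': [1, 3, 5]})

# Two unnamed lambdas: pandas generates placeholder column labels for both
out = df.groupby('g')['v'].agg([lambda x: x.max(), lambda x: x.min()])
print(out.columns.tolist())
```

Naming your aggregations, as shown in the next section, avoids these placeholder labels.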
Multiple Custom Aggregations
One of the powerful features of pandas is the ability to apply multiple aggregations at once, including custom ones:
# Define several custom aggregations
def range_percent(series):
"""Calculate range as a percentage of the minimum value"""
return ((series.max() - series.min()) / series.min()) * 100
# Apply multiple aggregations including custom ones
result_multiple = df.groupby('Category')['Value'].agg([
'mean', # Built-in function
'std', # Built-in function
('range', value_range), # Custom function with name
('range_pct', range_percent) # Another custom function with name
])
print("\nMultiple Aggregations:")
print(result_multiple)
Output:
Multiple Aggregations:
          mean  std  range  range_pct
Category
A         15.0  5.0     10      100.0
B         30.0  5.0     10       40.0
C         45.0  5.0     10       25.0
Custom Aggregations on Multiple Columns
You can apply different custom aggregations to different columns:
# Create a more complex DataFrame
complex_data = {
'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value1': [10, 15, 20, 25, 30, 35, 40, 45, 50],
'Value2': [5, 10, 15, 20, 25, 30, 35, 40, 45]
}
df_complex = pd.DataFrame(complex_data)
print("\nComplex DataFrame:")
print(df_complex)
# Define different aggregations for different columns
agg_dict = {
'Value1': ['mean', value_range, ('custom_range_pct', range_percent)],
'Value2': ['sum', 'median', lambda x: x.max() / x.min()]
}
result_complex = df_complex.groupby('Category').agg(agg_dict)
print("\nCustom aggregations on multiple columns:")
print(result_complex)
Output:
Complex DataFrame:
Category Value1 Value2
0 A 10 5
1 A 15 10
2 A 20 15
3 B 25 20
4 B 30 25
5 B 35 30
6 C 40 35
7 C 45 40
8 C 50 45
Custom aggregations on multiple columns:
Value1 Value2
              mean  value_range  custom_range_pct  sum  median  <lambda>
Category
A             15.0           10             100.0   30    10.0  3.000000
B             30.0           10              40.0   75    25.0  1.500000
C             45.0           10              25.0  120    40.0  1.285714
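A dict of aggregations like this produces a two-level column index, which can be awkward to work with downstream. One common trick is to flatten it by joining the levels — a minimal sketch (the frame here is a shortened version of the example above):

```python
import pandas as pd

def value_range(s):
    return s.max() - s.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 25, 35],
    'Value2': [5, 15, 20, 30],
})

result = df.groupby('Category').agg({
    'Value1': ['mean', value_range],
    'Value2': ['sum'],
})

# Join the two column levels into single underscore-separated names
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
# ['Value1_mean', 'Value1_value_range', 'Value2_sum']
```

Note that the custom function contributes its `__name__` ('value_range') to the flattened label.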
Named Aggregation with .agg()
In pandas versions 0.25 and later, there's a cleaner syntax for named aggregations:
# Modern named aggregation syntax
result_named = df.groupby('Category')['Value'].agg(
mean_val='mean',
std_val='std',
range_val=value_range,
range_pct=range_percent
)
print("\nNamed Aggregations:")
print(result_named)
Output:
Named Aggregations:
          mean_val  std_val  range_val  range_pct
Category
A             15.0      5.0         10      100.0
B             30.0      5.0         10       40.0
C             45.0      5.0         10       25.0
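Named aggregation also works across multiple columns at once when you aggregate the grouped frame itself, using pd.NamedAgg to say which column each result draws from. A hedged sketch (result names `v1_mean`, `v1_range`, `v2_total` are illustrative):

```python
import pandas as pd

def value_range(s):
    return s.max() - s.min()

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 25, 35],
    'Value2': [5, 15, 20, 30],
})

# Each keyword becomes an output column; NamedAgg picks the input column
result = df.groupby('Category').agg(
    v1_mean=pd.NamedAgg(column='Value1', aggfunc='mean'),
    v1_range=pd.NamedAgg(column='Value1', aggfunc=value_range),
    v2_total=pd.NamedAgg(column='Value2', aggfunc='sum'),
)
print(result)
```

Plain `(column, aggfunc)` tuples work in place of pd.NamedAgg as well.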
Real-World Example: Sales Data Analysis
Let's look at a practical example using a sales dataset:
# Create a sample sales dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
sales_data = {
'Date': dates,
'Store': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Product': np.random.choice(['Widget', 'Gadget', 'Tool'], 100),
'Revenue': np.random.randint(1000, 5000, 100),
'Units': np.random.randint(50, 250, 100),
'Returns': np.random.randint(0, 10, 100)
}
sales_df = pd.DataFrame(sales_data)
print("Sample Sales Data:")
print(sales_df.head())
# Custom aggregation functions for business metrics
def net_units(series):
    """Total units sold. Note: agg passes a single column at a time,
    so this function cannot see the Returns column; cross-column
    metrics like true net units require apply() instead (shown below)."""
    return series.sum()
def return_rate(data):
"""Calculate return rate as percentage of sales"""
return (data['Returns'].sum() / data['Units'].sum()) * 100
def revenue_per_unit(data):
"""Calculate revenue per unit sold"""
return data['Revenue'].sum() / data['Units'].sum()
# Apply custom business metrics by store
store_metrics = sales_df.groupby('Store').agg({
'Revenue': ['sum', 'mean'],
'Units': ['sum', net_units],
'Returns': ['sum', 'mean']
})
print("\nStore Metrics:")
print(store_metrics)
# Calculate advanced metrics using apply
advanced_metrics = sales_df.groupby('Store').apply(
lambda x: pd.Series({
'Return_Rate': return_rate(x),
'Revenue_Per_Unit': revenue_per_unit(x),
'Profit_Margin': np.random.uniform(0.15, 0.35) # Placeholder for demonstration
})
)
print("\nAdvanced Business Metrics:")
print(advanced_metrics)
Example output (results will vary due to random generation):
Sample Sales Data:
Date Store Product Revenue Units Returns
0 2023-01-01 North Widget 3951 128 4
1 2023-01-02 West Widget 2254 71 0
2 2023-01-03 East Tool 1778 165 2
3 2023-01-04 West Gadget 3957 69 8
4 2023-01-05 East Gadget 1053 189 5
Store Metrics:
Revenue Units Returns
sum mean sum net_units sum mean
Store
East 65194 2607.760 3643 3643 122 4.880000
North 55764 2323.500 2480 2480 94 3.916667
South 58627 2567.304 2302 2302 93 4.043478
West 77417 2982.962 3575 3575 121 4.653846
Advanced Business Metrics:
Return_Rate Revenue_Per_Unit Profit_Margin
Store
East 3.349162 17.895965 0.248852
North 3.790323 22.485484 0.216469
South 4.039531 25.468289 0.317022
West 3.384615 21.655944 0.295400
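Because agg hands each function one column at a time, a true net-units figure (Units minus Returns) needs both columns together. Besides the apply() approach above, one hedged alternative is to do the row-wise arithmetic first and group the resulting Series — sketched here on a small made-up frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'Store': ['North', 'North', 'South', 'South'],
    'Units': [100, 120, 90, 110],
    'Returns': [5, 3, 2, 4],
})

# Subtract row-wise first, then group the derived Series by Store
net = (sales['Units'] - sales['Returns']).groupby(sales['Store']).sum()
print(net)  # net['North'] == 212, net['South'] == 194
```

This avoids apply() entirely and stays vectorized.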
Advanced Custom Aggregation with transform() and apply()
For more complex scenarios, you might need the transform() or apply() methods:
# Create a sample DataFrame for demoing transform
df_transform = pd.DataFrame({
'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
})
# Using transform to create percent of group total
df_transform['Percent_of_Group'] = df_transform.groupby('Group')['Value'].transform(
lambda x: (x / x.sum()) * 100
)
print("\nUsing transform for custom calculations:")
print(df_transform)
# Using apply for more complex operations
def group_metrics(group):
"""Calculate multiple metrics for each group"""
metrics = {
'min_value': group['Value'].min(),
'max_value': group['Value'].max(),
'range': group['Value'].max() - group['Value'].min(),
'mean': group['Value'].mean(),
'count': len(group)
}
return pd.Series(metrics)
complex_metrics = df_transform.groupby('Group').apply(group_metrics)
print("\nComplex group metrics using apply:")
print(complex_metrics)
Output:
Using transform for custom calculations:
Group Value Percent_of_Group
0 A 10 16.666667
1 A 20 33.333333
2 A 30 50.000000
3 B 40 26.666667
4 B 50 33.333333
5 B 60 40.000000
6 C 70 29.166667
7 C 80 33.333333
8 C 90 37.500000
Complex group metrics using apply:
       min_value  max_value  range  mean  count
Group
A           10.0       30.0   20.0  20.0    3.0
B           40.0       60.0   20.0  50.0    3.0
C           70.0       90.0   20.0  80.0    3.0
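Because transform() returns a result aligned to the original index, it can also serve directly as a row filter — for example, keeping only the rows above their own group's mean. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Value': [10, 30, 40, 60],
})

# The transformed mean shares df's index, so the comparison aligns row-by-row
above_mean = df[df['Value'] > df.groupby('Group')['Value'].transform('mean')]
print(above_mean['Value'].tolist())  # [30, 60]
```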
Summary
Custom aggregations in pandas give you the flexibility to perform specialized calculations on your data that go beyond the built-in aggregation functions. Here's what we've learned:
- You can create custom aggregation functions using named functions or lambda expressions
- Custom aggregations can be applied alongside built-in pandas aggregation functions
- Different aggregations can be applied to different columns in the same operation
- The modern named aggregation syntax provides a cleaner way to label your custom aggregations
- For complex operations, transform() and apply() provide additional flexibility
- Custom aggregations are essential for calculating business metrics and domain-specific insights
By mastering custom aggregations, you'll be able to extract exactly the insights you need from your data, tailored to your specific use case.
Exercises
To practice what you've learned, try these exercises:
- Create a custom aggregation function that calculates the coefficient of variation (standard deviation divided by mean) for each group
- Write a function to find the most common value (mode) in each group
- Create a custom function that returns the top 3 values in each group as a list
- Calculate the percentage difference between the first and last values in each group
- Create an aggregation that computes the z-score of each value within its group