
Pandas Multiple Aggregations

In data analysis, we often need to calculate multiple statistics on our data simultaneously. Pandas provides powerful capabilities for performing multiple aggregations in a single operation, saving time and creating more readable code.

Introduction to Multiple Aggregations

When analyzing data, calculating a single metric (like the mean or sum) is rarely sufficient. Real-world analysis typically requires understanding various aspects of your data through different statistical measures. Rather than running separate operations for each metric, pandas allows you to perform multiple aggregations in one go.

Multiple aggregations in pandas let you:

  • Apply different aggregation functions to different columns
  • Apply multiple aggregation functions to the same column
  • Create meaningful multi-level result structures
  • Significantly reduce the amount of code needed for complex analysis

Basic Multiple Aggregations

Let's start with a simple example using a sample DataFrame containing sales data:

python
import pandas as pd
import numpy as np

# Create sample sales data
data = {
    'date': pd.date_range('2023-01-01', periods=10),
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10]
}

df = pd.DataFrame(data)
df['revenue'] = df['quantity'] * df['price']

print(df)

This creates the following DataFrame:

        date product  quantity  price  revenue
0 2023-01-01       A         5     10       50
1 2023-01-02       B         7     12       84
2 2023-01-03       A         3     10       30
3 2023-01-04       C         6     15       90
4 2023-01-05       B         2     12       24
5 2023-01-06       A         8     10       80
6 2023-01-07       C         4     15       60
7 2023-01-08       B         9     12      108
8 2023-01-09       C         1     15       15
9 2023-01-10       A         3     10       30

Applying the same aggregation to multiple columns

To apply the same aggregation function to multiple columns, use the agg method:

python
# Calculate sum of quantity and revenue
result = df.agg({'quantity': 'sum', 'revenue': 'sum'})
print(result)

Output:

quantity     48
revenue     571
dtype: int64
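As an aside, when every column gets the same function, you can also select the columns and call the reducer directly. The sketch below rebuilds a minimal version of the sales frame so it runs standalone:

```python
import pandas as pd

# Minimal stand-in for the sales DataFrame built earlier
df = pd.DataFrame({
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10],
})
df['revenue'] = df['quantity'] * df['price']

# Selecting the columns first and calling sum() directly is equivalent
# to agg({'quantity': 'sum', 'revenue': 'sum'})
result = df[['quantity', 'revenue']].sum()
print(result)
```

Both spellings return the same Series; the dict form of agg only becomes necessary once different columns need different functions.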

Applying multiple aggregations to a single column

You can apply multiple aggregation functions to a single column:

python
# Calculate multiple statistics for the quantity column
result = df['quantity'].agg(['sum', 'mean', 'min', 'max', 'count'])
print(result)

Output:

sum      48.0
mean      4.8
min       1.0
max       9.0
count    10.0
dtype: float64
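The list form also works on a whole DataFrame, not just a single column: each function in the list is applied to every column, producing a small functions-by-columns table. A minimal sketch with the same numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10],
})

# Each function in the list becomes a row of the result; each original
# column stays a column
result = df.agg(['min', 'max', 'mean'])
print(result)
```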

Advanced Multiple Aggregations

GroupBy with multiple aggregations

One of the most powerful applications of multiple aggregations is combining them with GroupBy operations:

python
# Group by product and calculate multiple statistics
result = df.groupby('product').agg({
    'quantity': ['sum', 'mean', 'count'],
    'price': ['mean'],
    'revenue': ['sum', 'mean']
})

print(result)

Output:

        quantity                 price revenue      
             sum      mean count  mean     sum  mean
product                                             
A             19  4.750000     4  10.0     190  47.5
B             18  6.000000     3  12.0     216  72.0
C             11  3.666667     3  15.0     165  55.0

Flattening MultiIndex columns

The result of multiple aggregations is often a DataFrame with a hierarchical (multi-level) column index. We can flatten this for easier access:

python
# Flatten the multi-level columns
result = df.groupby('product').agg({
    'quantity': ['sum', 'mean'],
    'revenue': ['sum', 'mean']
})

# Flatten the hierarchical index
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print(result)

Output:

         quantity_sum  quantity_mean  revenue_sum  revenue_mean
product                                                        
A                  19       4.750000          190          47.5
B                  18       6.000000          216          72.0
C                  11       3.666667          165          55.0
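An alternative to joining result.columns.values by hand is columns.to_flat_index() (available since pandas 0.24), which converts the MultiIndex into plain tuples first. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

result = df.groupby('product').agg({'quantity': ['sum', 'mean']})

# to_flat_index() yields tuples like ('quantity', 'sum'), which we join
# into single underscore-separated labels
result.columns = ['_'.join(col) for col in result.columns.to_flat_index()]
print(result)
```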

Using named aggregations (pandas 0.25+)

For more readable code, pandas introduced named aggregations:

python
# Using named aggregations
result = df.groupby('product').agg(
    total_quantity=('quantity', 'sum'),
    avg_quantity=('quantity', 'mean'),
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean')
)

print(result)

Output:

         total_quantity  avg_quantity  total_revenue  avg_revenue
product                                                          
A                    19      4.750000            190         47.5
B                    18      6.000000            216         72.0
C                    11      3.666667            165         55.0
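The keyword=(column, function) tuples above are shorthand for pd.NamedAgg, which some codebases prefer for its explicit field names. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit spelling of the
# ('quantity', 'sum') tuple used in named aggregation
result = df.groupby('product').agg(
    total_quantity=pd.NamedAgg(column='quantity', aggfunc='sum'),
    max_quantity=pd.NamedAgg(column='quantity', aggfunc='max'),
)
print(result)
```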

Custom Aggregation Functions

You can also use custom functions for aggregation:

python
# Define custom aggregation functions
def range_diff(x):
    return x.max() - x.min()

def median_deviation(x):
    return (x - x.median()).abs().mean()

# Apply custom aggregations
result = df.groupby('product').agg({
    'quantity': ['mean', range_diff, median_deviation],
    'revenue': ['sum', 'mean']
})

print(result)

Output:

        quantity                                  revenue      
            mean range_diff median_deviation          sum  mean
product                                                        
A       4.750000          5         1.750000          190  47.5
B       6.000000          7         2.333333          216  72.0
C       3.666667          5         1.666667          165  55.0
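Custom functions combine with named aggregation as well, which sidesteps the auto-generated column labels (like '<lambda>' or the function's __name__) that the list form produces. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

# Named aggregation accepts any callable, so the output columns get the
# names we choose rather than a generated label
result = df.groupby('product').agg(
    qty_range=('quantity', lambda x: x.max() - x.min()),
    qty_sum=('quantity', 'sum'),
)
print(result)
```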

Real-World Application Example

Let's analyze a dataset of sales transactions to generate a comprehensive report:

python
# Create a more realistic sales dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100)
regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Food', 'Books']

sales_data = {
    'date': np.random.choice(dates, 1000),
    'region': np.random.choice(regions, 1000),
    'category': np.random.choice(categories, 1000),
    'units_sold': np.random.randint(1, 50, 1000),
    'unit_price': np.random.uniform(10, 100, 1000).round(2),
}

sales_df = pd.DataFrame(sales_data)
sales_df['revenue'] = sales_df['units_sold'] * sales_df['unit_price']

# Extract month and year for temporal analysis
sales_df['month'] = sales_df['date'].dt.month
sales_df['year'] = sales_df['date'].dt.year

# Generate comprehensive sales report
sales_report = sales_df.groupby(['region', 'category']).agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue_per_transaction=('revenue', 'mean'),
    total_units=('units_sold', 'sum'),
    avg_units_per_transaction=('units_sold', 'mean'),
    transaction_count=('units_sold', 'count'),
    avg_unit_price=('unit_price', 'mean')
).round(2)

# Show the first few rows of the report
print(sales_report.head())

# Find best-performing regions by total revenue
best_regions = sales_df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_transaction_value=('revenue', 'mean'),
    transaction_count=('revenue', 'count')
).sort_values('total_revenue', ascending=False)

print("\nRegions by Performance:")
print(best_regions)

# Monthly revenue trends
monthly_trends = sales_df.groupby(['year', 'month']).agg(
    total_revenue=('revenue', 'sum'),
    total_units=('units_sold', 'sum')
).sort_index()

print("\nMonthly Revenue Trends:")
print(monthly_trends)

This example demonstrates how multiple aggregations can help create a comprehensive business report with various metrics, all calculated efficiently in just a few lines of code.
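Because such a report is indexed by (region, category), reshaping it afterwards is straightforward; for instance, unstack() pivots one index level into columns. The sketch below uses a tiny hand-made frame (hypothetical numbers, not the random dataset above) so it runs on its own:

```python
import pandas as pd

# Tiny hand-made stand-in for the grouped sales report
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'category': ['Books', 'Food', 'Books', 'Food'],
    'revenue': [100.0, 200.0, 150.0, 250.0],
})

report = df.groupby(['region', 'category']).agg(total_revenue=('revenue', 'sum'))

# unstack() moves the 'category' index level into columns, producing a
# region-by-category revenue matrix
matrix = report['total_revenue'].unstack('category')
print(matrix)
```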

Performance Considerations

When working with large datasets, there are a few tips to optimize multiple aggregation operations:

  1. Specify required columns: Only include columns you actually need in your groupby operation
  2. Use efficient aggregation functions: Built-in functions like 'sum', 'mean', etc. are highly optimized
  3. Consider the as_index parameter: Setting as_index=False in groupby can be more efficient if you plan to reset the index later
  4. Avoid redundant calculations: Calculate derived metrics after aggregation when possible
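Tip 1 in practice looks like this: selecting the needed columns right after groupby, so pandas never aggregates the rest (a minimal sketch with made-up data):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'key': np.random.choice(['a', 'b', 'c'], 10_000),
    'value': np.random.rand(10_000),
    'unused': np.random.rand(10_000),  # large column we don't need
})

# Selecting ['value'] before agg() keeps 'unused' out of the
# aggregation entirely
result = df.groupby('key')[['value']].agg(['sum', 'mean'])
print(result)
```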

Summary

Multiple aggregations in pandas provide a powerful way to analyze data from different angles simultaneously. Key points to remember:

  • Use the agg() method to perform multiple aggregations
  • You can apply different aggregations to different columns
  • Custom aggregation functions can extend the built-in capabilities
  • Named aggregations provide cleaner and more readable code
  • GroupBy combined with multiple aggregations is especially useful for generating reports and summaries

With these techniques, you can write more concise, expressive, and efficient data analysis code.


Exercises

  1. Create a DataFrame with employee data (name, department, salary, years of experience) and use multiple aggregations to find the average salary and experience by department.

  2. Using the sales dataset from our examples, find the category with the highest average revenue per transaction in each region.

  3. Implement a custom aggregation function that calculates the coefficient of variation (standard deviation / mean) and use it alongside standard aggregations.

  4. Create a time series dataset and use multiple aggregations to analyze patterns by different time periods (day, week, month).


