
Pandas Multiple Aggregations

In data analysis, we often need to calculate multiple statistics on our data simultaneously. Pandas provides powerful capabilities for performing multiple aggregations in a single operation, saving time and creating more readable code.

Introduction to Multiple Aggregations

When analyzing data, calculating a single metric (like the mean or sum) is rarely sufficient. Real-world analysis typically requires understanding various aspects of your data through different statistical measures. Rather than running separate operations for each metric, pandas allows you to perform multiple aggregations in one go.

Multiple aggregations in pandas let you:

  • Apply different aggregation functions to different columns
  • Apply multiple aggregation functions to the same column
  • Create meaningful multi-level result structures
  • Significantly reduce the amount of code needed for complex analysis

Basic Multiple Aggregations

Let's start with a simple example using a sample DataFrame containing sales data:

python
import pandas as pd
import numpy as np

# Create sample sales data
data = {
    'date': pd.date_range('2023-01-01', periods=10),
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10]
}

df = pd.DataFrame(data)
df['revenue'] = df['quantity'] * df['price']

print(df)

This creates the following DataFrame:

        date product  quantity  price  revenue
0 2023-01-01       A         5     10       50
1 2023-01-02       B         7     12       84
2 2023-01-03       A         3     10       30
3 2023-01-04       C         6     15       90
4 2023-01-05       B         2     12       24
5 2023-01-06       A         8     10       80
6 2023-01-07       C         4     15       60
7 2023-01-08       B         9     12      108
8 2023-01-09       C         1     15       15
9 2023-01-10       A         3     10       30

Applying the same aggregation to multiple columns

To apply the same aggregation function to multiple columns, use the agg method:

python
# Calculate sum of quantity and revenue
result = df.agg({'quantity': 'sum', 'revenue': 'sum'})
print(result)

Output:

quantity     48
revenue     571
dtype: int64
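As an aside, when every column gets the same function, you can also select the columns and call the reducer directly. The sketch below rebuilds a minimal version of the sales frame so it runs standalone:

```python
import pandas as pd

# Minimal stand-in for the sales DataFrame built earlier
df = pd.DataFrame({
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10],
})
df['revenue'] = df['quantity'] * df['price']

# Selecting the columns first and calling sum() directly is equivalent
# to agg({'quantity': 'sum', 'revenue': 'sum'})
result = df[['quantity', 'revenue']].sum()
print(result)
```

Both spellings return the same Series; the dict form of agg only becomes necessary once different columns need different functions.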

Applying multiple aggregations to a single column

You can apply multiple aggregation functions to a single column:

python
# Calculate multiple statistics for the quantity column
result = df['quantity'].agg(['sum', 'mean', 'min', 'max', 'count'])
print(result)

Output:

sum      48.0
mean      4.8
min       1.0
max       9.0
count    10.0
dtype: float64
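The list form also works on a whole DataFrame, not just a single column: each function in the list is applied to every column, producing a small functions-by-columns table. A minimal sketch with the same numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
    'price': [10, 12, 10, 15, 12, 10, 15, 12, 15, 10],
})

# Each function in the list becomes a row of the result; each original
# column stays a column
result = df.agg(['min', 'max', 'mean'])
print(result)
```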

Advanced Multiple Aggregations

GroupBy with multiple aggregations

One of the most powerful applications of multiple aggregations is combining them with GroupBy operations:

python
# Group by product and calculate multiple statistics
result = df.groupby('product').agg({
    'quantity': ['sum', 'mean', 'count'],
    'price': ['mean'],
    'revenue': ['sum', 'mean']
})

print(result)

Output:

        quantity                 price revenue      
             sum      mean count  mean     sum  mean
product                                             
A             19  4.750000     4  10.0     190  47.5
B             18  6.000000     3  12.0     216  72.0
C             11  3.666667     3  15.0     165  55.0

Flattening MultiIndex columns

The result of multiple aggregations is often a DataFrame with a hierarchical (multi-level) column index. We can flatten this for easier access:

python
# Flatten the multi-level columns
result = df.groupby('product').agg({
    'quantity': ['sum', 'mean'],
    'revenue': ['sum', 'mean']
})

# Flatten the hierarchical index
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print(result)

Output:

         quantity_sum  quantity_mean  revenue_sum  revenue_mean
product                                                        
A                  19       4.750000          190          47.5
B                  18       6.000000          216          72.0
C                  11       3.666667          165          55.0
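An alternative to joining result.columns.values by hand is columns.to_flat_index() (available since pandas 0.24), which converts the MultiIndex into plain tuples first. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

result = df.groupby('product').agg({'quantity': ['sum', 'mean']})

# to_flat_index() yields tuples like ('quantity', 'sum'), which we join
# into single underscore-separated labels
result.columns = ['_'.join(col) for col in result.columns.to_flat_index()]
print(result)
```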

Using named aggregations (pandas 0.25+)

For more readable code, pandas introduced named aggregations:

python
# Using named aggregations
result = df.groupby('product').agg(
    total_quantity=('quantity', 'sum'),
    avg_quantity=('quantity', 'mean'),
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean')
)

print(result)

Output:

         total_quantity  avg_quantity  total_revenue  avg_revenue
product                                                          
A                    19      4.750000            190         47.5
B                    18      6.000000            216         72.0
C                    11      3.666667            165         55.0
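The keyword=(column, function) tuples above are shorthand for pd.NamedAgg, which some codebases prefer for its explicit field names. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit spelling of the
# ('quantity', 'sum') tuple used in named aggregation
result = df.groupby('product').agg(
    total_quantity=pd.NamedAgg(column='quantity', aggfunc='sum'),
    max_quantity=pd.NamedAgg(column='quantity', aggfunc='max'),
)
print(result)
```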

Custom Aggregation Functions

You can also use custom functions for aggregation:

python
# Define custom aggregation functions
def range_diff(x):
    return x.max() - x.min()

def median_deviation(x):
    return (x - x.median()).abs().mean()

# Apply custom aggregations
result = df.groupby('product').agg({
    'quantity': ['mean', range_diff, median_deviation],
    'revenue': ['sum', 'mean']
})

print(result)

Output:

        quantity                                  revenue      
            mean range_diff median_deviation          sum  mean
product                                                        
A       4.750000          5         1.750000          190  47.5
B       6.000000          7         2.333333          216  72.0
C       3.666667          5         1.666667          165  55.0
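Custom functions combine with named aggregation as well, which sidesteps the auto-generated column labels (like '<lambda>' or the function's __name__) that the list form produces. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'quantity': [5, 7, 3, 6, 2, 8, 4, 9, 1, 3],
})

# Named aggregation accepts any callable, so the output columns get the
# names we choose rather than a generated label
result = df.groupby('product').agg(
    qty_range=('quantity', lambda x: x.max() - x.min()),
    qty_sum=('quantity', 'sum'),
)
print(result)
```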

Real-World Application Example

Let's analyze a dataset of sales transactions to generate a comprehensive report:

python
# Create a more realistic sales dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100)
regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Food', 'Books']

sales_data = {
    'date': np.random.choice(dates, 1000),
    'region': np.random.choice(regions, 1000),
    'category': np.random.choice(categories, 1000),
    'units_sold': np.random.randint(1, 50, 1000),
    'unit_price': np.random.uniform(10, 100, 1000).round(2),
}

sales_df = pd.DataFrame(sales_data)
sales_df['revenue'] = sales_df['units_sold'] * sales_df['unit_price']

# Extract month and year for temporal analysis
sales_df['month'] = sales_df['date'].dt.month
sales_df['year'] = sales_df['date'].dt.year

# Generate comprehensive sales report
sales_report = sales_df.groupby(['region', 'category']).agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue_per_transaction=('revenue', 'mean'),
    total_units=('units_sold', 'sum'),
    avg_units_per_transaction=('units_sold', 'mean'),
    transaction_count=('units_sold', 'count'),
    avg_unit_price=('unit_price', 'mean')
).round(2)

# Show the first few rows of the report
print(sales_report.head())

# Find best-performing regions by total revenue
best_regions = sales_df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_transaction_value=('revenue', 'mean'),
    transaction_count=('revenue', 'count')
).sort_values('total_revenue', ascending=False)

print("\nRegions by Performance:")
print(best_regions)

# Monthly revenue trends
monthly_trends = sales_df.groupby(['year', 'month']).agg(
    total_revenue=('revenue', 'sum'),
    total_units=('units_sold', 'sum')
).sort_index()

print("\nMonthly Revenue Trends:")
print(monthly_trends)

This example demonstrates how multiple aggregations can help create a comprehensive business report with various metrics, all calculated efficiently in just a few lines of code.
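Because such a report is indexed by (region, category), reshaping it afterwards is straightforward; for instance, unstack() pivots one index level into columns. The sketch below uses a tiny hand-made frame (hypothetical numbers, not the random dataset above) so it runs on its own:

```python
import pandas as pd

# Tiny hand-made stand-in for the grouped sales report
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'category': ['Books', 'Food', 'Books', 'Food'],
    'revenue': [100.0, 200.0, 150.0, 250.0],
})

report = df.groupby(['region', 'category']).agg(total_revenue=('revenue', 'sum'))

# unstack() moves the 'category' index level into columns, producing a
# region-by-category revenue matrix
matrix = report['total_revenue'].unstack('category')
print(matrix)
```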

Performance Considerations

When working with large datasets, there are a few tips to optimize multiple aggregation operations:

  1. Specify required columns: Only include columns you actually need in your groupby operation
  2. Use efficient aggregation functions: Built-in functions like 'sum', 'mean', etc. are highly optimized
  3. Consider the as_index parameter: Setting as_index=False in groupby can be more efficient if you plan to reset the index later
  4. Avoid redundant calculations: Calculate derived metrics after aggregation when possible
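Tip 1 in practice looks like this: selecting the needed columns right after groupby, so pandas never aggregates the rest (a minimal sketch with made-up data):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'key': np.random.choice(['a', 'b', 'c'], 10_000),
    'value': np.random.rand(10_000),
    'unused': np.random.rand(10_000),  # large column we don't need
})

# Selecting ['value'] before agg() keeps 'unused' out of the
# aggregation entirely
result = df.groupby('key')[['value']].agg(['sum', 'mean'])
print(result)
```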

Summary

Multiple aggregations in pandas provide a powerful way to analyze data from different angles simultaneously. Key points to remember:

  • Use the agg() method to perform multiple aggregations
  • You can apply different aggregations to different columns
  • Custom aggregation functions can extend the built-in capabilities
  • Named aggregations provide cleaner and more readable code
  • GroupBy combined with multiple aggregations is especially useful for generating reports and summaries

With these techniques, you can write more concise, expressive, and efficient data analysis code.


Exercises

  1. Create a DataFrame with employee data (name, department, salary, years of experience) and use multiple aggregations to find the average salary and experience by department.

  2. Using the sales dataset from our examples, find the category with the highest average revenue per transaction in each region.

  3. Implement a custom aggregation function that calculates the coefficient of variation (standard deviation / mean) and use it alongside standard aggregations.

  4. Create a time series dataset and use multiple aggregations to analyze patterns by different time periods (day, week, month).


