Pandas Aggregation Functions
In data analysis, we often need to summarize large datasets to extract meaningful insights. Pandas provides a rich set of aggregation functions that allow us to calculate statistics across groups of data. These functions are essential tools for data analysts and scientists who need to condense information and identify patterns.
What are Aggregation Functions?
Aggregation functions are operations that transform multiple values into a single summary value, such as the sum, average, maximum, or minimum of a series of numbers. In Pandas, these functions can be applied to DataFrames and Series to produce concise statistical summaries.
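As a quick illustration of the idea, here is a minimal sketch on a plain Series (the sample numbers are arbitrary):
import pandas as pd
numbers = pd.Series([4, 8, 15, 16, 23, 42])
print(numbers.sum())   # 108 -- six values reduced to a single total
print(numbers.mean())  # 18.0 -- the same six values reduced to a single average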
Basic Aggregation Functions
Pandas offers several built-in aggregation functions that can be applied directly to Series and DataFrame objects.
Common Aggregation Functions
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 25, 15, 30, 20, 35],
    'Scores': [88, 92, 75, 87, 91, 84]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
  Category  Values  Scores
0        A      10      88
1        B      25      92
2        A      15      75
3        B      30      87
4        A      20      91
5        B      35      84
Now let's explore some basic aggregation functions:
# Sum of all values
print("\nSum of Values:")
print(df['Values'].sum())
# Mean (average) of all values
print("\nMean of Values:")
print(df['Values'].mean())
# Minimum value
print("\nMinimum Value:")
print(df['Values'].min())
# Maximum value
print("\nMaximum Value:")
print(df['Values'].max())
# Count of values
print("\nCount of Values:")
print(df['Values'].count())
# Standard deviation
print("\nStandard Deviation of Values:")
print(df['Values'].std())
# Multiple aggregations at once
print("\nMultiple aggregations:")
print(df['Values'].agg(['sum', 'mean', 'min', 'max', 'std']))
Output:
Sum of Values:
135
Mean of Values:
22.5
Minimum Value:
10
Maximum Value:
35
Count of Values:
6
Standard Deviation of Values:
9.354143466934852
Multiple aggregations:
sum     135.000000
mean     22.500000
min      10.000000
max      35.000000
std       9.354143
Name: Values, dtype: float64
Applying Aggregations on Multiple Columns
You can apply aggregation functions to multiple columns at once:
# Aggregate operations on all numeric columns
print("\nAggregate statistics for all numeric columns:")
print(df.describe())
# Specific aggregations on multiple columns
print("\nSpecific aggregations on Values and Scores:")
print(df[['Values', 'Scores']].agg(['sum', 'mean', 'max']))
Output:
Aggregate statistics for all numeric columns:
          Values     Scores
count   6.000000   6.000000
mean   22.500000  86.166667
std     9.354143   6.177918
min    10.000000  75.000000
25%    16.250000  84.750000
50%    22.500000  87.500000
75%    28.750000  90.250000
max    35.000000  92.000000
Specific aggregations on Values and Scores:
       Values      Scores
sum     135.0  517.000000
mean     22.5   86.166667
max      35.0   92.000000
Custom Aggregation Functions
Pandas allows you to create and apply custom aggregation functions:
# Define a custom function to calculate the range (max - min)
def value_range(x):
    return x.max() - x.min()
# Apply the custom function
print("\nCustom aggregation (range):")
print(df['Values'].agg(value_range))
# Using a lambda function
print("\nUsing lambda for custom aggregation:")
print(df['Values'].agg(lambda x: x.max() - x.min()))
# Multiple custom aggregations
print("\nMultiple custom aggregations:")
print(df['Values'].agg([
    ('Range', lambda x: x.max() - x.min()),
    ('Median', np.median),
    ('Sum of Squares', lambda x: np.sum(x**2))
]))
Output:
Custom aggregation (range):
25
Using lambda for custom aggregation:
25
Multiple custom aggregations:
Range            25.0
Median           22.5
Sum of Squares  3475.0
Name: Values, dtype: float64
GroupBy with Aggregation
One of the most powerful features of Pandas is the ability to group data and apply aggregations to each group:
# Group by Category and calculate statistics
print("\nGroup by Category with aggregation:")
print(df.groupby('Category')['Values'].agg(['sum', 'mean', 'min', 'max']))
# Multiple aggregations with different functions for different columns
print("\nDifferent aggregations for different columns:")
aggregations = {
    'Values': ['sum', 'mean', 'max'],
    'Scores': ['mean', 'min', 'max']
}
print(df.groupby('Category').agg(aggregations))
Output:
Group by Category with aggregation:
         sum  mean  min  max
Category                     
A         45  15.0   10   20
B         90  30.0   25   35
Different aggregations for different columns:
         Values            Scores        
            sum  mean  max       mean min max
Category                                     
A            45  15.0   20  84.666667  75  91
B            90  30.0   35  87.666667  84  92
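Note that passing a dictionary of lists like this produces a MultiIndex on the result's columns. If flat column names are easier to work with downstream, one possible approach (a sketch; the grouped variable name is just for illustration) is to join the two levels:
# Re-run the dict-of-lists aggregation and flatten the MultiIndex columns
grouped = df.groupby('Category').agg(aggregations)
grouped.columns = ['_'.join(col) for col in grouped.columns]
print(grouped)  # columns become Values_sum, Values_mean, Values_max, Scores_mean, ...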
Named Aggregations
Pandas 0.25+ introduced named aggregations for more readable code:
# Named aggregations (pandas 0.25+)
print("\nNamed aggregations:")
result = df.groupby('Category').agg(
    total_values=('Values', 'sum'),
    avg_values=('Values', 'mean'),
    avg_scores=('Scores', 'mean'),
    max_scores=('Scores', 'max')
)
print(result)
Output:
Named aggregations:
          total_values  avg_values  avg_scores  max_scores
Category                                                  
A                   45        15.0   84.666667          91
B                   90        30.0   87.666667          92
Real-world Application: Sales Data Analysis
Let's see how aggregation functions can be used in a real-world scenario with sales data:
# Create a sales dataset
sales_data = {
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'B'],
    'Region': ['East', 'East', 'West', 'East', 'West', 'West', 'East', 'West', 'East', 'West'],
    'Units': [10, 15, 12, 8, 20, 25, 5, 18, 12, 22],
    'Price': [100, 150, 100, 200, 150, 100, 200, 150, 100, 150],
}
sales_df = pd.DataFrame(sales_data)
# Calculate revenue for each transaction
sales_df['Revenue'] = sales_df['Units'] * sales_df['Price']
print("Sales Data:")
print(sales_df)
# Analyze sales data by product
print("\nSales analysis by product:")
product_analysis = sales_df.groupby('Product').agg(
    total_units=('Units', 'sum'),
    total_revenue=('Revenue', 'sum'),
    avg_price=('Price', 'mean'),
    transactions=('Units', 'count')
)
print(product_analysis)
# Analyze sales data by region and product
print("\nSales analysis by region and product:")
region_product_analysis = sales_df.groupby(['Region', 'Product']).agg(
    total_revenue=('Revenue', 'sum'),
    avg_units=('Units', 'mean')
).sort_values('total_revenue', ascending=False)
print(region_product_analysis)
# Monthly sales trend
sales_df['Month'] = sales_df['Date'].dt.month_name()
print("\nMonthly sales trend:")
monthly_sales = sales_df.groupby('Month').agg(
    total_revenue=('Revenue', 'sum'),
    avg_revenue=('Revenue', 'mean')
)
print(monthly_sales)
Output:
Sales Data:
        Date Product Region  Units  Price  Revenue
0 2023-01-01       A   East     10    100     1000
1 2023-01-02       B   East     15    150     2250
2 2023-01-03       A   West     12    100     1200
3 2023-01-04       C   East      8    200     1600
4 2023-01-05       B   West     20    150     3000
5 2023-01-06       A   West     25    100     2500
6 2023-01-07       C   East      5    200     1000
7 2023-01-08       B   West     18    150     2700
8 2023-01-09       A   East     12    100     1200
9 2023-01-10       B   West     22    150     3300
Sales analysis by product:
         total_units  total_revenue  avg_price  transactions
Product                                                     
A                 59           5900      100.0             4
B                 75          11250      150.0             4
C                 13           2600      200.0             2
Sales analysis by region and product:
                total_revenue  avg_units
Region Product                          
West   B                 9000       20.0
       A                 3700       18.5
East   C                 2600        6.5
       B                 2250       15.0
       A                 2200       11.0
Monthly sales trend:
        total_revenue  avg_revenue
Month                            
January         19750      1975.0
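Because all ten dates fall in January, the monthly view has a single row here, and grouping by month names orders the index alphabetically rather than chronologically. On a longer date range, one option (a sketch, not the only way) is to group by the calendar period of the Date column instead:
# Group by the calendar month of the Date column; periods keep chronological order
monthly = sales_df.groupby(sales_df['Date'].dt.to_period('M')).agg(
    total_revenue=('Revenue', 'sum'),
    avg_revenue=('Revenue', 'mean')
)
print(monthly)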
Advanced Aggregation Techniques
Using Transform with Aggregation
The transform method allows you to apply aggregated values back to the original DataFrame:
# Add a column with the average price by product
print("\nUsing transform to add average price by product:")
sales_df['avg_product_price'] = sales_df.groupby('Product')['Price'].transform('mean')
print(sales_df[['Product', 'Price', 'avg_product_price']])
# Calculate percentage of total revenue
sales_df['pct_total_revenue'] = sales_df['Revenue'] / sales_df['Revenue'].sum() * 100
print("\nPercentage of total revenue:")
print(sales_df[['Product', 'Revenue', 'pct_total_revenue']])
Output:
Using transform to add average price by product:
  Product  Price  avg_product_price
0       A    100              100.0
1       B    150              150.0
2       A    100              100.0
3       C    200              200.0
4       B    150              150.0
5       A    100              100.0
6       C    200              200.0
7       B    150              150.0
8       A    100              100.0
9       B    150              150.0
Percentage of total revenue:
   Product  Revenue  pct_total_revenue
0        A     1000          5.063291
1        B     2250         11.392405
2        A     1200          6.075949
3        C     1600          8.101266
4        B     3000         15.189873
5        A     2500         12.658228
6        C     1000          5.063291
7        B     2700         13.670886
8        A     1200          6.075949
9        B     3300         16.708861
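The percentage above is measured against the overall total. To express each transaction as a share of its own product's revenue instead, transform('sum') can broadcast the per-product total back to every row; here is a sketch (the pct_product_revenue column name is just illustrative):
# Share of each transaction within its own product's total revenue
product_totals = sales_df.groupby('Product')['Revenue'].transform('sum')
sales_df['pct_product_revenue'] = sales_df['Revenue'] / product_totals * 100
print(sales_df[['Product', 'Revenue', 'pct_product_revenue']])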
Aggregating with Filtering
You can combine aggregation with filtering:
# Filter groups that have a total revenue > 5000
print("\nFiltering groups with high revenue:")
high_revenue_products = sales_df.groupby('Product').filter(lambda x: x['Revenue'].sum() > 5000)
print(high_revenue_products['Product'].unique())
# Aggregating after filtering
print("\nAggregating after filtering - High revenue products only:")
print(high_revenue_products.groupby('Product').agg(
    total_revenue=('Revenue', 'sum'),
    avg_units=('Units', 'mean')
))
Output:
Filtering groups with high revenue:
['A' 'B']
Aggregating after filtering - High revenue products only:
         total_revenue  avg_units
Product                          
A                 5900      14.75
B                11250      18.75
Summary
Pandas aggregation functions are powerful tools for data analysis that allow you to:
- Calculate summary statistics using built-in functions like sum, mean, min, max, etc.
- Apply aggregations to single or multiple columns
- Create custom aggregation functions for specialized calculations
- Group data by categories and apply different aggregations to each group
- Use named aggregations for more readable code
- Combine aggregation with other pandas operations like filtering and transformation
By mastering these techniques, you can efficiently summarize large datasets and extract valuable insights from your data.
Exercises
- Create a DataFrame with employee data (name, department, salary, years_of_service) and calculate the average salary by department (a starter sketch follows this list).
- Use a custom aggregation function to find the salary range (max - min) for each department.
- Calculate what percentage each employee's salary contributes to their department's total salary.
- Find departments where the average years of service is greater than 5.
- Create a report that shows, for each department, the total salary, average salary, minimum and maximum salary, and the number of employees.
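For the first exercise, here is one possible starting point; the employee names, departments, and figures below are made up, so substitute your own data:
import pandas as pd
# Made-up employee data -- replace with your own values
employees = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cleo', 'Dev'],
    'department': ['IT', 'IT', 'HR', 'HR'],
    'salary': [70000, 80000, 60000, 65000],
    'years_of_service': [3, 7, 4, 6],
})
# Exercise 1: average salary by department
print(employees.groupby('department')['salary'].mean())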