Pandas Aggregation Functions
In data analysis, we often need to summarize large datasets to extract meaningful insights. Pandas provides a rich set of aggregation functions that allow us to calculate statistics across groups of data. These functions are essential tools for data analysts and scientists who need to condense information and identify patterns.
What are Aggregation Functions?
Aggregation functions are operations that transform multiple values into a single summary value. For example, computing the sum, average, maximum, or minimum of a series of numbers. In Pandas, these functions can be applied to DataFrames and Series to produce concise statistical summaries.
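Before looking at DataFrames, here is a minimal sketch of the idea: an aggregation collapses an entire Series into a single scalar.

```python
import pandas as pd

# A small Series of numbers
s = pd.Series([4, 8, 15, 16, 23])

# Each aggregation returns one summary value for the whole Series
total = s.sum()      # 66
average = s.mean()   # 13.2
largest = s.max()    # 23
print(total, average, largest)
```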
Basic Aggregation Functions
Pandas offers several built-in aggregation functions that can be applied directly to Series and DataFrame objects.
Common Aggregation Functions
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Values': [10, 25, 15, 30, 20, 35],
'Scores': [88, 92, 75, 87, 91, 84]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
Category Values Scores
0 A 10 88
1 B 25 92
2 A 15 75
3 B 30 87
4 A 20 91
5 B 35 84
Now let's explore some basic aggregation functions:
# Sum of all values
print("\nSum of Values:")
print(df['Values'].sum())
# Mean (average) of all values
print("\nMean of Values:")
print(df['Values'].mean())
# Minimum value
print("\nMinimum Value:")
print(df['Values'].min())
# Maximum value
print("\nMaximum Value:")
print(df['Values'].max())
# Count of values
print("\nCount of Values:")
print(df['Values'].count())
# Standard deviation
print("\nStandard Deviation of Values:")
print(df['Values'].std())
# Multiple aggregations at once
print("\nMultiple aggregations:")
print(df['Values'].agg(['sum', 'mean', 'min', 'max', 'std']))
Output:
Sum of Values:
135
Mean of Values:
22.5
Minimum Value:
10
Maximum Value:
35
Count of Values:
6
Standard Deviation of Values:
9.354143466934852
Multiple aggregations:
sum 135.000000
mean 22.500000
min 10.000000
max 35.000000
std 9.354143
Name: Values, dtype: float64
Applying Aggregations on Multiple Columns
You can apply aggregation functions to multiple columns at once:
# Aggregate operations on all numeric columns
print("\nAggregate statistics for all numeric columns:")
print(df.describe())
# Specific aggregations on multiple columns
print("\nSpecific aggregations on Values and Scores:")
print(df[['Values', 'Scores']].agg(['sum', 'mean', 'max']))
Output:
Aggregate statistics for all numeric columns:
Values Scores
count 6.000000 6.000000
mean 22.500000 86.166667
std 9.354143 6.177918
min 10.000000 75.000000
25% 16.250000 84.750000
50% 22.500000 87.500000
75% 28.750000 90.250000
max 35.000000 92.000000
Specific aggregations on Values and Scores:
Values Scores
sum 135.0 517.000000
mean 22.5 86.166667
max 35.0 92.000000
Custom Aggregation Functions
Pandas allows you to create and apply custom aggregation functions:
# Define a custom function to calculate the range (max - min)
def value_range(x):
return x.max() - x.min()
# Apply the custom function
print("\nCustom aggregation (range):")
print(df['Values'].agg(value_range))
# Using a lambda function
print("\nUsing lambda for custom aggregation:")
print(df['Values'].agg(lambda x: x.max() - x.min()))
# Multiple custom aggregations
print("\nMultiple custom aggregations:")
print(df['Values'].agg([
('Range', lambda x: x.max() - x.min()),
('Median', np.median),
('Sum of Squares', lambda x: np.sum(x**2))
]))
Output:
Custom aggregation (range):
25
Using lambda for custom aggregation:
25
Multiple custom aggregations:
Range 25.0
Median 22.5
Sum of Squares 3475.0
Name: Values, dtype: float64
GroupBy with Aggregation
One of the most powerful features of Pandas is the ability to group data and apply aggregations to each group:
# Group by Category and calculate statistics
print("\nGroup by Category with aggregation:")
print(df.groupby('Category')['Values'].agg(['sum', 'mean', 'min', 'max']))
# Multiple aggregations with different functions for different columns
print("\nDifferent aggregations for different columns:")
aggregations = {
'Values': ['sum', 'mean', 'max'],
'Scores': ['mean', 'min', 'max']
}
print(df.groupby('Category').agg(aggregations))
Output:
Group by Category with aggregation:
sum mean min max
Category
A 45 15.0 10 20
B 90 30.0 25 35
Different aggregations for different columns:
Values Scores
sum mean max mean min max
Category
A 45 15.0 20 84.666667 75 91
B 90 30.0 35 87.666667 84 92
Named Aggregations
Pandas 0.25+ introduced named aggregations for more readable code:
# Named aggregations (pandas 0.25+)
print("\nNamed aggregations:")
result = df.groupby('Category').agg(
total_values=('Values', 'sum'),
avg_values=('Values', 'mean'),
avg_scores=('Scores', 'mean'),
max_scores=('Scores', 'max')
)
print(result)
Output:
Named aggregations:
total_values avg_values avg_scores max_scores
Category
A 45 15.0 84.666667 91
B 90 30.0 87.666667 92
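The keyword syntax shown above is shorthand for pd.NamedAgg objects; the same result can be written in the explicit form. A standalone sketch (the DataFrame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 25, 15, 30, 20, 35],
})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit equivalent of
# the output_name=('column', 'func') tuples used above
result = df.groupby('Category').agg(
    total_values=pd.NamedAgg(column='Values', aggfunc='sum'),
    avg_values=pd.NamedAgg(column='Values', aggfunc='mean'),
)
print(result)
```

Both spellings produce identical output; the tuple form is just shorter to type.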
Real-world Application: Sales Data Analysis
Let's see how aggregation functions can be used in a real-world scenario with sales data:
# Create a sales dataset
sales_data = {
'Date': pd.date_range(start='2023-01-01', periods=10),
'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'B'],
'Region': ['East', 'East', 'West', 'East', 'West', 'West', 'East', 'West', 'East', 'West'],
'Units': [10, 15, 12, 8, 20, 25, 5, 18, 12, 22],
'Price': [100, 150, 100, 200, 150, 100, 200, 150, 100, 150],
}
sales_df = pd.DataFrame(sales_data)
# Calculate revenue for each transaction
sales_df['Revenue'] = sales_df['Units'] * sales_df['Price']
print("Sales Data:")
print(sales_df)
# Analyze sales data by product
print("\nSales analysis by product:")
product_analysis = sales_df.groupby('Product').agg(
total_units=('Units', 'sum'),
total_revenue=('Revenue', 'sum'),
avg_price=('Price', 'mean'),
transactions=('Units', 'count')
)
print(product_analysis)
# Analyze sales data by region and product
print("\nSales analysis by region and product:")
region_product_analysis = sales_df.groupby(['Region', 'Product']).agg(
total_revenue=('Revenue', 'sum'),
avg_units=('Units', 'mean')
).sort_values('total_revenue', ascending=False)
print(region_product_analysis)
# Monthly sales trend
sales_df['Month'] = sales_df['Date'].dt.month_name()
print("\nMonthly sales trend:")
monthly_sales = sales_df.groupby('Month').agg(
total_revenue=('Revenue', 'sum'),
avg_revenue=('Revenue', 'mean')
)
print(monthly_sales)
Output:
Sales Data:
Date Product Region Units Price Revenue
0 2023-01-01 A East 10 100 1000
1 2023-01-02 B East 15 150 2250
2 2023-01-03 A West 12 100 1200
3 2023-01-04 C East 8 200 1600
4 2023-01-05 B West 20 150 3000
5 2023-01-06 A West 25 100 2500
6 2023-01-07 C East 5 200 1000
7 2023-01-08 B West 18 150 2700
8 2023-01-09 A East 12 100 1200
9 2023-01-10 B West 22 150 3300
Sales analysis by product:
total_units total_revenue avg_price transactions
Product
A 59 5900 100.0 4
B 75 11250 150.0 4
C 13 2600 200.0 2
Sales analysis by region and product:
total_revenue avg_units
Region Product
West B 9000 20.0
West A 3700 18.5
East C 2600 6.5
East B 2250 15.0
East A 2200 11.0
Monthly sales trend:
total_revenue avg_revenue
Month
January 19750 1975.0
Advanced Aggregation Techniques
Using Transform with Aggregation
The transform method allows you to apply aggregated values back to the original DataFrame:
# Add a column with the average price by product
print("\nUsing transform to add average price by product:")
sales_df['avg_product_price'] = sales_df.groupby('Product')['Price'].transform('mean')
print(sales_df[['Product', 'Price', 'avg_product_price']])
# Calculate percentage of total revenue
sales_df['pct_total_revenue'] = sales_df['Revenue'] / sales_df['Revenue'].sum() * 100
print("\nPercentage of total revenue:")
print(sales_df[['Product', 'Revenue', 'pct_total_revenue']])
Output:
Using transform to add average price by product:
Product Price avg_product_price
0 A 100 100.0
1 B 150 150.0
2 A 100 100.0
3 C 200 200.0
4 B 150 150.0
5 A 100 100.0
6 C 200 200.0
7 B 150 150.0
8 A 100 100.0
9 B 150 150.0
Percentage of total revenue:
Product Revenue pct_total_revenue
0 A 1000 5.063291
1 B 2250 11.392405
2 A 1200 6.075949
3 C 1600 8.101266
4 B 3000 15.189873
5 A 2500 12.658228
6 C 1000 5.063291
7 B 2700 13.670886
8 A 1200 6.075949
9 B 3300 16.708861
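Because transform returns a result aligned with the original rows, it also makes it easy to express each transaction as a share of its own group's total rather than of the grand total. A sketch with a small standalone dataset (the column names here are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B'],
    'Revenue': [100, 300, 300, 100],
})

# transform('sum') broadcasts each product's total revenue back to its
# rows, so the division happens row by row within each group
sales['pct_of_product'] = (
    sales['Revenue'] / sales.groupby('Product')['Revenue'].transform('sum') * 100
)
print(sales)
```

Each product here totals 400, so the 100-revenue rows become 25% and the 300-revenue rows become 75% of their respective groups.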
Aggregating with Filtering
You can combine aggregation with filtering:
# Filter groups that have a total revenue > 5000
print("\nFiltering groups with high revenue:")
high_revenue_products = sales_df.groupby('Product').filter(lambda x: x['Revenue'].sum() > 5000)
print(high_revenue_products['Product'].unique())
# Aggregating after filtering
print("\nAggregating after filtering - High revenue products only:")
print(high_revenue_products.groupby('Product').agg(
total_revenue=('Revenue', 'sum'),
avg_units=('Units', 'mean')
))
Output:
Filtering groups with high revenue:
['A' 'B']
Aggregating after filtering - High revenue products only:
total_revenue avg_units
Product
A 5900 14.75
B 11250 18.75
Summary
Pandas aggregation functions are powerful tools for data analysis that allow you to:
- Calculate summary statistics using built-in functions like sum, mean, min, max, and std
- Apply aggregations to single or multiple columns
- Create custom aggregation functions for specialized calculations
- Group data by categories and apply different aggregations to each group
- Use named aggregations for more readable code
- Combine aggregation with other pandas operations like filtering and transformation
By mastering these techniques, you can efficiently summarize large datasets and extract valuable insights from your data.
Exercises
- Create a DataFrame with employee data (name, department, salary, years_of_service) and calculate the average salary by department.
- Use a custom aggregation function to find the salary range (max - min) for each department.
- Calculate what percentage each employee's salary contributes to their department's total salary.
- Find departments where the average years of service is greater than 5.
- Create a report that shows for each department: the total salary, average salary, minimum and maximum salary, and the number of employees.
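A possible starting point for the first exercise (the sample names and values below are only suggestions):

```python
import pandas as pd

# Hypothetical employee data matching the columns named in the exercise
employees = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'department': ['IT', 'IT', 'HR', 'HR'],
    'salary': [70000, 80000, 50000, 60000],
    'years_of_service': [3, 7, 10, 2],
})

# Exercise 1: average salary by department
avg_salary = employees.groupby('department')['salary'].mean()
print(avg_salary)
```

The remaining exercises can all be built from the same DataFrame using the custom-aggregation, transform, and filter techniques covered above.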