Pandas Split-Apply-Combine
In data analysis, we often need to divide our data into groups, perform operations on each group independently, and then combine the results back together. This pattern is known as "split-apply-combine" and is one of the most powerful techniques in data manipulation and analysis.
Pandas provides excellent support for this workflow through its groupby functionality, which makes complex aggregation operations intuitive and efficient.
Understanding the Split-Apply-Combine Pattern
The split-apply-combine strategy works in three steps:
- Split: Divide your data into groups based on one or more keys
- Apply: Perform operations on each group independently
- Combine: Bring the results back together into a single data structure
This approach allows you to analyze subsets of your data in a concise and efficient manner, making it ideal for tasks like:
- Computing group statistics
- Transforming data within groups
- Filtering groups based on aggregate properties
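Before diving into the API, it helps to see the three steps performed by hand. The sketch below (using a tiny hypothetical DataFrame) splits manually, applies a mean to each piece, and combines the results, then shows the equivalent one-line groupby call:
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# Split: one sub-DataFrame per unique key
pieces = {key: group for key, group in demo.groupby('key')}

# Apply: compute a statistic on each piece independently
means = {key: group['val'].mean() for key, group in pieces.items()}

# Combine: gather the per-group results back into one Series
print(pd.Series(means))  # a -> 2.0, b -> 3.0

# groupby performs all three steps in a single expression
print(demo.groupby('key')['val'].mean())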
Basic GroupBy Operations
Let's start with a simple example to understand how the split-apply-combine pattern works in Pandas:
import pandas as pd
import numpy as np
# Create a sample dataset
data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'B'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45],
    'Rating': [4, 3, 2, 5, 4, 3, 5, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
  Category  Value  Rating
0        A     10       4
1        A     15       3
2        B     20       2
3        B     25       5
4        A     30       4
5        C     35       3
6        C     40       5
7        B     45       2
Now, let's group the data by the 'Category' column and calculate the mean of each group:
# Group by Category and calculate mean
grouped_mean = df.groupby('Category').mean()
print("\nMean values by category:")
print(grouped_mean)
Output:
Mean values by category:
              Value    Rating
Category
A         18.333333  3.666667
B         30.000000  3.000000
C         37.500000  4.000000
In this example:
- Split: We split the data into groups based on the 'Category' column
- Apply: We applied the mean() function to each group
- Combine: Pandas automatically combined the results into a new DataFrame
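You don't have to aggregate every column. Selecting one column from the GroupBy object before aggregating returns a Series instead of a DataFrame; a small sketch using the same df:
# Aggregate a single column; the result is a Series keyed by group
value_means = df.groupby('Category')['Value'].mean()
print(value_means)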
Grouping by Multiple Columns
You can group by multiple columns to create hierarchical groups:
# Create a dataset with multiple categorical columns
data = {
    'Department': ['IT', 'IT', 'HR', 'HR', 'IT', 'HR', 'IT', 'HR'],
    'Team': ['Dev', 'QA', 'Recruit', 'Admin', 'Dev', 'Recruit', 'QA', 'Admin'],
    'Salary': [5000, 4500, 3500, 3000, 4800, 3200, 4700, 3100],
    'Experience': [3, 2, 5, 7, 4, 2, 3, 8]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Group by Department and Team
grouped = df.groupby(['Department', 'Team']).mean()
print("\nMean values grouped by Department and Team:")
print(grouped)
Output:
Original DataFrame:
  Department     Team  Salary  Experience
0         IT      Dev    5000           3
1         IT       QA    4500           2
2         HR  Recruit    3500           5
3         HR    Admin    3000           7
4         IT      Dev    4800           4
5         HR  Recruit    3200           2
6         IT       QA    4700           3
7         HR    Admin    3100           8

Mean values grouped by Department and Team:
                    Salary  Experience
Department Team
HR         Admin    3050.0         7.5
           Recruit  3350.0         3.5
IT         Dev      4900.0         3.5
           QA       4600.0         2.5
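Grouping by several keys gives the result a hierarchical row index (a MultiIndex). If you would rather keep the keys as ordinary columns, you can pass as_index=False or call reset_index() afterwards; a quick sketch:
# Keep the group keys as regular columns instead of a MultiIndex
flat = df.groupby(['Department', 'Team'], as_index=False).mean()

# Equivalent: flatten the index after aggregating
flat = df.groupby(['Department', 'Team']).mean().reset_index()
print(flat)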
Applying Aggregate Functions
Built-in Aggregation Functions
Pandas provides the agg() method to apply one or more aggregation functions to your grouped data:
# Apply multiple aggregation functions
aggregated = df.groupby('Department').agg({
    'Salary': ['min', 'max', 'mean', 'sum'],
    'Experience': ['min', 'max', 'mean']
})
print("Multiple aggregations:")
print(aggregated)
Output:
Multiple aggregations:
           Salary                       Experience
              min   max    mean    sum        min max mean
Department
HR           3000  3500  3200.0  12800          2   8  5.5
IT           4500  5000  4750.0  19000          2   4  3.0
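The dict-of-lists syntax above produces hierarchical column labels, which can be awkward downstream. Named aggregation, supported since pandas 0.25, lets you give each output column a flat, explicit name; here is a sketch of an equivalent call:
# Named aggregation: output_column=(input_column, function)
named = df.groupby('Department').agg(
    min_salary=('Salary', 'min'),
    max_salary=('Salary', 'max'),
    mean_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
)
print(named)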
Custom Aggregation Functions
You can also apply custom functions using agg():
# Define a custom function
def range_diff(x):
    return x.max() - x.min()

# Apply custom and built-in functions
custom_agg = df.groupby('Department').agg({
    'Salary': ['mean', range_diff],
    'Experience': ['sum', 'count']
})
print("Custom aggregation:")
print(custom_agg)
Output:
Custom aggregation:
            Salary            Experience
              mean range_diff        sum count
Department
HR          3200.0        500         22     4
IT          4750.0        500         12     4
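For a one-off computation, a lambda works just as well as a named function, though pandas then labels the column '<lambda>'; combining a lambda with named aggregation keeps the label readable. A sketch:
# A lambda with named aggregation keeps the output column readable
spread = df.groupby('Department').agg(
    salary_spread=('Salary', lambda s: s.max() - s.min())
)
print(spread)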
Transformation Operations
While aggregation reduces each group to a single value, transformation preserves the original DataFrame's shape but applies the function to each group's values:
# Create z-score transformation within each group
def zscore(x):
    return (x - x.mean()) / x.std()

# Apply the transformation to the numeric columns only
# (string columns like 'Team' cannot be z-scored)
transformed = df.groupby('Department')[['Salary', 'Experience']].transform(zscore)
result = pd.concat([df, transformed.add_suffix('_zscore')], axis=1)
print("Z-score transformation within groups:")
print(result.head())
Output:
Z-score transformation within groups:
  Department     Team  Salary  Experience  Salary_zscore  Experience_zscore
0         IT      Dev    5000           3       1.200961           0.000000
1         IT       QA    4500           2      -1.200961          -1.224745
2         HR  Recruit    3500           5       1.388730          -0.188982
3         HR    Admin    3000           7      -0.925820           0.566947
4         IT      Dev    4800           4       0.240192           1.224745
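transform also accepts the string name of a built-in aggregation, which pandas can execute much faster than a Python function. Because the result is index-aligned with the original DataFrame, it can be assigned straight back as a new column; for example, each salary as a share of its department's total:
# 'sum' is computed once per group and broadcast back to every row
dept_total = df.groupby('Department')['Salary'].transform('sum')
df['Salary_share'] = df['Salary'] / dept_total
print(df[['Department', 'Salary', 'Salary_share']])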
Filtering Groups
You can filter groups based on aggregate values:
# Filter groups where the average salary is greater than 4000
filtered = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 4000)
print("Filtered groups (avg salary > 4000):")
print(filtered)
Output:
Filtered groups (avg salary > 4000):
  Department Team  Salary  Experience
0         IT  Dev    5000           3
1         IT   QA    4500           2
4         IT  Dev    4800           4
6         IT   QA    4700           3
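filter calls the predicate once per group and reassembles the surviving rows, preserving the original index (note the row labels 0, 1, 4, 6 above). The same effect can be had with a boolean mask built from transform, which is often faster on large data; a sketch of both variants:
# Equivalent filtering via a broadcast group statistic
mask = df.groupby('Department')['Salary'].transform('mean') > 4000
filtered_fast = df[mask]

# filter can also use group size, e.g. keep groups with 3+ rows
big_groups = df.groupby('Department').filter(lambda x: len(x) >= 3)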
Real-world Application: Sales Data Analysis
Let's apply the split-apply-combine pattern to analyze some sales data:
# Create a sales dataset
sales_data = {
    # 'M' = month-end frequency (spelled 'ME' in pandas >= 2.2)
    'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'),
    'Region': ['North', 'South', 'East', 'West', 'North', 'South',
               'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Units': [125, 100, 85, 90, 115, 95, 80, 110, 105, 120, 95, 130],
    'Revenue': [6250, 5000, 8500, 9000, 11500, 4750, 4000, 11000, 5250, 12000, 4750, 13000]
}
sales_df = pd.DataFrame(sales_data)
sales_df['Month'] = sales_df['Date'].dt.month_name()
sales_df['Quarter'] = sales_df['Date'].dt.quarter
print("Sales Data Sample:")
print(sales_df.head())
# Quarterly sales by region
quarterly_sales = sales_df.groupby(['Quarter', 'Region'])['Revenue'].sum().unstack()
print("\nQuarterly Sales by Region:")
print(quarterly_sales)
# Average unit price by product and region
sales_df['Unit_Price'] = sales_df['Revenue'] / sales_df['Units']
avg_price = sales_df.groupby(['Product', 'Region'])['Unit_Price'].mean().unstack()
print("\nAverage Unit Price by Product and Region:")
print(avg_price)
# Finding top-performing regions by quarter
top_regions = sales_df.groupby(['Quarter', 'Region'])['Revenue'].sum().reset_index()
top_regions_by_quarter = top_regions.loc[top_regions.groupby('Quarter')['Revenue'].idxmax()]
print("\nTop Performing Region by Quarter:")
print(top_regions_by_quarter)
Output (sample):
Sales Data Sample:
        Date Region Product  Units  Revenue     Month  Quarter
0 2023-01-31  North       A    125     6250   January        1
1 2023-02-28  South       A    100     5000  February        1
2 2023-03-31   East       B     85     8500     March        1
3 2023-04-30   West       B     90     9000     April        2
4 2023-05-31  North       B    115    11500       May        2
Quarterly Sales by Region:
Region     East    North    South     West
Quarter
1        8500.0   6250.0   5000.0      NaN
2           NaN  11500.0   4750.0   9000.0
3        4000.0   5250.0      NaN  11000.0
4        4750.0      NaN  12000.0  13000.0

Average Unit Price by Product and Region:
Region    East  North  South   West
Product
A         50.0   50.0   50.0    NaN
B        100.0  100.0  100.0  100.0

Top Performing Region by Quarter:
    Quarter Region  Revenue
0         1   East     8500
3         2  North    11500
8         3   West    11000
11        4   West    13000
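As an aside, the groupby-sum-unstack chain used for quarterly_sales is equivalent to a pivot table, which can be the more readable spelling for a two-way summary:
# Same table as quarterly_sales, spelled as a pivot table
quarterly_pivot = pd.pivot_table(
    sales_df, values='Revenue',
    index='Quarter', columns='Region', aggfunc='sum'
)
print(quarterly_pivot)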
Summary
The split-apply-combine pattern in Pandas provides a powerful framework for analyzing grouped data. Through the groupby operation, you can:
- Split your data into meaningful groups
- Apply various operations like aggregation, transformation, or filtering
- Combine results into a new data structure
This pattern is extremely valuable for data analysis tasks where you need to understand how different segments of your data behave. It's among the most commonly used Pandas workflows in real-world data analysis projects.
Exercises
To reinforce your understanding of the split-apply-combine pattern, try these exercises:
- Group a DataFrame by day of week and calculate the mean, median, and standard deviation of a numeric column.
- Create a custom aggregation that finds the 90th percentile of values within each group.
- Use transformation to calculate the percentage of total for each value within its group.
- Filter groups that have at least 5 entries and a standard deviation below a certain threshold.
- For a retail dataset, analyze sales by product category and season, finding which products perform best in different times of year.
Additional Resources
- Pandas GroupBy Documentation
- Split-Apply-Combine Pattern
- Pandas Aggregation Functions
- "Python for Data Analysis" by Wes McKinney - Contains excellent examples of the split-apply-combine pattern
By mastering the split-apply-combine pattern in Pandas, you'll be equipped with one of the most powerful tools in data analysis for working with grouped data efficiently.