Pandas Split-Apply-Combine
In data analysis, we often need to divide our data into groups, perform operations on each group independently, and then combine the results back together. This pattern is known as "split-apply-combine" and is one of the most powerful techniques in data manipulation and analysis.
Pandas provides excellent support for this workflow through its groupby functionality, which makes complex aggregation operations intuitive and efficient.
Understanding the Split-Apply-Combine Pattern
The split-apply-combine strategy works in three steps:
- Split: Divide your data into groups based on one or more keys
- Apply: Perform operations on each group independently
- Combine: Bring the results back together into a single data structure
This approach allows you to analyze subsets of your data in a concise and efficient manner, making it ideal for tasks like:
- Computing group statistics
- Transforming data within groups
- Filtering groups based on aggregate properties
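Before diving into the API, it helps to see the three steps performed by hand. The sketch below (using a tiny hypothetical DataFrame) splits manually, applies a mean to each piece, and combines the results, then shows the equivalent one-line groupby call:
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# Split: one sub-DataFrame per unique key
pieces = {key: group for key, group in demo.groupby('key')}

# Apply: compute a statistic on each piece independently
means = {key: group['val'].mean() for key, group in pieces.items()}

# Combine: gather the per-group results back into one Series
print(pd.Series(means))  # a -> 2.0, b -> 3.0

# groupby performs all three steps in a single expression
print(demo.groupby('key')['val'].mean())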
Basic GroupBy Operations
Let's start with a simple example to understand how the split-apply-combine pattern works in Pandas:
import pandas as pd
import numpy as np
# Create a sample dataset
data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'B'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45],
    'Rating': [4, 3, 2, 5, 4, 3, 5, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
  Category  Value  Rating
0        A     10       4
1        A     15       3
2        B     20       2
3        B     25       5
4        A     30       4
5        C     35       3
6        C     40       5
7        B     45       2
Now, let's group the data by the 'Category' column and calculate the mean of each group:
# Group by Category and calculate mean
grouped_mean = df.groupby('Category').mean()
print("\nMean values by category:")
print(grouped_mean)
Output:
Mean values by category:
              Value    Rating
Category
A         18.333333  3.666667
B         30.000000  3.000000
C         37.500000  4.000000
In this example:
- Split: We split the data into groups based on the 'Category' column
- Apply: We applied the mean() function to each group
- Combine: Pandas automatically combined the results into a new DataFrame
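You don't have to aggregate every column. Selecting one column from the GroupBy object before aggregating returns a Series instead of a DataFrame; a small sketch using the same df:
# Aggregate a single column; the result is a Series keyed by group
value_means = df.groupby('Category')['Value'].mean()
print(value_means)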
Grouping by Multiple Columns
You can group by multiple columns to create hierarchical groups:
# Create a dataset with multiple categorical columns
data = {
    'Department': ['IT', 'IT', 'HR', 'HR', 'IT', 'HR', 'IT', 'HR'],
    'Team': ['Dev', 'QA', 'Recruit', 'Admin', 'Dev', 'Recruit', 'QA', 'Admin'],
    'Salary': [5000, 4500, 3500, 3000, 4800, 3200, 4700, 3100],
    'Experience': [3, 2, 5, 7, 4, 2, 3, 8]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Group by Department and Team
grouped = df.groupby(['Department', 'Team']).mean()
print("\nMean values grouped by Department and Team:")
print(grouped)
Output:
Original DataFrame:
  Department     Team  Salary  Experience
0         IT      Dev    5000           3
1         IT       QA    4500           2
2         HR  Recruit    3500           5
3         HR    Admin    3000           7
4         IT      Dev    4800           4
5         HR  Recruit    3200           2
6         IT       QA    4700           3
7         HR    Admin    3100           8

Mean values grouped by Department and Team:
                    Salary  Experience
Department Team
HR         Admin    3050.0         7.5
           Recruit  3350.0         3.5
IT         Dev      4900.0         3.5
           QA       4600.0         2.5
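Grouping by several keys gives the result a hierarchical row index (a MultiIndex). If you would rather keep the keys as ordinary columns, you can pass as_index=False or call reset_index() afterwards; a quick sketch:
# Keep the group keys as regular columns instead of a MultiIndex
flat = df.groupby(['Department', 'Team'], as_index=False).mean()

# Equivalent: flatten the index after aggregating
flat = df.groupby(['Department', 'Team']).mean().reset_index()
print(flat)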
Applying Aggregate Functions
Built-in Aggregation Functions
Pandas provides the agg() method to apply one or more aggregation functions to your grouped data:
# Apply multiple aggregation functions
aggregated = df.groupby('Department').agg({
    'Salary': ['min', 'max', 'mean', 'sum'],
    'Experience': ['min', 'max', 'mean']
})
print("Multiple aggregations:")
print(aggregated)
Output:
Multiple aggregations:
           Salary                       Experience
              min   max    mean    sum        min max mean
Department
HR           3000  3500  3200.0  12800          2   8  5.5
IT           4500  5000  4750.0  19000          2   4  3.0
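The dict-of-lists syntax above produces hierarchical column labels, which can be awkward downstream. Named aggregation, supported since pandas 0.25, lets you give each output column a flat, explicit name; here is a sketch of an equivalent call:
# Named aggregation: output_column=(input_column, function)
named = df.groupby('Department').agg(
    min_salary=('Salary', 'min'),
    max_salary=('Salary', 'max'),
    mean_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
)
print(named)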
Custom Aggregation Functions
You can also apply custom functions using agg():
# Define a custom function
def range_diff(x):
    return x.max() - x.min()

# Apply custom and built-in functions
custom_agg = df.groupby('Department').agg({
    'Salary': ['mean', range_diff],
    'Experience': ['sum', 'count']
})
print("Custom aggregation:")
print(custom_agg)
Output:
Custom aggregation:
            Salary            Experience
              mean range_diff        sum count
Department
HR          3200.0        500         22     4
IT          4750.0        500         12     4
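For a one-off computation, a lambda works just as well as a named function, though pandas then labels the column '<lambda>'; combining a lambda with named aggregation keeps the label readable. A sketch:
# A lambda with named aggregation keeps the output column readable
spread = df.groupby('Department').agg(
    salary_spread=('Salary', lambda s: s.max() - s.min())
)
print(spread)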
Transformation Operations
While aggregation reduces each group to a single value, transformation preserves the original DataFrame's shape but applies the function to each group's values:
# Create z-score transformation within each group
def zscore(x):
    return (x - x.mean()) / x.std()

# Apply the transformation to the numeric columns only
# (string columns like 'Team' cannot be z-scored)
transformed = df.groupby('Department')[['Salary', 'Experience']].transform(zscore)
result = pd.concat([df, transformed.add_suffix('_zscore')], axis=1)
print("Z-score transformation within groups:")
print(result.head())
Output:
Z-score transformation within groups:
  Department     Team  Salary  Experience  Salary_zscore  Experience_zscore
0         IT      Dev    5000           3       1.200961           0.000000
1         IT       QA    4500           2      -1.200961          -1.224745
2         HR  Recruit    3500           5       1.388730          -0.188982
3         HR    Admin    3000           7      -0.925820           0.566947
4         IT      Dev    4800           4       0.240192           1.224745
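transform also accepts the string name of a built-in aggregation, which pandas can execute much faster than a Python function. Because the result is index-aligned with the original DataFrame, it can be assigned straight back as a new column; for example, each salary as a share of its department's total:
# 'sum' is computed once per group and broadcast back to every row
dept_total = df.groupby('Department')['Salary'].transform('sum')
df['Salary_share'] = df['Salary'] / dept_total
print(df[['Department', 'Salary', 'Salary_share']])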
Filtering Groups
You can filter groups based on aggregate values:
# Filter groups where the average salary is greater than 4000
filtered = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 4000)
print("Filtered groups (avg salary > 4000):")
print(filtered)
Output:
Filtered groups (avg salary > 4000):
  Department Team  Salary  Experience
0         IT  Dev    5000           3
1         IT   QA    4500           2
4         IT  Dev    4800           4
6         IT   QA    4700           3
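filter calls the predicate once per group and reassembles the surviving rows, preserving the original index (note the row labels 0, 1, 4, 6 above). The same effect can be had with a boolean mask built from transform, which is often faster on large data; a sketch of both variants:
# Equivalent filtering via a broadcast group statistic
mask = df.groupby('Department')['Salary'].transform('mean') > 4000
filtered_fast = df[mask]

# filter can also use group size, e.g. keep groups with 3+ rows
big_groups = df.groupby('Department').filter(lambda x: len(x) >= 3)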
Real-world Application: Sales Data Analysis
Let's apply the split-apply-combine pattern to analyze some sales data:
# Create a sales dataset
sales_data = {
    # 'M' = month-end frequency (spelled 'ME' in pandas >= 2.2)
    'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'),
    'Region': ['North', 'South', 'East', 'West', 'North', 'South',
               'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Units': [125, 100, 85, 90, 115, 95, 80, 110, 105, 120, 95, 130],
    'Revenue': [6250, 5000, 8500, 9000, 11500, 4750, 4000, 11000, 5250, 12000, 4750, 13000]
}
sales_df = pd.DataFrame(sales_data)
sales_df['Month'] = sales_df['Date'].dt.month_name()
sales_df['Quarter'] = sales_df['Date'].dt.quarter
print("Sales Data Sample:")
print(sales_df.head())
# Quarterly sales by region
quarterly_sales = sales_df.groupby(['Quarter', 'Region'])['Revenue'].sum().unstack()
print("\nQuarterly Sales by Region:")
print(quarterly_sales)
# Average unit price by product and region
sales_df['Unit_Price'] = sales_df['Revenue'] / sales_df['Units']
avg_price = sales_df.groupby(['Product', 'Region'])['Unit_Price'].mean().unstack()
print("\nAverage Unit Price by Product and Region:")
print(avg_price)
# Finding top-performing regions by quarter
top_regions = sales_df.groupby(['Quarter', 'Region'])['Revenue'].sum().reset_index()
top_regions_by_quarter = top_regions.loc[top_regions.groupby('Quarter')['Revenue'].idxmax()]
print("\nTop Performing Region by Quarter:")
print(top_regions_by_quarter)
Output (sample):
Sales Data Sample:
        Date Region Product  Units  Revenue     Month  Quarter
0 2023-01-31  North       A    125     6250   January        1
1 2023-02-28  South       A    100     5000  February        1
2 2023-03-31   East       B     85     8500     March        1
3 2023-04-30   West       B     90     9000     April        2
4 2023-05-31  North       B    115    11500       May        2
Quarterly Sales by Region:
Region     East    North    South     West
Quarter
1        8500.0   6250.0   5000.0      NaN
2           NaN  11500.0   4750.0   9000.0
3        4000.0   5250.0      NaN  11000.0
4        4750.0      NaN  12000.0  13000.0

Average Unit Price by Product and Region:
Region    East  North  South   West
Product
A         50.0   50.0   50.0    NaN
B        100.0  100.0  100.0  100.0

Top Performing Region by Quarter:
    Quarter Region  Revenue
0         1   East     8500
3         2  North    11500
8         3   West    11000
11        4   West    13000
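As an aside, the groupby-sum-unstack chain used for quarterly_sales is equivalent to a pivot table, which can be the more readable spelling for a two-way summary:
# Same table as quarterly_sales, spelled as a pivot table
quarterly_pivot = pd.pivot_table(
    sales_df, values='Revenue',
    index='Quarter', columns='Region', aggfunc='sum'
)
print(quarterly_pivot)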
Summary
The split-apply-combine pattern in Pandas provides a powerful framework for analyzing grouped data. Through the groupby operation, you can:
- Split your data into meaningful groups
- Apply various operations like aggregation, transformation, or filtering
- Combine results into a new data structure
This pattern is extremely valuable for data analysis tasks where you need to understand how different segments of your data behave. It's among the most commonly used Pandas workflows in real-world data analysis projects.
Exercises
To reinforce your understanding of the split-apply-combine pattern, try these exercises:
- Group a DataFrame by day of week and calculate the mean, median, and standard deviation of a numeric column.
- Create a custom aggregation that finds the 90th percentile of values within each group.
- Use transformation to calculate the percentage of total for each value within its group.
- Filter groups that have at least 5 entries and a standard deviation below a certain threshold.
- For a retail dataset, analyze sales by product category and season, finding which products perform best in different times of year.
Additional Resources
- Pandas GroupBy Documentation
- Split-Apply-Combine Pattern
- Pandas Aggregation Functions
- "Python for Data Analysis" by Wes McKinney - Contains excellent examples of the split-apply-combine pattern
By mastering the split-apply-combine pattern in Pandas, you'll be equipped with one of the most powerful tools in data analysis for working with grouped data efficiently.