Pandas Groupby Basics
Introduction
When analyzing data, we often need to split our data into groups based on certain criteria, apply some operations to each group, and then potentially combine the results. This split-apply-combine pattern is at the heart of data analysis, and pandas provides a powerful tool for this purpose: the GroupBy functionality.
In this tutorial, we'll explore the basics of pandas' GroupBy operation, which allows you to:
- Group data based on values in one or more columns
- Perform aggregation operations on these groups
- Transform data within these groups
- Filter groups based on certain conditions
Prerequisites
Before we begin, make sure you have pandas installed:
# Install pandas if you haven't already
# !pip install pandas
# Import the necessary libraries
import pandas as pd
import numpy as np
Creating a Sample Dataset
Let's create a sample dataset to work with throughout this tutorial:
# Create a sample dataframe
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16]
}
df = pd.DataFrame(data)
print(df)
Output:
Category Store Sales Items
0 A S1 100 10
1 B S1 200 15
2 A S2 150 12
3 B S2 250 18
4 A S1 120 8
5 B S2 180 20
6 A S1 90 7
7 B S2 220 16
Basic GroupBy Operations
Creating a GroupBy Object
The first step in using GroupBy is to create a GroupBy object by specifying one or more columns to group by:
# Group by a single column
grouped_by_category = df.groupby('Category')
# Group by multiple columns
grouped_by_category_store = df.groupby(['Category', 'Store'])
A GroupBy object itself doesn't display much useful information when printed, but it's a powerful object that we can use for various operations.
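For instance, printing a GroupBy object only shows its type and memory address, but attributes like `ngroups` and `groups` reveal what the grouping contains (a quick sketch using the sample dataframe from above):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

grouped = df.groupby('Category')
print(grouped)          # just the object's type and memory address
print(grouped.ngroups)  # number of groups: 2
print(grouped.groups)   # mapping of group label -> row index labels
```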
Basic Aggregations
The most common operation with GroupBy is aggregation. Let's see how we can aggregate our data:
# Calculate sum for each group (numeric columns only)
sum_by_category = df.groupby('Category').sum(numeric_only=True)
print("Sum by Category:")
print(sum_by_category)
Output:
Sum by Category:
Sales Items
Category
A 460 37
B 850 69
# Calculate mean for each group (numeric columns only; mean is
# undefined for the string column 'Store')
mean_by_category = df.groupby('Category').mean(numeric_only=True)
print("\nMean by Category:")
print(mean_by_category)
Output:
Mean by Category:
Sales Items
Category
A 115.0 9.25
B 212.5 17.25
# Calculate multiple aggregations at once on the numeric columns
agg_by_category = df.groupby('Category')[['Sales', 'Items']].agg(['sum', 'mean', 'count'])
print("\nMultiple aggregations by Category:")
print(agg_by_category)
Output:
Multiple aggregations by Category:
Sales Items
sum mean count sum mean count
Category
A 460 115.0 4 37 9.25 4
B 850 212.5 4 69 17.25 4
Grouping by Multiple Columns
Let's see how we can group by multiple columns:
# Group by Category and Store
grouped_multi = df.groupby(['Category', 'Store'])
sum_multi = grouped_multi.sum()
print("Sum by Category and Store:")
print(sum_multi)
Output:
Sum by Category and Store:
Sales Items
Category Store
A S1 310 25
S2 150 12
B S1 200 15
S2 650 54
Notice how the index now has multiple levels. This is a hierarchical or MultiIndex.
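You can select from a MultiIndexed result with a tuple of keys, or flatten it back into ordinary columns with `reset_index()` (a sketch, rebuilding the aggregation above):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

sum_multi = df.groupby(['Category', 'Store']).sum()

# Select a single (Category, Store) combination with a tuple
print(sum_multi.loc[('A', 'S1')])  # Sales 310, Items 25

# Turn the index levels back into regular columns
flat = sum_multi.reset_index()
print(flat)
```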
Accessing Specific Groups
You can access specific groups from a GroupBy object:
# Get the group data for category 'A'
group_a = df.groupby('Category').get_group('A')
print("Data for Category A:")
print(group_a)
Output:
Data for Category A:
Category Store Sales Items
0 A S1 100 10
2 A S2 150 12
4 A S1 120 8
6 A S1 90 7
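When grouping by multiple columns, `get_group` expects a tuple of key values, one per grouping column:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

# Rows where Category is 'A' and Store is 'S1'
group_a_s1 = df.groupby(['Category', 'Store']).get_group(('A', 'S1'))
print(group_a_s1)  # rows 0, 4 and 6
```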
Common Aggregation Methods
Pandas provides many built-in aggregation functions:
# Basic statistical aggregations
print("Count by Category:")
print(df.groupby('Category').count())
print("\nMin by Category:")
print(df.groupby('Category').min())
print("\nMax by Category:")
print(df.groupby('Category').max())
print("\nStandard Deviation by Category:")
print(df.groupby('Category').std(numeric_only=True))
Output:
Count by Category:
Store Sales Items
Category
A 4 4 4
B 4 4 4
Min by Category:
Store Sales Items
Category
A S1 90 7
B S1 180 15
Max by Category:
Store Sales Items
Category
A S2 150 12
B S2 250 20
Standard Deviation by Category:
Sales Items
Category
A 26.457513 2.217356
B 29.860789 2.217356
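If you want the standard statistics in one shot, `describe()` computes count, mean, std, min, quartiles, and max per group:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
})

# One row of summary statistics per group
desc = df.groupby('Category')['Sales'].describe()
print(desc)
```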
Custom Aggregation Functions
You can also define your own aggregation functions:
# Custom aggregation - calculate range (max - min) on the numeric columns
def get_range(x):
    return x.max() - x.min()

print("Range by Category:")
print(df.groupby('Category')[['Sales', 'Items']].agg(get_range))
Output:
Range by Category:
Sales Items
Category
A 60 5
B 70 5
Different Aggregations for Different Columns
You can apply different aggregation functions to different columns:
# Apply different aggregations to different columns
custom_agg = df.groupby('Category').agg({
    'Sales': ['sum', 'mean', 'max'],
    'Items': ['min', 'max', get_range]
})
print("Custom aggregations by column:")
print(custom_agg)
Output:
Custom aggregations by column:
Sales Items
sum mean max min max get_range
Category
A 460 115.0 150 7 12 5
B 850 212.5 250 15 20 5
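A related option (available since pandas 0.25) is named aggregation, which produces flat, readable column names instead of a two-level header; the result column names below are of course your choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

# Each keyword becomes an output column: name=(input_column, aggregation)
named = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    item_range=('Items', lambda x: x.max() - x.min()),
)
print(named)
```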
Iterating Through Groups
You can iterate through the groups in a GroupBy object:
# Iterate through the groups
print("Iterating through Category groups:")
for name, group in df.groupby('Category'):
    print(f"\nCategory: {name}")
    print(group)
Output:
Iterating through Category groups:
Category: A
Category Store Sales Items
0 A S1 100 10
2 A S2 150 12
4 A S1 120 8
6 A S1 90 7
Category: B
Category Store Sales Items
1 B S1 200 15
3 B S2 250 18
5 B S2 180 20
7 B S2 220 16
Transformation and Filtering
Transformations
The transform method allows you to perform operations that return data with the same shape as the input:
# Calculate the percentage of sales within each category
df['SalesPercent'] = df.groupby('Category')['Sales'].transform(
    lambda x: x / x.sum() * 100
)
print("DataFrame with Sales Percentage column:")
print(df)
Output:
Category Store Sales Items SalesPercent
0 A S1 100 10 21.739130
1 B S1 200 15 23.529412
2 A S2 150 12 32.608696
3 B S2 250 18 29.411765
4 A S1 120 8 26.086957
5 B S2 180 20 21.176471
6 A S1 90 7 19.565217
7 B S2 220 16 25.882353
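Because transform returns one value per row, it is also handy for broadcasting a group statistic back onto the original rows, e.g. flagging sales above the category average (the new column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
})

# Broadcast each category's mean Sales back to its rows...
df['CategoryMean'] = df.groupby('Category')['Sales'].transform('mean')
# ...then compare row by row
df['AboveAvg'] = df['Sales'] > df['CategoryMean']
print(df)
```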
Filtering
The filter method lets you select groups based on a condition:
# Select only groups where total sales is greater than 400
filtered_df = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 400)
print("Groups with total sales > 400:")
print(filtered_df)
Output:
Groups with total sales > 400:
Category Store Sales Items SalesPercent
0 A S1 100 10 21.739130
1 B S1 200 15 23.529412
2 A S2 150 12 32.608696
3 B S2 250 18 29.411765
4 A S1 120 8 26.086957
5 B S2 180 20 21.176471
6 A S1 90 7 19.565217
7 B S2 220 16 25.882353
Both categories have total sales > 400, so all rows are retained. Let's try a different threshold:
# Select only groups where total sales is greater than 500
filtered_df = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 500)
print("Groups with total sales > 500:")
print(filtered_df)
Output:
Groups with total sales > 500:
Category Store Sales Items SalesPercent
1 B S1 200 15 23.529412
3 B S2 250 18 29.411765
5 B S2 180 20 21.176471
7 B S2 220 16 25.882353
Real-World Example: Sales Analysis
Let's apply what we've learned to a more realistic example. We'll analyze a sales dataset:
# Create a more realistic sales dataset
np.random.seed(42)
products = ['Widget', 'Gadget', 'Doohickey', 'Thingamajig']
regions = ['North', 'South', 'East', 'West']
dates = pd.date_range('2023-01-01', periods=30)
sales_data = {
    'Date': np.random.choice(dates, size=100),
    'Product': np.random.choice(products, size=100),
    'Region': np.random.choice(regions, size=100),
    'Units': np.random.randint(1, 50, size=100),
    'Revenue': np.random.randint(100, 1000, size=100),
    'Cost': np.random.randint(50, 500, size=100)
}
sales_df = pd.DataFrame(sales_data)
sales_df['Profit'] = sales_df['Revenue'] - sales_df['Cost']
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
sales_df['Month'] = sales_df['Date'].dt.month
print("Sample of sales data:")
print(sales_df.head())
Output:
Sample of sales data:
Date Product Region Units Revenue Cost Profit Month
0 2023-01-14 Widget North 37 610 255 355 1
1 2023-01-24 Thingamajig East 29 350 475 -125 1
2 2023-01-18 Widget South 23 363 456 -93 1
3 2023-01-18 Doohickey East 34 525 372 153 1
4 2023-01-13 Gadget South 28 911 400 511 1
Now, let's analyze this data using GroupBy:
# 1. Total sales by product
product_sales = sales_df.groupby('Product').agg({
    'Units': 'sum',
    'Revenue': 'sum',
    'Profit': ['sum', 'mean']
}).sort_values(('Revenue', 'sum'), ascending=False)
print("Sales by Product:")
print(product_sales)
# 2. Monthly performance by region
monthly_region = sales_df.groupby(['Month', 'Region']).agg({
    'Revenue': 'sum',
    'Profit': 'sum'
})
print("\nMonthly performance by Region:")
print(monthly_region)
# 3. Find the most profitable product in each region
best_products = sales_df.groupby(['Region', 'Product'])['Profit'].sum().reset_index()
most_profitable = best_products.loc[best_products.groupby('Region')['Profit'].idxmax()]
print("\nMost profitable product by Region:")
print(most_profitable)
# 4. Calculate profitability ratios
sales_df['Profit_Margin'] = sales_df.groupby('Product')['Profit'].transform(
    lambda x: x.sum() / sales_df.loc[x.index, 'Revenue'].sum() * 100
)
print("\nProduct profitability:")
print(sales_df.groupby('Product')[['Revenue', 'Profit', 'Profit_Margin']].mean())
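The same per-product margin can also be computed directly on an aggregated frame, which avoids the closure over sales_df inside the lambda. Here is a sketch of that alternative, using a small hypothetical stand-in for the sales data:

```python
import pandas as pd

# Small stand-in for the sales_df built above (values are made up)
sales_df = pd.DataFrame({
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Revenue': [500, 500, 400, 600],
    'Profit': [100, 150, 40, 160],
})

# Aggregate first, then compute the ratio of the totals
totals = sales_df.groupby('Product')[['Revenue', 'Profit']].sum()
totals['Margin'] = totals['Profit'] / totals['Revenue'] * 100
print(totals)
# Widget: 250 / 1000 * 100 = 25.0; Gadget: 200 / 1000 * 100 = 20.0
```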
This would give you a comprehensive analysis of your sales data across different dimensions.
Summary
The GroupBy functionality in pandas is a powerful tool for data analysis that allows you to:
- Group your data based on values in one or more columns
- Apply various aggregation functions to different columns
- Transform your data while maintaining the original structure
- Filter groups based on aggregate conditions
Understanding GroupBy is essential for effective data manipulation and analysis in pandas. It follows the split-apply-combine pattern that is fundamental to data analysis:
- Split your data into groups
- Apply operations to each group
- Combine the results into a new data structure
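To make the pattern concrete, here is a hand-rolled sketch of what df.groupby('Category')['Sales'].sum() does under the hood:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250],
})

# Split: one sub-frame per group key
pieces = {key: grp for key, grp in df.groupby('Category')}
# Apply: run the operation on each piece
sums = {key: grp['Sales'].sum() for key, grp in pieces.items()}
# Combine: assemble the per-group results into one object
result = pd.Series(sums, name='Sales')
print(result)  # A 250, B 450
```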
Exercises
To reinforce your learning, try these exercises:
- Create a dataframe with student grades for different subjects and:
  - Calculate the average grade by subject
  - Find the highest-performing student in each subject
  - Compute each student's percentile rank within each subject
- Using the sales dataset from our real-world example:
  - Identify underperforming products by region
  - Calculate quarter-over-quarter growth by product
  - Find the most consistent product (lowest standard deviation in sales)
- Import a real-world dataset of your choice and apply GroupBy operations to gain insights from the data.
Additional Resources
- Pandas GroupBy Documentation
- 10 Minutes to pandas - GroupBy section
- Python Data Science Handbook - GroupBy Chapter
Happy analyzing!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)