Pandas Groupby Basics
Introduction
When analyzing data, we often need to split our data into groups based on certain criteria, apply some operations to each group, and then potentially combine the results. This split-apply-combine pattern is at the heart of data analysis, and pandas provides a powerful tool for this purpose: the GroupBy functionality.
In this tutorial, we'll explore the basics of pandas' GroupBy operation, which allows you to:
- Group data based on values in one or more columns
- Perform aggregation operations on these groups
- Transform data within these groups
- Filter groups based on certain conditions
Prerequisites
Before we begin, make sure you have pandas installed:
# Install pandas if you haven't already
# !pip install pandas
# Import the necessary libraries
import pandas as pd
import numpy as np
Creating a Sample Dataset
Let's create a sample dataset to work with throughout this tutorial:
# Create a sample dataframe
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16]
}
df = pd.DataFrame(data)
print(df)
Output:
Category Store Sales Items
0 A S1 100 10
1 B S1 200 15
2 A S2 150 12
3 B S2 250 18
4 A S1 120 8
5 B S2 180 20
6 A S1 90 7
7 B S2 220 16
Basic GroupBy Operations
Creating a GroupBy Object
The first step in using GroupBy is to create a GroupBy object by specifying one or more columns to group by:
# Group by a single column
grouped_by_category = df.groupby('Category')
# Group by multiple columns
grouped_by_category_store = df.groupby(['Category', 'Store'])
A GroupBy object itself doesn't display much useful information when printed, but it's a powerful object that we can use for various operations.
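For instance, printing a GroupBy object only shows its type and memory address, but attributes like `ngroups` and `groups` reveal what the grouping contains (a quick sketch using the sample dataframe from above):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

grouped = df.groupby('Category')
print(grouped)          # just the object's type and memory address
print(grouped.ngroups)  # number of groups: 2
print(grouped.groups)   # mapping of group label -> row index labels
```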
Basic Aggregations
The most common operation with GroupBy is aggregation. Let's see how we can aggregate our data:
# Calculate sum for each group (numeric columns only)
sum_by_category = df.groupby('Category').sum(numeric_only=True)
print("Sum by Category:")
print(sum_by_category)
Output:
Sum by Category:
Sales Items
Category
A 460 37
B 850 69
# Calculate mean for each group (numeric columns only; mean is
# undefined for the string column 'Store')
mean_by_category = df.groupby('Category').mean(numeric_only=True)
print("\nMean by Category:")
print(mean_by_category)
Output:
Mean by Category:
Sales Items
Category
A 115.0 9.25
B 212.5 17.25
# Calculate multiple aggregations at once on the numeric columns
agg_by_category = df.groupby('Category')[['Sales', 'Items']].agg(['sum', 'mean', 'count'])
print("\nMultiple aggregations by Category:")
print(agg_by_category)
Output:
Multiple aggregations by Category:
Sales Items
sum mean count sum mean count
Category
A 460 115.0 4 37 9.25 4
B 850 212.5 4 69 17.25 4
Grouping by Multiple Columns
Let's see how we can group by multiple columns:
# Group by Category and Store
grouped_multi = df.groupby(['Category', 'Store'])
sum_multi = grouped_multi.sum()
print("Sum by Category and Store:")
print(sum_multi)
Output:
Sum by Category and Store:
Sales Items
Category Store
A S1 310 25
S2 150 12
B S1 200 15
S2 650 54
Notice how the index now has multiple levels. This is a hierarchical or MultiIndex.
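You can select from a MultiIndexed result with a tuple of keys, or flatten it back into ordinary columns with `reset_index()` (a sketch, rebuilding the aggregation above):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

sum_multi = df.groupby(['Category', 'Store']).sum()

# Select a single (Category, Store) combination with a tuple
print(sum_multi.loc[('A', 'S1')])  # Sales 310, Items 25

# Turn the index levels back into regular columns
flat = sum_multi.reset_index()
print(flat)
```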
Accessing Specific Groups
You can access specific groups from a GroupBy object:
# Get the group data for category 'A'
group_a = df.groupby('Category').get_group('A')
print("Data for Category A:")
print(group_a)
Output:
Data for Category A:
Category Store Sales Items
0 A S1 100 10
2 A S2 150 12
4 A S1 120 8
6 A S1 90 7
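When grouping by multiple columns, `get_group` expects a tuple of key values, one per grouping column:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Store': ['S1', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

# Rows where Category is 'A' and Store is 'S1'
group_a_s1 = df.groupby(['Category', 'Store']).get_group(('A', 'S1'))
print(group_a_s1)  # rows 0, 4 and 6
```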
Common Aggregation Methods
Pandas provides many built-in aggregation functions:
# Basic statistical aggregations
print("Count by Category:")
print(df.groupby('Category').count())
print("\nMin by Category:")
print(df.groupby('Category').min())
print("\nMax by Category:")
print(df.groupby('Category').max())
print("\nStandard Deviation by Category:")
print(df.groupby('Category').std(numeric_only=True))
Output:
Count by Category:
Store Sales Items
Category
A 4 4 4
B 4 4 4
Min by Category:
Store Sales Items
Category
A S1 90 7
B S1 180 15
Max by Category:
Store Sales Items
Category
A S2 150 12
B S2 250 20
Standard Deviation by Category:
Sales Items
Category
A 26.457513 2.217356
B 29.860789 2.217356
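If you want the standard statistics in one shot, `describe()` computes count, mean, std, min, quartiles, and max per group:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
})

# One row of summary statistics per group
desc = df.groupby('Category')['Sales'].describe()
print(desc)
```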
Custom Aggregation Functions
You can also define your own aggregation functions:
# Custom aggregation - calculate range (max - min) on the numeric columns
def get_range(x):
    return x.max() - x.min()

print("Range by Category:")
print(df.groupby('Category')[['Sales', 'Items']].agg(get_range))
Output:
Range by Category:
Sales Items
Category
A 60 5
B 70 5
Different Aggregations for Different Columns
You can apply different aggregation functions to different columns:
# Apply different aggregations to different columns
custom_agg = df.groupby('Category').agg({
    'Sales': ['sum', 'mean', 'max'],
    'Items': ['min', 'max', get_range]
})
print("Custom aggregations by column:")
print(custom_agg)
Output:
Custom aggregations by column:
Sales Items
sum mean max min max get_range
Category
A 460 115.0 150 7 12 5
B 850 212.5 250 15 20 5
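A related option (available since pandas 0.25) is named aggregation, which produces flat, readable column names instead of a two-level header; the result column names below are of course your choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
    'Items': [10, 15, 12, 18, 8, 20, 7, 16],
})

# Each keyword becomes an output column: name=(input_column, aggregation)
named = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    item_range=('Items', lambda x: x.max() - x.min()),
)
print(named)
```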
Iterating Through Groups
You can iterate through the groups in a GroupBy object:
# Iterate through the groups
print("Iterating through Category groups:")
for name, group in df.groupby('Category'):
    print(f"\nCategory: {name}")
    print(group)
Output:
Iterating through Category groups:
Category: A
Category Store Sales Items
0 A S1 100 10
2 A S2 150 12
4 A S1 120 8
6 A S1 90 7
Category: B
Category Store Sales Items
1 B S1 200 15
3 B S2 250 18
5 B S2 180 20
7 B S2 220 16
Transformation and Filtering
Transformations
The transform method allows you to perform operations that return data with the same shape as the input:
# Calculate the percentage of sales within each category
df['SalesPercent'] = df.groupby('Category')['Sales'].transform(
    lambda x: x / x.sum() * 100
)
print("DataFrame with Sales Percentage column:")
print(df)
Output:
Category Store Sales Items SalesPercent
0 A S1 100 10 21.739130
1 B S1 200 15 23.529412
2 A S2 150 12 32.608696
3 B S2 250 18 29.411765
4 A S1 120 8 26.086957
5 B S2 180 20 21.176471
6 A S1 90 7 19.565217
7 B S2 220 16 25.882353
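Because transform returns one value per row, it is also handy for broadcasting a group statistic back onto the original rows, e.g. flagging sales above the category average (the new column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180, 90, 220],
})

# Broadcast each category's mean Sales back to its rows...
df['CategoryMean'] = df.groupby('Category')['Sales'].transform('mean')
# ...then compare row by row
df['AboveAvg'] = df['Sales'] > df['CategoryMean']
print(df)
```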
Filtering
The filter method lets you select groups based on a condition:
# Select only groups where total sales is greater than 400
filtered_df = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 400)
print("Groups with total sales > 400:")
print(filtered_df)
Output:
Groups with total sales > 400:
Category Store Sales Items SalesPercent
0 A S1 100 10 21.739130
1 B S1 200 15 23.529412
2 A S2 150 12 32.608696
3 B S2 250 18 29.411765
4 A S1 120 8 26.086957
5 B S2 180 20 21.176471
6 A S1 90 7 19.565217
7 B S2 220 16 25.882353
Both categories have total sales > 400, so all rows are retained. Let's try a different threshold:
# Select only groups where total sales is greater than 500
filtered_df = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 500)
print("Groups with total sales > 500:")
print(filtered_df)
Output:
Groups with total sales > 500:
Category Store Sales Items SalesPercent
1 B S1 200 15 23.529412
3 B S2 250 18 29.411765
5 B S2 180 20 21.176471
7 B S2 220 16 25.882353
Real-World Example: Sales Analysis
Let's apply what we've learned to a more realistic example. We'll analyze a sales dataset:
# Create a more realistic sales dataset
np.random.seed(42)
products = ['Widget', 'Gadget', 'Doohickey', 'Thingamajig']
regions = ['North', 'South', 'East', 'West']
dates = pd.date_range('2023-01-01', periods=30)
sales_data = {
    'Date': np.random.choice(dates, size=100),
    'Product': np.random.choice(products, size=100),
    'Region': np.random.choice(regions, size=100),
    'Units': np.random.randint(1, 50, size=100),
    'Revenue': np.random.randint(100, 1000, size=100),
    'Cost': np.random.randint(50, 500, size=100)
}
sales_df = pd.DataFrame(sales_data)
sales_df['Profit'] = sales_df['Revenue'] - sales_df['Cost']
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
sales_df['Month'] = sales_df['Date'].dt.month
print("Sample of sales data:")
print(sales_df.head())
Output:
Sample of sales data:
Date Product Region Units Revenue Cost Profit Month
0 2023-01-14 Widget North 37 610 255 355 1
1 2023-01-24 Thingamajig East 29 350 475 -125 1
2 2023-01-18 Widget South 23 363 456 -93 1
3 2023-01-18 Doohickey East 34 525 372 153 1
4 2023-01-13 Gadget South 28 911 400 511 1
Now, let's analyze this data using GroupBy:
# 1. Total sales by product
product_sales = sales_df.groupby('Product').agg({
    'Units': 'sum',
    'Revenue': 'sum',
    'Profit': ['sum', 'mean']
}).sort_values(('Revenue', 'sum'), ascending=False)
print("Sales by Product:")
print(product_sales)
# 2. Monthly performance by region
monthly_region = sales_df.groupby(['Month', 'Region']).agg({
    'Revenue': 'sum',
    'Profit': 'sum'
})
print("\nMonthly performance by Region:")
print(monthly_region)
# 3. Find the most profitable product in each region
best_products = sales_df.groupby(['Region', 'Product'])['Profit'].sum().reset_index()
most_profitable = best_products.loc[best_products.groupby('Region')['Profit'].idxmax()]
print("\nMost profitable product by Region:")
print(most_profitable)
# 4. Calculate profitability ratios
sales_df['Profit_Margin'] = sales_df.groupby('Product')['Profit'].transform(
    lambda x: x.sum() / sales_df.loc[x.index, 'Revenue'].sum() * 100
)
print("\nProduct profitability:")
print(sales_df.groupby('Product')[['Revenue', 'Profit', 'Profit_Margin']].mean())
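The same per-product margin can also be computed directly on an aggregated frame, which avoids the closure over sales_df inside the lambda. Here is a sketch of that alternative, using a small hypothetical stand-in for the sales data:

```python
import pandas as pd

# Small stand-in for the sales_df built above (values are made up)
sales_df = pd.DataFrame({
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Revenue': [500, 500, 400, 600],
    'Profit': [100, 150, 40, 160],
})

# Aggregate first, then compute the ratio of the totals
totals = sales_df.groupby('Product')[['Revenue', 'Profit']].sum()
totals['Margin'] = totals['Profit'] / totals['Revenue'] * 100
print(totals)
# Widget: 250 / 1000 * 100 = 25.0; Gadget: 200 / 1000 * 100 = 20.0
```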
This would give you a comprehensive analysis of your sales data across different dimensions.
Summary
The GroupBy functionality in pandas is a powerful tool for data analysis that allows you to:
- Group your data based on values in one or more columns
- Apply various aggregation functions to different columns
- Transform your data while maintaining the original structure
- Filter groups based on aggregate conditions
Understanding GroupBy is essential for effective data manipulation and analysis in pandas. It follows the split-apply-combine pattern that is fundamental to data analysis:
- Split your data into groups
- Apply operations to each group
- Combine the results into a new data structure
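To make the pattern concrete, here is a hand-rolled sketch of what df.groupby('Category')['Sales'].sum() does under the hood:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250],
})

# Split: one sub-frame per group key
pieces = {key: grp for key, grp in df.groupby('Category')}
# Apply: run the operation on each piece
sums = {key: grp['Sales'].sum() for key, grp in pieces.items()}
# Combine: assemble the per-group results into one object
result = pd.Series(sums, name='Sales')
print(result)  # A 250, B 450
```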
Exercises
To reinforce your learning, try these exercises:
- Create a dataframe with student grades for different subjects and:
  - Calculate the average grade by subject
  - Find the highest-performing student in each subject
  - Compute each student's percentile rank within each subject
- Using the sales dataset from our real-world example:
  - Identify underperforming products by region
  - Calculate quarter-over-quarter growth by product
  - Find the most consistent product (lowest standard deviation in sales)
- Import a real-world dataset of your choice and apply GroupBy operations to gain insights from the data.
Additional Resources
- Pandas GroupBy Documentation
- 10 Minutes to pandas - GroupBy section
- Python Data Science Handbook - GroupBy Chapter
Happy analyzing!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)