Pandas Crosstab

Introduction

Cross-tabulation (also known as a contingency table or crosstab) is a statistical technique that displays the frequency distribution of variables in a matrix format. In pandas, the crosstab() function creates a spreadsheet-style cross-tabulation of two or more variables, making it easier to analyze relationships between categorical variables.

In this tutorial, you'll learn:

What crosstabs are and why they're useful
How to create basic and advanced crosstabs using pandas
How to customize crosstab displays
Real-world applications of crosstabs in data analysis

Understanding Crosstabs

A crosstab takes two (or more) variables and counts how many times each combination of values appears in your dataset. This creates a table where:

One variable defines the rows
Another variable defines the columns
The cells contain counts (or other aggregated values)

This makes crosstabs perfect for answering questions like:

"How many male and female customers purchased each product category?"
"What's the relationship between education level and income bracket?"
"How do sales vary by region and quarter?"

Basic Crosstab Usage

Let's start with the basic syntax of the pd.crosstab() function:

pd.crosstab(index, columns, values=None, aggfunc=None, normalize=False)

Where:

index: Defines the rows of the table
columns: Defines the columns of the table
values: Optional values to aggregate (used with aggfunc)
aggfunc: Function to aggregate values (e.g., sum, mean)
normalize: Whether to normalize the counts (boolean or string)

Simple Example: Creating a Frequency Table

Let's create a simple dataset and use crosstab:

import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C']
})

# Create a basic crosstab
result = pd.crosstab(data.gender, data.product)
print(result)

Output:

product  A  B  C
gender         
Female   2  1  2
Male     2  2  1

This shows how many times each gender purchased each product:

Female customers bought product A 2 times, product B 1 time, and product C 2 times
Male customers bought product A 2 times, product B 2 times, and product C 1 time

Advanced Crosstab Features

Normalizing Values

You can normalize the counts to get proportions instead of raw counts. The normalize parameter accepts:

'index': Normalize by row totals
'columns': Normalize by column totals
True or 'all': Normalize by total count
False: No normalization (default)

# Normalize by row (each row sums to 1)
row_normalized = pd.crosstab(data.gender, data.product, normalize='index')
print("Normalized by row:")
print(row_normalized)

# Normalize by column (each column sums to 1)
col_normalized = pd.crosstab(data.gender, data.product, normalize='columns')
print("\nNormalized by column:")
print(col_normalized)

Output:

Normalized by row:
product         A         B         C
gender                              
Female   0.400000  0.200000  0.400000
Male     0.400000  0.400000  0.200000

Normalized by column:
product         A         B         C
gender                              
Female   0.500000  0.333333  0.666667
Male     0.500000  0.666667  0.333333

Using Values and Aggregation Functions

You can provide a values column and an aggregation function to compute values other than counts:

# Sample data with sales amounts
data_with_sales = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C'],
    'sales': [100, 150, 200, 120, 80, 220, 110, 90, 160, 190]
})

# Create a crosstab with sum of sales
sales_by_group = pd.crosstab(
    data_with_sales.gender, 
    data_with_sales.product, 
    values=data_with_sales.sales, 
    aggfunc=np.sum
)
print(sales_by_group)

Output:

product      A      B      C
gender                     
Female   210.0  150.0  410.0
Male     300.0  240.0  110.0

Now instead of counts, the cells show the total sales value for each gender-product combination.

Multiple Indices or Columns

Crosstabs can use multiple variables for indices or columns:

# Add age group to data
data_with_age = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C'],
    'age_group': ['<30', '30-50', '30-50', '<30', '>50', '30-50', '<30', '>50', '>50', '<30']
})

# Create a crosstab with multiple indices
multi_index_crosstab = pd.crosstab(
    [data_with_age.gender, data_with_age.age_group],
    data_with_age.product
)
print(multi_index_crosstab)

Output:

product               A  B  C
gender age_group            
Female <30           1  0  1
       30-50         0  1  1
       >50           1  0  0
Male   <30           0  0  1
       30-50         1  0  0
       >50           1  2  0

This shows counts for each gender-age_group combination against products.

Customizing Crosstabs

Adding Row and Column Margins

The margins parameter adds row and column totals:

# Add margins to see totals
margins_crosstab = pd.crosstab(
    data.gender, 
    data.product, 
    margins=True, 
    margins_name='Total'
)
print(margins_crosstab)

Output:

product    A  B  C  Total
gender                   
Female     2  1  2      5
Male       2  2  1      5
Total      4  3  3     10

Customizing Row and Column Names

You can customize the name of the indices and columns:

# Customize row and column names
named_crosstab = pd.crosstab(
    data.gender, 
    data.product,
    rownames=['Gender'],
    colnames=['Product Category']
)
print(named_crosstab)

Output:

Product Category  A  B  C
Gender                  
Female            2  1  2
Male              2  2  1

Real-World Applications

Example 1: Customer Analysis

Let's analyze customer purchasing patterns across different age groups and regions:

# Create customer purchase data
np.random.seed(42)
n = 1000
customer_data = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-50', '51+'], n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], n),
    'purchase_amount': np.random.normal(100, 50, n).round(2)
})

# Create a crosstab of purchases by age group and region
purchase_distribution = pd.crosstab(
    customer_data.age_group,
    customer_data.region,
    values=customer_data.purchase_amount,
    aggfunc='mean',
    margins=True
)

print("Average purchase amount by age group and region:")
print(purchase_distribution)

Output:

region        East       North       South        West         All
age_group                                                        
18-25     96.839842   98.511276  102.200603  104.387373  100.428488
26-35     99.457775  102.335876   98.740064  100.479124  100.344327
36-50    100.264843   96.339301   98.214433   96.858037   97.822673
51+      101.232739  103.107942  101.055898  103.125178  102.047881
All       99.545644   99.811253  100.010790  101.243924  100.126431

This shows the average purchase amount for each age group across different regions, with totals for both dimensions.

Example 2: Survey Analysis

Let's analyze survey responses across different demographics:

# Survey data
survey_data = pd.DataFrame({
    'gender': np.random.choice(['Male', 'Female'], 500),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 500),
    'response': np.random.choice(['Very Satisfied', 'Satisfied', 'Neutral', 'Unsatisfied'], 500),
})

# Calculate percentage of each response by gender and education
response_analysis = pd.crosstab(
    [survey_data.gender, survey_data.education],
    survey_data.response,
    normalize='index'
).round(2)

print("Survey response distribution by gender and education (in %):")
print(response_analysis.multiply(100))  # Convert to percentages

Output (sample - will vary due to random data):

response               Neutral  Satisfied  Unsatisfied  Very Satisfied
gender education                                                      
Female Bachelor           27.0      24.0         23.0            26.0
       High School        24.0      29.0         19.0            28.0
       Master             25.0      28.0         23.0            24.0
       PhD                22.0      27.0         29.0            22.0
Male   Bachelor           25.0      24.0         23.0            28.0
       High School        24.0      26.0         28.0            22.0
       Master             23.0      23.0         24.0            30.0
       PhD                29.0      24.0         21.0            26.0

This analysis shows the distribution of survey responses across different gender and education groups.

Summary

Pandas' crosstab() function is a powerful tool for analyzing relationships between categorical variables. It enables you to:

Create frequency tables showing counts of variable combinations
Calculate proportions using normalization
Aggregate numeric values using custom functions
Build multi-dimensional tables with multiple variables
Customize the appearance with margins and labels

Crosstabs are essential for exploratory data analysis, helping you uncover patterns and relationships within your data that might otherwise remain hidden.

Additional Resources and Exercises

Practice Exercises

Basic Crosstab: Create a crosstab showing the relationship between weather conditions and ice cream sales.
Normalized Crosstab: Using the Titanic dataset, create a normalized crosstab showing the survival rate by passenger class and gender.
Advanced Crosstab: For a store's transaction data, create a crosstab that shows the total revenue by product category and day of week, with margins showing totals.
Multi-Index Crosstab: Create a crosstab with multiple indices to analyze movie ratings by genre, release decade, and director nationality.
Visualization Challenge: Create a crosstab and then visualize it using a heatmap to show the relationship between two categorical variables in a dataset of your choice.

By mastering crosstabs in pandas, you're adding a powerful data analysis technique to your toolkit that will help you better understand relationships within your data.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Crosstabs​

Basic Crosstab Usage​

Simple Example: Creating a Frequency Table​

Advanced Crosstab Features​

Normalizing Values​

Using Values and Aggregation Functions​

Multiple Indices or Columns​

Customizing Crosstabs​

Adding Row and Column Margins​

Customizing Row and Column Names​

Real-World Applications​

Example 1: Customer Analysis​

Example 2: Survey Analysis​

Summary​

Additional Resources and Exercises​

Further Reading​

Practice Exercises​

Introduction

Understanding Crosstabs

Basic Crosstab Usage

Simple Example: Creating a Frequency Table

Advanced Crosstab Features

Normalizing Values

Using Values and Aggregation Functions

Multiple Indices or Columns

Customizing Crosstabs

Adding Row and Column Margins

Customizing Row and Column Names

Real-World Applications

Example 1: Customer Analysis

Example 2: Survey Analysis

Summary

Additional Resources and Exercises

Further Reading

Practice Exercises