Pandas Crosstab
Introduction
Cross-tabulation (also known as a contingency table or crosstab) is a statistical technique that displays the frequency distribution of variables in a matrix format. In pandas, the crosstab()
function creates a spreadsheet-style cross-tabulation of two or more variables, making it easier to analyze relationships between categorical variables.
In this tutorial, you'll learn:
- What crosstabs are and why they're useful
- How to create basic and advanced crosstabs using pandas
- How to customize crosstab displays
- Real-world applications of crosstabs in data analysis
Understanding Crosstabs
A crosstab takes two (or more) variables and counts how many times each combination of values appears in your dataset. This creates a table where:
- One variable defines the rows
- Another variable defines the columns
- The cells contain counts (or other aggregated values)
This makes crosstabs perfect for answering questions like:
- "How many male and female customers purchased each product category?"
- "What's the relationship between education level and income bracket?"
- "How do sales vary by region and quarter?"
Basic Crosstab Usage
Let's start with the basic syntax of the pd.crosstab()
function:
pd.crosstab(index, columns, values=None, aggfunc=None, normalize=False)
Where:
index
: Defines the rows of the tablecolumns
: Defines the columns of the tablevalues
: Optional values to aggregate (used withaggfunc
)aggfunc
: Function to aggregatevalues
(e.g., sum, mean)normalize
: Whether to normalize the counts (boolean or string)
Simple Example: Creating a Frequency Table
Let's create a simple dataset and use crosstab:
import pandas as pd
import numpy as np
# Sample data
data = pd.DataFrame({
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C']
})
# Create a basic crosstab
result = pd.crosstab(data.gender, data.product)
print(result)
Output:
product A B C
gender
Female 2 1 2
Male 2 2 1
This shows how many times each gender purchased each product:
- Female customers bought product A 2 times, product B 1 time, and product C 2 times
- Male customers bought product A 2 times, product B 2 times, and product C 1 time
Advanced Crosstab Features
Normalizing Values
You can normalize the counts to get proportions instead of raw counts. The normalize
parameter accepts:
'index'
: Normalize by row totals'columns'
: Normalize by column totalsTrue
or'all'
: Normalize by total countFalse
: No normalization (default)
# Normalize by row (each row sums to 1)
row_normalized = pd.crosstab(data.gender, data.product, normalize='index')
print("Normalized by row:")
print(row_normalized)
# Normalize by column (each column sums to 1)
col_normalized = pd.crosstab(data.gender, data.product, normalize='columns')
print("\nNormalized by column:")
print(col_normalized)
Output:
Normalized by row:
product A B C
gender
Female 0.400000 0.200000 0.400000
Male 0.400000 0.400000 0.200000
Normalized by column:
product A B C
gender
Female 0.500000 0.333333 0.666667
Male 0.500000 0.666667 0.333333
Using Values and Aggregation Functions
You can provide a values
column and an aggregation function to compute values other than counts:
# Sample data with sales amounts
data_with_sales = pd.DataFrame({
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C'],
'sales': [100, 150, 200, 120, 80, 220, 110, 90, 160, 190]
})
# Create a crosstab with sum of sales
sales_by_group = pd.crosstab(
data_with_sales.gender,
data_with_sales.product,
values=data_with_sales.sales,
aggfunc=np.sum
)
print(sales_by_group)
Output:
product A B C
gender
Female 210.0 150.0 410.0
Male 300.0 240.0 110.0
Now instead of counts, the cells show the total sales value for each gender-product combination.
Multiple Indices or Columns
Crosstabs can use multiple variables for indices or columns:
# Add age group to data
data_with_age = pd.DataFrame({
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'product': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'C'],
'age_group': ['<30', '30-50', '30-50', '<30', '>50', '30-50', '<30', '>50', '>50', '<30']
})
# Create a crosstab with multiple indices
multi_index_crosstab = pd.crosstab(
[data_with_age.gender, data_with_age.age_group],
data_with_age.product
)
print(multi_index_crosstab)
Output:
product A B C
gender age_group
Female <30 1 0 1
30-50 0 1 1
>50 1 0 0
Male <30 0 0 1
30-50 1 0 0
>50 1 2 0
This shows counts for each gender-age_group combination against products.
Customizing Crosstabs
Adding Row and Column Margins
The margins
parameter adds row and column totals:
# Add margins to see totals
margins_crosstab = pd.crosstab(
data.gender,
data.product,
margins=True,
margins_name='Total'
)
print(margins_crosstab)
Output:
product A B C Total
gender
Female 2 1 2 5
Male 2 2 1 5
Total 4 3 3 10
Customizing Row and Column Names
You can customize the name of the indices and columns:
# Customize row and column names
named_crosstab = pd.crosstab(
data.gender,
data.product,
rownames=['Gender'],
colnames=['Product Category']
)
print(named_crosstab)
Output:
Product Category A B C
Gender
Female 2 1 2
Male 2 2 1
Real-World Applications
Example 1: Customer Analysis
Let's analyze customer purchasing patterns across different age groups and regions:
# Create customer purchase data
np.random.seed(42)
n = 1000
customer_data = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-50', '51+'], n),
'region': np.random.choice(['North', 'South', 'East', 'West'], n),
'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], n),
'purchase_amount': np.random.normal(100, 50, n).round(2)
})
# Create a crosstab of purchases by age group and region
purchase_distribution = pd.crosstab(
customer_data.age_group,
customer_data.region,
values=customer_data.purchase_amount,
aggfunc='mean',
margins=True
)
print("Average purchase amount by age group and region:")
print(purchase_distribution)
Output:
region East North South West All
age_group
18-25 96.839842 98.511276 102.200603 104.387373 100.428488
26-35 99.457775 102.335876 98.740064 100.479124 100.344327
36-50 100.264843 96.339301 98.214433 96.858037 97.822673
51+ 101.232739 103.107942 101.055898 103.125178 102.047881
All 99.545644 99.811253 100.010790 101.243924 100.126431
This shows the average purchase amount for each age group across different regions, with totals for both dimensions.
Example 2: Survey Analysis
Let's analyze survey responses across different demographics:
# Survey data
survey_data = pd.DataFrame({
'gender': np.random.choice(['Male', 'Female'], 500),
'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 500),
'response': np.random.choice(['Very Satisfied', 'Satisfied', 'Neutral', 'Unsatisfied'], 500),
})
# Calculate percentage of each response by gender and education
response_analysis = pd.crosstab(
[survey_data.gender, survey_data.education],
survey_data.response,
normalize='index'
).round(2)
print("Survey response distribution by gender and education (in %):")
print(response_analysis.multiply(100)) # Convert to percentages
Output (sample - will vary due to random data):
response Neutral Satisfied Unsatisfied Very Satisfied
gender education
Female Bachelor 27.0 24.0 23.0 26.0
High School 24.0 29.0 19.0 28.0
Master 25.0 28.0 23.0 24.0
PhD 22.0 27.0 29.0 22.0
Male Bachelor 25.0 24.0 23.0 28.0
High School 24.0 26.0 28.0 22.0
Master 23.0 23.0 24.0 30.0
PhD 29.0 24.0 21.0 26.0
This analysis shows the distribution of survey responses across different gender and education groups.
Summary
Pandas' crosstab()
function is a powerful tool for analyzing relationships between categorical variables. It enables you to:
- Create frequency tables showing counts of variable combinations
- Calculate proportions using normalization
- Aggregate numeric values using custom functions
- Build multi-dimensional tables with multiple variables
- Customize the appearance with margins and labels
Crosstabs are essential for exploratory data analysis, helping you uncover patterns and relationships within your data that might otherwise remain hidden.
Additional Resources and Exercises
Further Reading
Practice Exercises
-
Basic Crosstab: Create a crosstab showing the relationship between weather conditions and ice cream sales.
-
Normalized Crosstab: Using the Titanic dataset, create a normalized crosstab showing the survival rate by passenger class and gender.
-
Advanced Crosstab: For a store's transaction data, create a crosstab that shows the total revenue by product category and day of week, with margins showing totals.
-
Multi-Index Crosstab: Create a crosstab with multiple indices to analyze movie ratings by genre, release decade, and director nationality.
-
Visualization Challenge: Create a crosstab and then visualize it using a heatmap to show the relationship between two categorical variables in a dataset of your choice.
By mastering crosstabs in pandas, you're adding a powerful data analysis technique to your toolkit that will help you better understand relationships within your data.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)