Pandas Categorical Data
Introduction
When working with data in Python, you'll often encounter variables that take on a limited set of possible values, such as days of the week, product categories, or survey responses. While you could represent these as strings or integers, Pandas offers a specialized data type called Categorical that can make your code more efficient and your data analysis more powerful.
In this tutorial, you'll learn how to create, manipulate, and leverage categorical data in Pandas, which can drastically reduce memory usage and speed up operations when working with large datasets.
What Are Categorical Variables?
Categorical variables are variables that can take on only a limited number of possible values. They typically represent qualitative attributes and can be classified into two types:
- Nominal categories: Categories with no inherent order (e.g., country names, product types)
- Ordinal categories: Categories with a meaningful order (e.g., small/medium/large, satisfaction ratings)
Pandas' Categorical data type is specifically designed to efficiently represent such data.
Creating Categorical Data
Basic Creation
Let's start by creating a simple categorical series:
import pandas as pd
import numpy as np
# Create a categorical series from a list
fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'],
                   dtype='category')
print(fruits)
Output:
0 apple
1 orange
2 apple
3 apple
4 banana
5 orange
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']
Examining Categorical Data
You can examine the properties of your categorical data:
# View the categories
print(fruits.cat.categories)
# View the codes (integer representation)
print(fruits.cat.codes)
Output:
Index(['apple', 'banana', 'orange'], dtype='object')
0 0
1 2
2 0
3 0
4 1
5 2
dtype: int8
Notice how pandas represents the categories internally as integers (0, 1, 2), which is much more memory-efficient than storing the full strings.
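To make the relationship between codes and categories concrete, here is a minimal sketch that rebuilds the original labels from the two pieces. The variable names (rebuilt, same, with_missing) are only illustrative.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'],
                   dtype='category')

codes = fruits.cat.codes.to_numpy()   # integer position of each value in the category index
categories = fruits.cat.categories    # the unique labels, stored only once

# Look up each code in the category index to recover the original labels
rebuilt = categories.take(codes)
print(list(rebuilt))

# Build a categorical directly from codes and categories
same = pd.Categorical.from_codes(codes, categories=categories)
print(list(same) == list(fruits))

# Missing values have no category; they are stored with the code -1
with_missing = pd.Series(['apple', None], dtype='category')
print(with_missing.cat.codes.tolist())  # [0, -1]
```

Each label is stored once in the category index, and every row only carries a small integer that points into it.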
Creating with Specific Categories and Order
You can create a categorical with a specific set of categories and order:
# Create with specific categories and order
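sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'],
                  dtype=pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'],
                                            ordered=True))
print(sizes)
# Compare values using the category order. Indexing a single element (sizes[0])
# returns a plain string, so the comparison is done on the Series instead.
print("Medium < Large:", (sizes < 'Large').iloc[0])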
Output:
0 Medium
1 Large
2 Small
3 Medium
4 Large
dtype: category
Categories (3, object): ['Small' < 'Medium' < 'Large']
Medium < Large: True
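Ordered categories support more than a single comparison. The sketch below shows element-wise comparison against a label, filtering, and min/max, and also a pitfall: two elements pulled out of the Series are plain Python strings, so comparing them falls back to alphabetical order.

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'], dtype=size_dtype)

# Comparing the Series against a category label uses the declared order
smaller_than_large = (sizes < 'Large').tolist()
print(smaller_than_large)  # [True, False, True, True, False]

# Boolean masks built this way can be used for filtering
at_least_medium = sizes[sizes >= 'Medium'].tolist()
print(at_least_medium)  # ['Medium', 'Large', 'Medium', 'Large']

# min() and max() also follow the category order
print(sizes.min(), sizes.max())  # Small Large

# Pitfall: extracted elements are plain strings, so this is an alphabetical comparison
print(sizes.iloc[0] < sizes.iloc[1])  # 'Medium' < 'Large' is False alphabetically
```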
Benefits of Categorical Data
Memory Efficiency
One of the biggest advantages of categorical data is memory efficiency. Let's compare:
# Create a large dataset
N = 1_000_000
text_values = np.random.choice(['apple', 'orange', 'banana', 'pear', 'grape'], N)
# Create as object (string) and categorical series
text_series = pd.Series(text_values)
cat_series = pd.Series(text_values, dtype='category')
# Compare memory usage
print(f"Object dtype: {text_series.memory_usage(deep=True) / 1e6:.2f} MB")
print(f"Category dtype: {cat_series.memory_usage(deep=True) / 1e6:.2f} MB")
Output:
Object dtype: 78.23 MB
Category dtype: 8.06 MB
As you can see, the categorical representation uses significantly less memory! The exact figures depend on your pandas version and platform, so your own numbers may differ.
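Much of the saving comes from the integer codes being stored in a small integer type. As a rough sketch, assuming the behaviour of current pandas releases, the width of the codes grows with the number of categories:

```python
import numpy as np
import pandas as pd

# 3 categories: the codes fit into one byte each
few = pd.Series(['a', 'b', 'c'] * 100, dtype='category')

# 300 categories: too many for int8, so a wider integer type is used
many = pd.Series(np.arange(3000) % 300).astype('category')

print(few.cat.codes.dtype)               # int8
print(many.cat.codes.dtype)              # int16
print(few.cat.codes.to_numpy().nbytes)   # 300 bytes for 300 rows
```

On top of the codes, the categories themselves are stored once, so the total cost is roughly the codes plus one copy of each distinct label.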
Performance
Categorical data can also lead to improved performance for certain operations:
# Timing comparison for value_counts operation
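import time
# perf_counter() is a monotonic, high-resolution clock suited to short timings
start = time.perf_counter()
text_series.value_counts()
text_time = time.perf_counter() - start
start = time.perf_counter()
cat_series.value_counts()
cat_time = time.perf_counter() - start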
print(f"Object dtype: {text_time:.5f} seconds")
print(f"Category dtype: {cat_time:.5f} seconds")
Output:
Object dtype: 0.04532 seconds
Category dtype: 0.00128 seconds
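A single timed run can be noisy. The following sketch uses the standard library's timeit module to repeat the measurement and keep the best run. The dataset size and repeat counts are arbitrary choices for illustration, and the timings you see will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.choice(['apple', 'orange', 'banana', 'pear', 'grape'], 200_000)

text_series = pd.Series(values, dtype=object)
cat_series = text_series.astype('category')

runs = 5
# timeit.repeat returns one total per repetition; take the fastest and average over runs
text_time = min(timeit.repeat(text_series.value_counts, number=runs, repeat=3)) / runs
cat_time = min(timeit.repeat(cat_series.value_counts, number=runs, repeat=3)) / runs

print(f"Object dtype:   {text_time:.5f} seconds per call")
print(f"Category dtype: {cat_time:.5f} seconds per call")
```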
Working with Categorical Data
Converting Existing Data
You can convert existing columns to categorical type:
# Create a DataFrame with string data
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'] * 2,
    'size': ['small', 'large', 'medium', 'small'] * 2,
    'taste': ['sweet', 'tangy', 'sweet', 'sweet'] * 2
})
# Convert multiple columns to categorical
df[['fruit', 'size', 'taste']] = df[['fruit', 'size', 'taste']].astype('category')
print(df.dtypes)
Output:
fruit category
size category
taste category
dtype: object
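Converting with the string 'category' always produces unordered categories in sorted order. If some columns need an explicit order, one option is to pass a per-column mapping to astype. This is a minimal sketch; the name size_dtype is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'] * 2,
    'size': ['small', 'large', 'medium', 'small'] * 2,
})

# An ordered dtype for the column whose values have a natural sequence
size_dtype = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)

# Map each column to the dtype it should get
df = df.astype({'fruit': 'category', 'size': size_dtype})

print(df['size'].cat.categories.tolist())  # ['small', 'medium', 'large']
print(df['size'].cat.ordered)              # True
print((df['size'] > 'small').sum())        # 4 rows are larger than 'small'
```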
Adding and Removing Categories
You can add or remove categories as needed:
# Original categories
print("Original categories:", df['fruit'].cat.categories.tolist())
# Add new categories
df['fruit'] = df['fruit'].cat.add_categories(['grape', 'melon'])
print("After adding:", df['fruit'].cat.categories.tolist())
# Remove categories
df['fruit'] = df['fruit'].cat.remove_categories(['orange'])
print("After removing:", df['fruit'].cat.categories.tolist())
print(df)
Output:
Original categories: ['apple', 'banana', 'orange']
After adding: ['apple', 'banana', 'orange', 'grape', 'melon']
After removing: ['apple', 'banana', 'grape', 'melon']
fruit size taste
0 apple small sweet
1 NaN large tangy
2 apple medium sweet
3 banana small sweet
4 apple small sweet
5 NaN large tangy
6 apple medium sweet
7 banana small sweet
Notice that the 'orange' values have become NaN since we removed that category.
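After edits like these, a series can carry categories that no longer occur in the data ('grape' and 'melon' above). The sketch below counts the values lost to the removal and then drops the unused categories with remove_unused_categories.

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'banana'] * 2, dtype='category')
fruit = fruit.cat.add_categories(['grape', 'melon'])
fruit = fruit.cat.remove_categories(['orange'])

# Values whose category was removed are now missing
missing_count = int(fruit.isna().sum())
print(missing_count)  # 2

# Categories that never occur in the data can be dropped in one call
cleaned = fruit.cat.remove_unused_categories()
print(cleaned.cat.categories.tolist())  # ['apple', 'banana']
```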
Changing Categories and Orders
You can also rename categories or change their order:
# Rename categories
df['size'] = df['size'].cat.rename_categories({'small': 'S', 'medium': 'M', 'large': 'L'})
print(df['size'].cat.categories)
# Mark the categories as ordered (here L < M < S)
df['size'] = df['size'].cat.reorder_categories(['L', 'M', 'S'], ordered=True)
print(df['size'])
# Compare through the Series; df['size'][0] on its own is a plain string
print("S < M:", (df['size'] < 'M').iloc[0])
Output:
Index(['L', 'M', 'S'], dtype='object')
0 S
1 L
2 M
3 S
4 S
5 L
6 M
7 S
Name: size, dtype: category
Categories (3, object): ['L' < 'M' < 'S']
S < M: False
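Note that reorder_categories expects exactly the same set of categories it already has. The sketch below puts sizes into their natural order and contrasts this with set_categories, which accepts a different set and turns any dropped values into NaN.

```python
import pandas as pd

size = pd.Series(['S', 'L', 'M', 'S'], dtype='category')

# Same labels, new order: S < M < L
natural = size.cat.reorder_categories(['S', 'M', 'L'], ordered=True)
print(natural.sort_values().tolist())  # ['S', 'S', 'M', 'L']

# Leaving out a label is an error for reorder_categories
try:
    size.cat.reorder_categories(['S', 'M'], ordered=True)
except ValueError as exc:
    print("reorder_categories failed:", exc)

# set_categories allows it; values outside the new set become NaN
trimmed = size.cat.set_categories(['S', 'M'], ordered=True)
print(trimmed.isna().tolist())  # [False, True, False, False]
```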
Practical Applications
Encoding for Machine Learning
Categorical data is often used in machine learning. Let's see how to convert categories to dummy variables:
# Create dummy variables for machine learning
# (dtype=int gives 0/1 columns; recent pandas versions return True/False by default)
dummies = pd.get_dummies(df['fruit'], dtype=int)
print(dummies.head())
Output:
apple banana grape melon
0 1 0 0 0
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 0 0 0
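Dummy variables suit nominal categories. For ordinal categories, the integer codes already reflect the order and can serve as a simple encoding. The sketch below combines both; the column name size_code is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': pd.Series(['apple', 'orange', 'banana'], dtype='category'),
    'size': pd.Series(['small', 'large', 'medium'],
                      dtype=pd.CategoricalDtype(['small', 'medium', 'large'],
                                                ordered=True)),
})

# Ordinal column: the codes follow the declared order (small=0, medium=1, large=2)
df['size_code'] = df['size'].cat.codes

# Nominal column: one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=['fruit'], prefix='fruit', dtype=int)

print(df['size_code'].tolist())   # [0, 2, 1]
print(encoded.columns.tolist())
```

Whether integer codes are appropriate depends on the model: they imply equal spacing between levels, which may or may not match your data.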
Sorting Based on Categories
Ordered categories enable meaningful sorting:
# Create ratings data
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good'],
                    dtype=pd.CategoricalDtype(categories=['Poor', 'Good', 'Excellent'],
                                              ordered=True))
# Create a DataFrame with ratings
reviews = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'rating': ratings
})
# Sort by ratings
sorted_reviews = reviews.sort_values('rating')
print(sorted_reviews)
Output:
product rating
1 Widget B Poor
0 Widget A Good
3 Widget D Good
2 Widget C Excellent
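To see why the ordered dtype matters here, compare it with the same labels stored as plain strings, which sort alphabetically. A small sketch:

```python
import pandas as pd

labels = ['Good', 'Poor', 'Excellent', 'Good']

as_text = pd.Series(labels, dtype=object)
as_rating = pd.Series(labels,
                      dtype=pd.CategoricalDtype(['Poor', 'Good', 'Excellent'],
                                                ordered=True))

# Plain strings sort alphabetically, which puts 'Excellent' first
print(as_text.sort_values().tolist())    # ['Excellent', 'Good', 'Good', 'Poor']

# The ordered categorical sorts from worst to best
print(as_rating.sort_values().tolist())  # ['Poor', 'Good', 'Good', 'Excellent']
```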
Grouping and Aggregation
Categorical data works well with group-by operations:
# Create sales data
sales = pd.DataFrame({
    'product_category': pd.Series(['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing'],
                                  dtype='category'),
    'sales_amount': [1200, 800, 1500, 950, 1100]
})
# Group by category and calculate statistics
# (observed=True is passed explicitly because its default differs between pandas versions)
category_stats = sales.groupby('product_category', observed=True).agg({
    'sales_amount': ['sum', 'mean', 'count']
})
print(category_stats)
Output:
sales_amount
sum mean count
product_category
Clothing 1900 950.0 2
Electronics 2700 1350.0 2
Home 950 950.0 1
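When grouping by a categorical column, the observed parameter of groupby controls whether categories that never occur in the data still appear in the result. The sketch below adds a hypothetical 'Toys' category with no sales to show the difference.

```python
import pandas as pd

# 'Toys' is declared as a category but never appears in the data
category_dtype = pd.CategoricalDtype(['Clothing', 'Electronics', 'Home', 'Toys'])

sales = pd.DataFrame({
    'product_category': pd.Series(
        ['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing'],
        dtype=category_dtype),
    'sales_amount': [1200, 800, 1500, 950, 1100],
})

# observed=False: every declared category gets a row, even without data
all_categories = sales.groupby('product_category', observed=False)['sales_amount'].sum()
print(all_categories.to_dict())

# observed=True: only categories that actually occur are included
seen_only = sales.groupby('product_category', observed=True)['sales_amount'].sum()
print(seen_only.to_dict())
```

Keeping unobserved categories is handy for reports that should always list every category; dropping them keeps results compact when there are many declared categories.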
Best Practices for Categorical Data
Here are some tips for working effectively with categorical data:
- Use categorical type for columns with repeated values, especially string columns with many duplicates
- Specify category order when the categories have a meaningful sequence
- Consider memory usage: convert high-cardinality columns (with many unique values) to categorical only if those values are repeated frequently
- Leverage categorical columns for faster groupby operations
- For machine learning, remember to convert categorical variables appropriately (for example, dummy variables for nominal categories)
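The memory tip can be checked directly. The sketch below compares a column where every value is unique with one that repeats three labels; the helper name memory_before_after and the sample data are made up for illustration.

```python
import pandas as pd

n = 50_000
unique_ids = pd.Series([f"user_{i}" for i in range(n)], dtype=object)  # every value distinct
colors = pd.Series(["red", "green", "blue"] * (n // 3), dtype=object)  # three repeated labels

def memory_before_after(series):
    """Return memory in bytes as object dtype and after conversion to category."""
    before = series.memory_usage(deep=True)
    after = series.astype('category').memory_usage(deep=True)
    return before, after

ids_before, ids_after = memory_before_after(unique_ids)
colors_before, colors_after = memory_before_after(colors)

print(f"unique ids: {ids_before} -> {ids_after} bytes")
print(f"colors:     {colors_before} -> {colors_after} bytes")
```

For the all-unique column the categorical version is no smaller, since it stores every label once plus a code per row, while the repeated labels shrink substantially.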
Summary
Pandas' Categorical data type offers a powerful way to work with limited-value variables, providing:
- Memory efficiency by representing repeated values as integers
- Semantic meaning through explicit labeling of categories
- Ordered operations for comparison and sorting
- Performance benefits for aggregations and other operations
When working with datasets containing repeated text values or a limited set of options, consider using the categorical data type to make your pandas code more efficient and your intentions more explicit.
Exercises
- Create a categorical series for days of the week that preserves their natural order.
- Convert a DataFrame column containing country names to a categorical type and check the memory savings.
- Create an ordered categorical for T-shirt sizes (XS, S, M, L, XL) and demonstrate comparison operations.
- Practice adding, removing, and reordering categories in a categorical series.
- Create a DataFrame with multiple categorical columns and experiment with groupby operations.