Pandas Categorical Data

Introduction

When working with data in Python, you'll often encounter variables that take on a limited set of possible values, such as days of the week, product categories, or survey responses. While you could represent these as strings or integers, Pandas offers a specialized data type called Categorical that can make your code more efficient and your data analysis more powerful.

In this tutorial, you'll learn how to create, manipulate, and leverage categorical data in Pandas, which can drastically reduce memory usage and speed up operations when working with large datasets.

What Are Categorical Variables?

Categorical variables are variables that can take on only a limited number of possible values. They typically represent qualitative attributes and can be classified into two types:

  1. Nominal categories: Categories with no inherent order (e.g., country names, product types)
  2. Ordinal categories: Categories with a meaningful order (e.g., small/medium/large, satisfaction ratings)

Pandas' Categorical data type is specifically designed to efficiently represent such data.

Creating Categorical Data

Basic Creation

Let's start by creating a simple categorical series:

python
import pandas as pd
import numpy as np

# Create a categorical series from a list
fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'],
                   dtype='category')
print(fruits)

Output:

0     apple
1    orange
2     apple
3     apple
4    banana
5    orange
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']

Examining Categorical Data

You can examine the properties of your categorical data:

python
# View the categories
print(fruits.cat.categories)

# View the codes (integer representation)
print(fruits.cat.codes)

Output:

Index(['apple', 'banana', 'orange'], dtype='object')
0    0
1    2
2    0
3    0
4    1
5    2
dtype: int8

Notice how pandas represents the categories internally as integers (0, 1, 2), which is much more memory-efficient than storing the full strings.
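To make the relationship between codes and categories concrete, here is a minimal sketch that rebuilds the original labels from the two pieces. It assumes the same `fruits` data as above; the names `rebuilt` and `from_codes` are only illustrative. Missing values, if present, are stored with the code -1.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'],
                   dtype='category')

codes = fruits.cat.codes            # integer position of each value in the categories
categories = fruits.cat.categories  # the unique labels, stored only once

# Look up each code in the categories to recover the original labels
rebuilt = categories.take(codes.to_numpy())
print(list(rebuilt))

# Build a categorical directly from codes and categories
from_codes = pd.Categorical.from_codes([0, 2, 0, 0, 1, 2], categories=categories)
print(list(from_codes))
```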

Creating with Specific Categories and Order

You can create a categorical with a specific set of categories and order:

python
# Create with specific categories and order
sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'],
                  dtype=pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'],
                                            ordered=True))
print(sizes)

# Compare values
print("Medium < Large:", sizes[0] < sizes[1])

Output:

0    Medium
1     Large
2     Small
3    Medium
4     Large
dtype: category
Categories (3, object): ['Small' < 'Medium' < 'Large']

Medium < Large: True
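Ordered categories also support comparisons against a single label and order-aware reductions. The following sketch assumes the same sizes as above; `size_type` and `at_least_medium` are illustrative names.

```python
import pandas as pd

size_type = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'], dtype=size_type)

# Compare every element against a single category label
at_least_medium = sizes[sizes >= 'Medium']
print(at_least_medium.tolist())

# min() and max() follow the category order, not alphabetical order
print(sizes.min(), sizes.max())
```

Comparing against a label that is not one of the categories raises an error, so it helps to check the category list first.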

Benefits of Categorical Data

Memory Efficiency

One of the biggest advantages of categorical data is memory efficiency. Let's compare:

python
# Create a large dataset
N = 1_000_000
text_values = np.random.choice(['apple', 'orange', 'banana', 'pear', 'grape'], N)

# Create as object (string) and categorical series
text_series = pd.Series(text_values)
cat_series = pd.Series(text_values, dtype='category')

# Compare memory usage
print(f"Object dtype: {text_series.memory_usage(deep=True) / 1e6:.2f} MB")
print(f"Category dtype: {cat_series.memory_usage(deep=True) / 1e6:.2f} MB")

Output:

Object dtype: 78.23 MB
Category dtype: 8.06 MB

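As you can see, the categorical representation uses significantly less memory. Exact figures vary with your pandas version and platform, and the categorical figure can be considerably lower than the one shown here.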
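To see where the savings come from, you can inspect the two parts of a categorical series separately. This is a rough sketch that assumes the same five fruit labels; with so few categories, pandas stores each code in a single byte.

```python
import numpy as np
import pandas as pd

N = 1_000_000
values = np.random.choice(['apple', 'orange', 'banana', 'pear', 'grape'], N)
cat_series = pd.Series(values, dtype='category')

# One small integer per row...
codes = cat_series.cat.codes
print(codes.dtype, codes.to_numpy().nbytes, "bytes for the codes")

# ...plus the labels themselves, stored only once
print(len(cat_series.cat.categories), "distinct labels")
```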

Performance

Categorical data can also lead to improved performance for certain operations:

python
# Timing comparison for value_counts operation
import time

# perf_counter() is a high-resolution clock intended for measuring short durations
start = time.perf_counter()
text_series.value_counts()
text_time = time.perf_counter() - start

start = time.perf_counter()
cat_series.value_counts()
cat_time = time.perf_counter() - start

print(f"Object dtype: {text_time:.5f} seconds")
print(f"Category dtype: {cat_time:.5f} seconds")

Output:

Object dtype: 0.04532 seconds
Category dtype: 0.00128 seconds
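A single timed call can be noisy, so the numbers above will differ from machine to machine and run to run. One way to get steadier measurements is the standard library's `timeit` module. The sketch below uses a smaller, hypothetical 100,000-row sample to keep it quick.

```python
import timeit

import numpy as np
import pandas as pd

values = np.random.choice(['apple', 'orange', 'banana', 'pear', 'grape'], 100_000)
text_series = pd.Series(values, dtype=object)
cat_series = text_series.astype('category')

# Run each call several times and keep the fastest run to reduce noise
text_best = min(timeit.repeat(text_series.value_counts, number=5, repeat=3))
cat_best = min(timeit.repeat(cat_series.value_counts, number=5, repeat=3))

print(f"Object dtype:   {text_best / 5:.5f} seconds per call")
print(f"Category dtype: {cat_best / 5:.5f} seconds per call")
```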

Working with Categorical Data

Converting Existing Data

You can convert existing columns to categorical type:

python
# Create a DataFrame with string data
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'] * 2,
    'size': ['small', 'large', 'medium', 'small'] * 2,
    'taste': ['sweet', 'tangy', 'sweet', 'sweet'] * 2
})

# Convert multiple columns to categorical
df[['fruit', 'size', 'taste']] = df[['fruit', 'size', 'taste']].astype('category')
print(df.dtypes)

Output:

fruit    category
size     category
taste    category
dtype: object
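When a DataFrame mixes text and numeric columns, you may want to convert only some of them. A possible approach, sketched here with a hypothetical `price` column and an illustrative `text_columns` list, is to pass `astype` a mapping from column name to dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'] * 2,
    'size': ['small', 'large', 'medium', 'small'] * 2,
    'price': [1.2, 0.8, 1.1, 0.5] * 2,   # hypothetical numeric column
})

# Map each chosen column name to the target dtype; other columns are untouched
text_columns = ['fruit', 'size']
df = df.astype({name: 'category' for name in text_columns})
print(df.dtypes)
```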

Adding and Removing Categories

You can add or remove categories as needed:

python
# Original categories
print("Original categories:", df['fruit'].cat.categories.tolist())

# Add new categories
df['fruit'] = df['fruit'].cat.add_categories(['grape', 'melon'])
print("After adding:", df['fruit'].cat.categories.tolist())

# Remove categories
df['fruit'] = df['fruit'].cat.remove_categories(['orange'])
print("After removing:", df['fruit'].cat.categories.tolist())
print(df)

Output:

Original categories: ['apple', 'banana', 'orange']
After adding: ['apple', 'banana', 'orange', 'grape', 'melon']
After removing: ['apple', 'banana', 'grape', 'melon']
    fruit    size  taste
0   apple   small  sweet
1     NaN   large  tangy
2   apple  medium  sweet
3  banana   small  sweet
4   apple   small  sweet
5     NaN   large  tangy
6   apple  medium  sweet
7  banana   small  sweet

Notice that 'orange' values have become NaN since we removed that category.
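The reverse situation also matters: assigning a label that is not among the categories raises an error, so the category has to exist first. The sketch below uses a hypothetical 'kiwi' label to show one way to guard an assignment, and then drops the categories that no value uses.

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'banana'], dtype='category')
fruit = fruit.cat.add_categories(['grape', 'melon'])

# Check for labels outside the category set before assigning
new_value = 'kiwi'   # hypothetical label, not one of the categories
if new_value not in fruit.cat.categories:
    fruit = fruit.cat.add_categories([new_value])
fruit[0] = new_value

# Drop categories that no value uses ('grape' and 'melon' here)
fruit = fruit.cat.remove_unused_categories()
print(fruit.cat.categories.tolist())
```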

Changing Categories and Orders

You can also rename categories or change their order:

python
# Rename categories
df['size'] = df['size'].cat.rename_categories({'small': 'S', 'medium': 'M', 'large': 'L'})
print(df['size'].cat.categories)

# Reorder the categories and mark them as ordered; 'L' ranks lowest and 'S' highest
df['size'] = df['size'].cat.reorder_categories(['L', 'M', 'S'], ordered=True)
print(df['size'])
print("S < M:", df['size'][0] < df['size'][2])

Output:

Index(['L', 'M', 'S'], dtype='object')
0    S
1    L
2    M
3    S
4    S
5    L
6    M
7    S
Name: size, dtype: category
Categories (3, object): ['L' < 'M' < 'S']

S < M: False
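`reorder_categories` expects exactly the existing labels, only rearranged. If you also need to add or drop labels in the same step, `set_categories` is one option. This sketch assumes a small standalone series and a hypothetical 'XL' label.

```python
import pandas as pd

size = pd.Series(['S', 'L', 'M', 'S'], dtype='category')

# reorder_categories needs exactly the existing labels, just rearranged
ordered_size = size.cat.reorder_categories(['S', 'M', 'L'], ordered=True)
print(ordered_size.cat.ordered)

# set_categories replaces the whole set: 'XL' is added, and any value
# whose label is missing from the new list would become NaN
extended = ordered_size.cat.set_categories(['S', 'M', 'L', 'XL'], ordered=True)
print(extended.cat.categories.tolist())
print(extended.max())
```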

Practical Applications

Encoding for Machine Learning

Categorical data is often used in machine learning. Let's see how to convert categories to dummy variables:

python
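# Create dummy variables for machine learning
# dtype=int yields 0/1 columns; recent pandas versions default to True/False
dummies = pd.get_dummies(df['fruit'], dtype=int)
print(dummies.head())

Output:

   apple  banana  grape  melon
0      1       0      0      0
1      0       0      0      0
2      1       0      0      0
3      0       1      0      0
4      1       0      0      0

Row 1 is all zeros because its value is NaN after the 'orange' category was removed.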
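Dummy variables suit nominal categories. For ordered categories, a simple alternative is to use the integer codes directly, since they follow the category order. The sketch below assumes a three-level rating scale; `rating_type` and `encoded` are illustrative names.

```python
import pandas as pd

rating_type = pd.CategoricalDtype(categories=['Poor', 'Good', 'Excellent'], ordered=True)
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good'], dtype=rating_type)

# Integer codes follow the category order: Poor=0, Good=1, Excellent=2
encoded = ratings.cat.codes
print(encoded.tolist())
```

Whether evenly spaced integers are an appropriate encoding depends on the model and on how meaningful the gaps between levels are.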

Sorting Based on Categories

Ordered categories enable meaningful sorting:

python
# Create ratings data
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good'],
                    dtype=pd.CategoricalDtype(categories=['Poor', 'Good', 'Excellent'],
                                              ordered=True))

# Create a DataFrame with ratings
reviews = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'rating': ratings
})

# Sort by ratings
sorted_reviews = reviews.sort_values('rating')
print(sorted_reviews)

Output:

    product     rating
1  Widget B       Poor
0  Widget A       Good
3  Widget D       Good
2  Widget C  Excellent
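To see why the ordering matters, compare this with sorting the same labels as plain strings. In this small sketch (`as_text` and `as_rating` are illustrative names), the strings sort alphabetically while the ordered categorical sorts by rank.

```python
import pandas as pd

labels = ['Good', 'Poor', 'Excellent', 'Good']
as_text = pd.Series(labels)
as_rating = as_text.astype(
    pd.CategoricalDtype(categories=['Poor', 'Good', 'Excellent'], ordered=True)
)

# Plain strings sort alphabetically
print(as_text.sort_values().tolist())

# Ordered categories sort by their declared rank
print(as_rating.sort_values().tolist())
```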

Grouping and Aggregation

Categorical data works well with group-by operations:

python
# Create sales data
sales = pd.DataFrame({
    'product_category': pd.Series(['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing'],
                                  dtype='category'),
    'sales_amount': [1200, 800, 1500, 950, 1100]
})

# Group by category and calculate statistics
# observed=True limits the result to categories that actually occur
category_stats = sales.groupby('product_category', observed=True).agg({
    'sales_amount': ['sum', 'mean', 'count']
})
print(category_stats)

Output:

                 sales_amount
                          sum    mean count
product_category
Clothing                 1900   950.0     2
Electronics              2700  1350.0     2
Home                      950   950.0     1
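One detail specific to categorical group keys is the `observed` parameter of `groupby`, which controls whether categories with no rows appear in the result. The sketch below declares a hypothetical 'Toys' category and leaves 'Home' without rows to show the difference.

```python
import pandas as pd

category_type = pd.CategoricalDtype(categories=['Clothing', 'Electronics', 'Home', 'Toys'])
sales = pd.DataFrame({
    'product_category': pd.Series(['Electronics', 'Clothing', 'Electronics'],
                                  dtype=category_type),
    'sales_amount': [1200, 800, 1500],
})

# observed=False keeps every declared category, even those with no rows
all_categories = sales.groupby('product_category', observed=False)['sales_amount'].sum()
print(all_categories)

# observed=True keeps only categories that actually occur
seen_only = sales.groupby('product_category', observed=True)['sales_amount'].sum()
print(seen_only)
```

The default for `observed` has changed between pandas versions, so passing it explicitly keeps the result predictable.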

Best Practices for Categorical Data

Here are some tips for working effectively with categorical data:

  1. Use the categorical type for columns with repeated values, especially string columns with many duplicates
  2. Specify category order when the categories have a meaningful sequence
  3. Consider memory usage: convert high-cardinality columns (with many unique values) to categorical only if those values are repeated frequently
  4. Leverage categorical columns for faster groupby operations
  5. For machine learning, encode categorical variables numerically (for example, as dummy variables) before passing them to models that expect numeric input
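The third tip can be checked directly. In the sketch below, every value in a hypothetical `unique_ids` column is distinct, so the categorical version has to store every label plus a code for each row and will usually not save memory.

```python
import pandas as pd

# Hypothetical column where every value is unique
unique_ids = pd.Series([f"user_{i}" for i in range(100_000)], dtype=object)
as_category = unique_ids.astype('category')

# There are as many categories as rows, so nothing is shared between rows
print(len(as_category.cat.categories), "categories for", len(unique_ids), "rows")

object_bytes = unique_ids.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"Object dtype:   {object_bytes / 1e6:.2f} MB")
print(f"Category dtype: {category_bytes / 1e6:.2f} MB")
```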

Summary

Pandas' Categorical data type offers a powerful way to work with limited-value variables, providing:

  • Memory efficiency by representing repeated values as integers
  • Semantic meaning through explicit labeling of categories
  • Ordered operations for comparison and sorting
  • Performance benefits for aggregations and other operations

When working with datasets containing repeated text values or a limited set of options, consider using the categorical data type to make your pandas code more efficient and your intentions more explicit.

Exercises

  1. Create a categorical series for days of the week that preserves their natural order.
  2. Convert a DataFrame column containing country names to a categorical type and check the memory savings.
  3. Create an ordered categorical for T-shirt sizes (XS, S, M, L, XL) and demonstrate comparison operations.
  4. Practice adding, removing, and reordering categories in a categorical series.
  5. Create a DataFrame with multiple categorical columns and experiment with groupby operations.
