Pandas Categorical Data

Introduction

When working with data in Python, you'll often encounter variables that take on a limited set of possible values - like days of the week, product categories, or survey responses. While you could represent these as strings or integers, Pandas offers a specialized data type called Categorical that can make your code more efficient and your data analysis more powerful.

In this tutorial, you'll learn how to create, manipulate, and leverage categorical data in Pandas, which can drastically reduce memory usage and speed up operations when working with large datasets.

What Are Categorical Variables?

Categorical variables are variables that can take on only a limited number of possible values. They typically represent qualitative attributes and can be classified into two types:

Nominal categories: Categories with no inherent order (e.g., country names, product types)
Ordinal categories: Categories with a meaningful order (e.g., small/medium/large, satisfaction ratings)

Pandas' Categorical data type is specifically designed to efficiently represent such data.
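As a quick illustration of the two types, the sketch below builds one nominal and one ordinal series. The country and level values are made up for the example; only standard pandas calls (`pd.CategoricalDtype`, `.cat.ordered`) are used.

```python
import pandas as pd

# Nominal: no order between the categories
countries = pd.Series(['France', 'Japan', 'France'], dtype='category')

# Ordinal: the order of `categories` defines low < medium < high
level_dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
levels = pd.Series(['low', 'high', 'medium'], dtype=level_dtype)

print(countries.cat.ordered)
print(levels.cat.ordered)

# Inequality comparisons are only defined for ordered categoricals
try:
    countries < 'Japan'
except TypeError as exc:
    print("TypeError:", exc)
```

Equality checks such as `countries == 'Japan'` work for both kinds; only `<`, `>`, `<=`, and `>=` need an order.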

Creating Categorical Data

Basic Creation

Let's start by creating a simple categorical series:

import pandas as pd
import numpy as np

# Create a categorical series from a list
fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'], 
                  dtype='category')
print(fruits)

Output:

0     apple
1    orange
2     apple
3     apple
4    banana
5    orange
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']

Examining Categorical Data

You can examine the properties of your categorical data:

# View the categories
print(fruits.cat.categories)

# View the codes (integer representation)
print(fruits.cat.codes)

Output:

Index(['apple', 'banana', 'orange'], dtype='object')
0    0
1    2
2    0
3    0
4    1
5    2
dtype: int8

Notice how pandas stores each value as a small integer code (0, 1, 2) that points into the list of categories, which is much more memory-efficient than storing the full string on every row.
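To make the link between codes and categories concrete, the following sketch rebuilds the original labels from the two pieces. The variable names are illustrative; `pd.Categorical.from_codes` is the standard constructor for this.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'],
                   dtype='category')

codes = fruits.cat.codes.to_numpy()   # one small integer per row
categories = fruits.cat.categories    # each distinct label stored once

# Look up every code in the categories index to recover the labels
rebuilt = categories[codes]
print(list(rebuilt))

# Build a categorical directly from codes and categories
from_codes = pd.Categorical.from_codes(codes, categories=categories)
print(list(from_codes) == list(fruits))
```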

Creating with Specific Categories and Order

You can create a categorical with a specific set of categories and order:

# Create with specific categories and order
sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'],
                 dtype=pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], 
                                           ordered=True))
print(sizes)

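# Compare values using the category order.
# sizes[0] returns a plain Python string, so sizes[0] < sizes[1] would
# compare 'Medium' and 'Large' alphabetically; compare the Series instead.
print("sizes < 'Large':", (sizes < 'Large').tolist())

Output:

0    Medium
1     Large
2     Small
3    Medium
4     Large
dtype: category
Categories (3, object): ['Small' < 'Medium' < 'Large']

sizes < 'Large': [True, False, True, True, False]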
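An ordered categorical can also be filtered and summarized by its category order. The sketch below reuses the same size labels; `size_dtype` is just a name chosen for this example.

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
sizes = pd.Series(['Medium', 'Large', 'Small', 'Medium', 'Large'], dtype=size_dtype)

# Compare the whole series against a single category label
at_least_medium = sizes[sizes >= 'Medium']
print(at_least_medium.tolist())

# min() and max() follow the category order, not alphabetical order
print(sizes.min(), sizes.max())
```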

Benefits of Categorical Data

Memory Efficiency

One of the biggest advantages of categorical data is memory efficiency. Let's compare:

# Create a large dataset
N = 1_000_000
text_values = np.random.choice(['apple', 'orange', 'banana', 'pear', 'grape'], N)

# Create as object (string) and categorical series
text_series = pd.Series(text_values)
cat_series = pd.Series(text_values, dtype='category')

# Compare memory usage
print(f"Object dtype: {text_series.memory_usage(deep=True) / 1e6:.2f} MB")
print(f"Category dtype: {cat_series.memory_usage(deep=True) / 1e6:.2f} MB")

Output:

Object dtype: 78.23 MB
Category dtype: 8.06 MB

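As you can see, the categorical representation uses significantly less memory. Exact figures depend on your pandas version and platform, but the saving comes from storing one small integer code per row plus a single copy of each category label.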
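To see where the memory goes, this sketch splits a categorical series into its two parts. It uses a smaller, seeded sample than the example above (an assumption made to keep it quick and repeatable).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = ['apple', 'orange', 'banana', 'pear', 'grape']
cat_series = pd.Series(rng.choice(labels, 100_000), dtype='category')

# The per-row cost is the integer code; int8 is enough for a handful of categories
codes_bytes = cat_series.cat.codes.to_numpy().nbytes
# The labels themselves are stored only once, in the categories index
labels_bytes = cat_series.cat.categories.memory_usage(deep=True)

print("codes dtype:", cat_series.cat.codes.dtype)
print("bytes for codes:", codes_bytes)
print("bytes for category labels:", labels_bytes)
```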

Performance

Categorical data can also lead to improved performance for certain operations:

# Timing comparison for value_counts operation
import time

# perf_counter is a high-resolution clock meant for measuring short durations
start = time.perf_counter()
text_series.value_counts()
text_time = time.perf_counter() - start

start = time.perf_counter()
cat_series.value_counts()
cat_time = time.perf_counter() - start

print(f"Object dtype: {text_time:.5f} seconds")
print(f"Category dtype: {cat_time:.5f} seconds")

Output:

Object dtype: 0.04532 seconds
Category dtype: 0.00128 seconds
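A single timed call can be noisy. As a rough sketch, the standard-library `timeit` module can repeat each call and keep the fastest run; the sample size and repeat counts here are arbitrary choices, and your timings will differ.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.choice(['apple', 'orange', 'banana', 'pear', 'grape'], 200_000)
text_series = pd.Series(values, dtype=object)
cat_series = text_series.astype('category')

# Run each call several times and keep the fastest measurement
text_best = min(timeit.repeat(text_series.value_counts, number=5, repeat=3))
cat_best = min(timeit.repeat(cat_series.value_counts, number=5, repeat=3))

print(f"Object dtype:   {text_best:.5f} seconds for 5 calls")
print(f"Category dtype: {cat_best:.5f} seconds for 5 calls")
```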

Working with Categorical Data

Converting Existing Data

You can convert existing columns to categorical type:

# Create a DataFrame with string data
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'] * 2,
    'size': ['small', 'large', 'medium', 'small'] * 2,
    'taste': ['sweet', 'tangy', 'sweet', 'sweet'] * 2
})

# Convert multiple columns to categorical
df[['fruit', 'size', 'taste']] = df[['fruit', 'size', 'taste']].astype('category')
print(df.dtypes)

Output:

fruit    category
size     category
taste    category
dtype: object
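Passing `'category'` lets pandas infer the categories from the data. If you want to control the category list and order during conversion, one option is to pass a `pd.CategoricalDtype` to `astype`. In this sketch the value `'huge'` is deliberately not in the list, to show that unknown values become NaN.

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small', 'huge']})

size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
df['size'] = df['size'].astype(size_dtype)

print(df['size'].cat.categories.tolist())
# Values outside the declared categories are converted to NaN
print(df['size'].isna().sum())
```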

Adding and Removing Categories

You can add or remove categories as needed:

# Original categories
print("Original categories:", df['fruit'].cat.categories.tolist())

# Add new categories
df['fruit'] = df['fruit'].cat.add_categories(['grape', 'melon'])
print("After adding:", df['fruit'].cat.categories.tolist())

# Remove categories
df['fruit'] = df['fruit'].cat.remove_categories(['orange'])
print("After removing:", df['fruit'].cat.categories.tolist())
print(df)

Output:

Original categories: ['apple', 'banana', 'orange']
After adding: ['apple', 'banana', 'orange', 'grape', 'melon']
After removing: ['apple', 'banana', 'grape', 'melon']
    fruit    size  taste
0   apple   small  sweet
1     NaN   large  tangy
2   apple  medium  sweet
3  banana   small  sweet
4   apple   small  sweet
5     NaN   large  tangy
6   apple  medium  sweet
7  banana   small  sweet

Notice that 'orange' values have become NaN since we removed that category.
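Two related accessor methods are worth knowing: `remove_unused_categories` drops categories that no row uses, and `set_categories` replaces the whole list in one step. The sketch below uses a small standalone series; `'kiwi'` is a made-up extra category.

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'banana'], dtype='category')
fruit = fruit.cat.add_categories(['grape', 'melon'])

# Drop categories that no row uses ('grape' and 'melon')
trimmed = fruit.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())

# Replace the category list; values outside the new list become NaN
replaced = fruit.cat.set_categories(['apple', 'banana', 'kiwi'])
print(replaced.tolist())
```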

Changing Categories and Orders

You can also rename categories or change their order:

# Rename categories
df['size'] = df['size'].cat.rename_categories({'small': 'S', 'medium': 'M', 'large': 'L'})
print(df['size'].cat.categories)

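# Reorder the categories and mark them as ordered (here L < M < S)
df['size'] = df['size'].cat.reorder_categories(['L', 'M', 'S'], ordered=True)
print(df['size'])

# df['size'][0] is a plain string, so compare the Series to use the category order
print("size < 'M':", (df['size'] < 'M').tolist())

Output:

Index(['L', 'M', 'S'], dtype='object')
0    S
1    L
2    M
3    S
4    S
5    L
6    M
7    S
Name: size, dtype: category
Categories (3, object): ['L' < 'M' < 'S']

size < 'M': [False, True, False, False, False, True, False, False]

With the order 'L' < 'M' < 'S', only the 'L' rows are less than 'M'.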
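The order used above runs from large to small. As a sketch of the more natural small-to-large order, the example below reorders a standalone series and also shows that `reorder_categories` expects exactly the existing set of categories.

```python
import pandas as pd

size = pd.Series(['S', 'L', 'M', 'S'], dtype='category')

# Same categories, new order: S < M < L
natural = size.cat.reorder_categories(['S', 'M', 'L'], ordered=True)
print((natural < 'M').tolist())
print(natural.sort_values().tolist())

# Leaving a category out is an error; use set_categories to change the set
try:
    size.cat.reorder_categories(['S', 'M'])
except ValueError as exc:
    print("ValueError:", exc)
```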

Practical Applications

Encoding for Machine Learning

Categorical data is often used in machine learning. Let's see how to convert categories to dummy variables:

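# Create dummy variables for machine learning
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
dummies = pd.get_dummies(df['fruit'], dtype=int)
print(dummies.head())

Output:

   apple  banana  grape  melon
0      1       0      0      0
1      0       0      0      0
2      1       0      0      0
3      0       1      0      0
4      1       0      0      0

Row 1 is all zeros because its 'orange' value became NaN when that category was removed.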
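One practical consequence: `get_dummies` creates a column for every category, even those that do not occur in the data. Sharing one dtype between datasets therefore keeps the columns aligned. The `train` and `test` series below are hypothetical.

```python
import pandas as pd

fruit_dtype = pd.CategoricalDtype(categories=['apple', 'banana', 'orange'])

train = pd.Series(['apple', 'banana', 'orange'], dtype=fruit_dtype)
test = pd.Series(['banana', 'banana'], dtype=fruit_dtype)  # no apples or oranges

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Both frames have the same columns in the same order
print(train_dummies.columns.tolist())
print(test_dummies.columns.tolist())
```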

Sorting Based on Categories

Ordered categories enable meaningful sorting:

# Create ratings data
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good'],
                   dtype=pd.CategoricalDtype(categories=['Poor', 'Good', 'Excellent'], 
                                           ordered=True))

# Create a DataFrame with ratings
reviews = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'rating': ratings
})

# Sort by ratings
sorted_reviews = reviews.sort_values('rating')
print(sorted_reviews)

Output:

    product     rating
1  Widget B       Poor
0  Widget A       Good
3  Widget D       Good
2  Widget C  Excellent
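The difference from plain string sorting is easiest to see with month abbreviations. This sketch sorts the same values alphabetically and then in calendar order; the month list is written out by hand for the example.

```python
import pandas as pd

months = pd.Series(['Mar', 'Jan', 'Dec', 'Feb'])

# Plain strings sort alphabetically
alphabetical = months.sort_values().tolist()
print(alphabetical)

calendar = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
            'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_dtype = pd.CategoricalDtype(categories=calendar, ordered=True)

# An ordered categorical sorts by the position of each label in `calendar`
chronological = months.astype(month_dtype).sort_values().tolist()
print(chronological)
```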

Grouping and Aggregation

Categorical data works well with group-by operations:

# Create sales data
sales = pd.DataFrame({
    'product_category': pd.Series(['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing'], 
                                 dtype='category'),
    'sales_amount': [1200, 800, 1500, 950, 1100]
})

# Group by category and calculate statistics.
# Pass observed explicitly: its default differs between pandas versions.
category_stats = sales.groupby('product_category', observed=True).agg({
    'sales_amount': ['sum', 'mean', 'count']
})
print(category_stats)

Output:

                 sales_amount                
                        sum   mean count
product_category                          
Clothing                1900  950.0     2
Electronics             2700 1350.0     2
Home                     950  950.0     1
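When grouping by a categorical column, the `observed` parameter controls whether categories with no rows appear in the result. The sketch below adds a hypothetical `'Toys'` category and leaves `'Home'` without sales to show the difference.

```python
import pandas as pd

cat_dtype = pd.CategoricalDtype(categories=['Clothing', 'Electronics', 'Home', 'Toys'])
sales = pd.DataFrame({
    'product_category': pd.Series(['Electronics', 'Clothing', 'Electronics'],
                                  dtype=cat_dtype),
    'sales_amount': [1200, 800, 1500],
})

# observed=True: only categories that actually occur in the data
only_observed = sales.groupby('product_category', observed=True)['sales_amount'].sum()
print(only_observed)

# observed=False: every declared category, with a sum of 0 for empty groups
all_categories = sales.groupby('product_category', observed=False)['sales_amount'].sum()
print(all_categories)
```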

Best Practices for Categorical Data

Here are some tips for working effectively with categorical data:

Use categorical type for columns with repeated values - especially for string columns with many duplicates
Specify category order when the categories have a meaningful sequence
Consider memory usage - convert high-cardinality columns (with many unique values) to categorical only if those values are repeated frequently
Leverage categorical columns for faster groupby operations
For machine learning, remember to convert categorical variables appropriately
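The tip about high-cardinality columns can be checked directly. In this sketch, a column of unique IDs gets larger as a categorical (every label is still stored, plus a code per row), while a column with three repeated values shrinks. The column contents are invented for the example.

```python
import pandas as pd

n = 50_000
unique_ids = pd.Series([f"user_{i}" for i in range(n)], dtype=object)
repeated = pd.Series(["red", "green", "blue"] * (n // 3), dtype=object)

def size_bytes(series):
    """Total memory of a series, including the string contents."""
    return series.memory_usage(deep=True)

ids_obj, ids_cat = size_bytes(unique_ids), size_bytes(unique_ids.astype('category'))
rep_obj, rep_cat = size_bytes(repeated), size_bytes(repeated.astype('category'))

print(f"unique IDs: object={ids_obj:,} bytes, category={ids_cat:,} bytes")
print(f"repeated:   object={rep_obj:,} bytes, category={rep_cat:,} bytes")
```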

Summary

Pandas' Categorical data type offers a powerful way to work with limited-value variables, providing:

Memory efficiency by representing repeated values as integers
Semantic meaning through explicit labeling of categories
Ordered operations for comparison and sorting
Performance benefits for aggregations and other operations

When working with datasets containing repeated text values or a limited set of options, consider using the categorical data type to make your pandas code more efficient and your intentions more explicit.

Exercises

Create a categorical series for days of the week that preserves their natural order.
Convert a DataFrame column containing country names to a categorical type and check the memory savings.
Create an ordered categorical for T-shirt sizes (XS, S, M, L, XL) and demonstrate comparison operations.
Practice adding, removing, and reordering categories in a categorical series.
Create a DataFrame with multiple categorical columns and experiment with groupby operations.
