Pandas Categorical Encoding
Introduction
Categorical data is ubiquitous in real-world datasets: gender, country, color, product categories, education levels, and so on. However, most machine learning algorithms require numerical input. This is where categorical encoding comes in: the process of converting categorical variables into a numerical format that algorithms can work with.
In this tutorial, we'll explore different methods in pandas to encode categorical data and when to use each approach. We'll cover:
- Understanding categorical data
- Label encoding
- One-hot encoding
- Ordinal encoding
- Binary encoding
- Other advanced encoding techniques
Understanding Categorical Data
Before diving into encoding techniques, let's understand the types of categorical data:
- Nominal categories: Categories with no inherent order (e.g., colors, countries)
- Ordinal categories: Categories with a meaningful order (e.g., education levels, rating scales)
Let's create a simple dataset to work with:
import pandas as pd
import numpy as np
# Sample data
data = {
'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Large', 'Medium', 'Small', 'Large', 'Medium'],
'Rating': [4, 3, 5, 2, 3],
'Price': [1200, 800, 500, 1300, 750]
}
df = pd.DataFrame(data)
print(df)
Output:
Product Color Size Rating Price
0 Laptop Red Large 4 1200
1 Phone Blue Medium 3 800
2 Tablet Green Small 5 500
3 Laptop Blue Large 2 1300
4 Phone Red Medium 3 750
Label Encoding
Label encoding replaces each category with a unique integer. This is the simplest form of encoding.
Using pandas factorize()
# Label encoding with factorize()
df['Product_encoded'], _ = pd.factorize(df['Product'])
df['Color_encoded'], _ = pd.factorize(df['Color'])
print(df[['Product', 'Product_encoded', 'Color', 'Color_encoded']])
Output:
Product Product_encoded Color Color_encoded
0 Laptop 0 Red 0
1 Phone 1 Blue 1
2 Tablet 2 Green 2
3 Laptop 0 Blue 1
4 Phone 1 Red 0
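pandas' categorical dtype offers a similar route via .cat.codes. One difference worth knowing: factorize() numbers categories in order of first appearance, while .cat.codes follows the sorted category order, so Color becomes Blue=0, Green=1, Red=2 here:
# Alternative: codes from the categorical dtype follow sorted category order
df['Color_cat_codes'] = df['Color'].astype('category').cat.codes
print(df[['Color', 'Color_encoded', 'Color_cat_codes']])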
Using scikit-learn
scikit-learn's LabelEncoder learns its mapping at fit time, so the same encoder can later be applied to test data. (Note that scikit-learn intends LabelEncoder for target labels; for feature columns, the OrdinalEncoder shown later is the feature-oriented counterpart.)
from sklearn.preprocessing import LabelEncoder
# Initialize the encoder
le = LabelEncoder()
# Fit and transform the 'Size' column
df['Size_encoded'] = le.fit_transform(df['Size'])
print(df[['Size', 'Size_encoded']])
Output:
Size Size_encoded
0 Large 0
1 Medium 1
2 Small 2
3 Large 0
4 Medium 1
Note that LabelEncoder assigns codes in alphabetical order of the labels (Large=0, Medium=1, Small=2), not in order of appearance.
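Because the fitted encoder stores its mapping (in le.classes_), you can decode integer codes back into labels, which is handy when interpreting model output:
# Decode integer codes back to the original labels
print(le.inverse_transform([0, 1, 2]))  # ['Large' 'Medium' 'Small']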
When to use Label Encoding:
- When the categorical variable is ordinal (has a meaningful order)
- When the algorithm can handle numerical relationships (like decision trees)
- When you have high-cardinality features (many unique categories)
Limitations:
- Creates a false sense of ordering in nominal categories
- The model might interpret higher numbers as "more important"
One-Hot Encoding
One-hot encoding creates binary columns for each category, which avoids the ordinal relationship issue in label encoding.
Using pandas get_dummies()
# One-hot encoding with get_dummies()
color_dummies = pd.get_dummies(df['Color'], prefix='Color')
print(color_dummies)
# Adding the encoded columns to the original dataframe
df_with_dummies = pd.concat([df, color_dummies], axis=1)
print(df_with_dummies.head())
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
Product Color Size Rating Price Color_Blue Color_Green Color_Red
0 Laptop Red Large 4 1200 0 0 1
1 Phone Blue Medium 3 800 1 0 0
2 Tablet Green Small 5 500 0 1 0
3 Laptop Blue Large 2 1300 1 0 0
4 Phone Red Medium 3 750 0 0 1
One-Hot Encoding Multiple Columns
# One-hot encoding multiple columns at once
df_encoded = pd.get_dummies(df, columns=['Product', 'Color', 'Size'])
print(df_encoded.head())
Output (abbreviated):
Rating Price Product_Laptop Product_Phone Product_Tablet Color_Blue ...
0 4 1200 1 0 0 0 ...
1 3 800 0 1 0 1 ...
...
When to use One-Hot Encoding:
- For nominal categorical features without ordinal relationships
- When the algorithm doesn't natively handle categorical data (like linear regression, SVM)
- When the number of categories is relatively small
Limitations:
- Creates many columns for high-cardinality features
- Can lead to the "curse of dimensionality"
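For linear models, a common refinement is to drop one dummy column per feature: the dropped category is implied whenever all remaining dummies are zero, which avoids the perfect multicollinearity known as the dummy variable trap:
# drop_first=True drops the first (alphabetical) category, here Blue;
# a row with Color_Green=0 and Color_Red=0 therefore means Blue
color_dummies_reduced = pd.get_dummies(df['Color'], prefix='Color', drop_first=True)
print(color_dummies_reduced)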
Ordinal Encoding
Ordinal encoding is appropriate when categories have a meaningful order.
# Define an ordering for the 'Size' column
size_ordering = {'Small': 1, 'Medium': 2, 'Large': 3}
# Apply the mapping
df['Size_ordinal'] = df['Size'].map(size_ordering)
print(df[['Size', 'Size_ordinal']])
Output:
Size Size_ordinal
0 Large 3
1 Medium 2
2 Small 1
3 Large 3
4 Medium 2
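pandas can also store the ordering in the dtype itself with an ordered categorical; its codes then respect the order you specify rather than alphabetical order:
# An ordered categorical carries the ordering in the dtype
df['Size_cat'] = pd.Categorical(df['Size'],
                                categories=['Small', 'Medium', 'Large'],
                                ordered=True)
print(df['Size_cat'].cat.codes)  # Small=0, Medium=1, Large=2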
Using scikit-learn's OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
# Define the categories in order
size_categories = [['Small', 'Medium', 'Large']] # Note the nested list
# Initialize the encoder
ord_encoder = OrdinalEncoder(categories=size_categories)
# OrdinalEncoder expects 2D input, so select the column with double brackets
df['Size_ordinal_sklearn'] = ord_encoder.fit_transform(df[['Size']])
print(df[['Size', 'Size_ordinal_sklearn']])
Output:
Size Size_ordinal_sklearn
0 Large 2.0
1 Medium 1.0
2 Small 0.0
3 Large 2.0
4 Medium 1.0
When to use Ordinal Encoding:
- When the categorical variable has a clear, meaningful order
- When the ordering relationship matters for the analysis or model
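One practical caveat: by default, a fitted OrdinalEncoder raises an error when it meets a category it never saw during fit. If unseen values are possible at prediction time, you can reserve a sentinel code for them (supported in scikit-learn 0.24 and later):
# Reserve -1 for categories not seen during fit
safe_encoder = OrdinalEncoder(categories=size_categories,
                              handle_unknown='use_encoded_value',
                              unknown_value=-1)
safe_encoder.fit(df[['Size']])
print(safe_encoder.transform(pd.DataFrame({'Size': ['Medium', 'Gigantic']})))
# [[ 1.]
#  [-1.]]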
Binary Encoding
Binary encoding is a more space-efficient alternative to one-hot encoding for high-cardinality features. It first assigns each category an integer and then spreads that integer's binary representation across columns, one column per bit.
# We'll need category-encoders package
# !pip install category-encoders
import category_encoders as ce
# Initialize the encoder
binary_encoder = ce.BinaryEncoder(cols=['Product'])
# Apply binary encoding (the original 'Product' column is replaced
# by the bit columns Product_0 and Product_1)
df_binary = binary_encoder.fit_transform(df)
print(df_binary[['Product_0', 'Product_1']].head())
Output:
Product_0 Product_1
0 0 1
1 1 0
2 1 1
3 0 1
4 1 0
BinaryEncoder numbers categories from 1 in order of appearance (Laptop=1, Phone=2, Tablet=3) and writes each number in binary, so three categories fit in two columns.
When to use Binary Encoding:
- For high-cardinality nominal features
- When you want a balance between one-hot encoding and label encoding
- To reduce dimensionality compared to one-hot encoding
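To get a feel for the space savings, compare column counts: one-hot encoding needs one column per category, while binary encoding needs roughly the base-2 logarithm of that (category-encoders numbers categories from 1, hence the +1 below):
import math
# Columns required by each scheme as cardinality grows
for n in [10, 100, 1000]:
    print(f"{n} categories: one-hot -> {n} cols, binary -> {math.ceil(math.log2(n + 1))} cols")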
Practical Examples
Example 1: Preparing Data for a Machine Learning Model
Let's prepare a dataset for a machine learning model to predict prices:
# More realistic dataset
housing_data = {
'neighborhood': ['Downtown', 'Suburb', 'Rural', 'Downtown', 'Suburb', 'Rural'],
'house_type': ['Apartment', 'House', 'House', 'Apartment', 'House', 'House'],
'size_category': ['Small', 'Medium', 'Large', 'Small', 'Large', 'Medium'],
'price': [250000, 320000, 180000, 230000, 380000, 210000]
}
housing_df = pd.DataFrame(housing_data)
print("Original dataset:")
print(housing_df)
# Encoding for machine learning
# One-hot encode neighborhood and house_type (nominal)
nominal_encoded = pd.get_dummies(housing_df, columns=['neighborhood', 'house_type'])
# Ordinal encode size_category
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
nominal_encoded['size_encoded'] = nominal_encoded['size_category'].map(size_order)
# Final dataframe ready for modeling
ml_ready_df = nominal_encoded.drop('size_category', axis=1)
print("\nDataset ready for modeling:")
print(ml_ready_df.head())
Output:
Original dataset:
neighborhood house_type size_category price
0 Downtown Apartment Small 250000
1 Suburb House Medium 320000
2 Rural House Large 180000
3 Downtown Apartment Small 230000
4 Suburb House Large 380000
5 Rural House Medium 210000
Dataset ready for modeling:
price neighborhood_Downtown neighborhood_Rural neighborhood_Suburb \
0 250000 1 0 0
1 320000 0 0 1
2 180000 0 1 0
3 230000 1 0 0
4 380000 0 0 1
5 210000 0 1 0
house_type_Apartment house_type_House size_encoded
0 1 0 1
1 0 1 2
2 0 1 3
3 1 0 1
4 0 1 3
5 0 1 2
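As a quick sanity check that the encoded frame really is model-ready, here is a minimal sketch fitting a scikit-learn regressor on it (illustrative only; six rows are far too few for a meaningful model):
from sklearn.linear_model import LinearRegression
# Fit on the fully numeric frame; this would fail on the raw string columns
X = ml_ready_df.drop('price', axis=1)
y = ml_ready_df['price']
model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 on the training data itself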
Example 2: Handling Missing Values in Categorical Data
# Dataset with missing values
missing_data = {
'product': ['Laptop', 'Phone', None, 'Tablet', 'Phone'],
'category': ['Electronics', 'Electronics', 'Clothing', None, 'Electronics'],
'price': [1200, 800, 50, 500, 750]
}
missing_df = pd.DataFrame(missing_data)
print("Dataset with missing values:")
print(missing_df)
# Fill missing values before encoding (avoid chained inplace fillna,
# which is deprecated in recent pandas)
missing_df['product'] = missing_df['product'].fillna('Unknown')
missing_df['category'] = missing_df['category'].fillna('Unknown')
# Encode with get_dummies; since the NaNs are already filled, the default
# dummy_na=False is fine (dummy_na=True would instead add a NaN indicator column)
encoded_missing = pd.get_dummies(missing_df, columns=['product', 'category'], dummy_na=False)
print("\nEncoded dataset:")
print(encoded_missing)
Output:
Dataset with missing values:
product category price
0 Laptop Electronics 1200
1 Phone Electronics 800
2 None Clothing 50
3 Tablet None 500
4 Phone Electronics 750
Encoded dataset:
price product_Laptop product_Phone product_Tablet product_Unknown \
0 1200 1 0 0 0
1 800 0 1 0 0
2 50 0 0 0 1
3 500 0 0 1 0
4 750 0 1 0 0
category_Clothing category_Electronics category_Unknown
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
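Alternatively, you can skip the fillna step and let get_dummies flag the gaps itself: dummy_na=True adds an indicator column for NaN (e.g. product_nan) to each encoded feature:
# Without pre-filling, dummy_na=True adds a NaN indicator column per feature
raw_df = pd.DataFrame(missing_data)
encoded_with_na = pd.get_dummies(raw_df, columns=['product', 'category'], dummy_na=True)
print(encoded_with_na.columns.tolist())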
Example 3: Encoding for Time Series Data
# Time series data with day of week
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
ts_data = {
'date': dates,
'value': np.random.randn(10) * 10 + 100
}
ts_df = pd.DataFrame(ts_data)
# Extract day of week
ts_df['day_of_week'] = ts_df['date'].dt.day_name()
print("Time series data:")
print(ts_df)
# Cyclical encoding for day of week (especially useful for time features)
# Map days to numbers 0-6
days_mapping = {
'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
ts_df['day_num'] = ts_df['day_of_week'].map(days_mapping)
# Create cyclical features using sine and cosine transformations
ts_df['day_sin'] = np.sin(ts_df['day_num'] * (2 * np.pi / 7))
ts_df['day_cos'] = np.cos(ts_df['day_num'] * (2 * np.pi / 7))
print("\nTime series with cyclical encoding:")
print(ts_df[['date', 'day_of_week', 'day_num', 'day_sin', 'day_cos']])
Output (the value column will vary, since the data is unseeded random noise):
Time series data:
date value day_of_week
0 2023-01-01 101.356352 Sunday
1 2023-01-02 109.154123 Monday
...
Time series with cyclical encoding:
date day_of_week day_num day_sin day_cos
0 2023-01-01 Sunday 6 -0.781831 0.623490
1 2023-01-02 Monday 0 0.000000 1.000000
...
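The payoff of the sine/cosine pair is that adjacency survives the week boundary: Sunday and Monday end up as close together as any other pair of neighboring days, whereas their raw day numbers (6 and 0) look maximally far apart:
# Distance in (sin, cos) space treats the week as a circle
sun = np.array([np.sin(6 * 2 * np.pi / 7), np.cos(6 * 2 * np.pi / 7)])
mon = np.array([np.sin(0.0), np.cos(0.0)])
tue = np.array([np.sin(1 * 2 * np.pi / 7), np.cos(1 * 2 * np.pi / 7)])
print(np.linalg.norm(sun - mon))  # ~0.868
print(np.linalg.norm(mon - tue))  # ~0.868, the same gap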
Choosing the Right Encoding Method
Choosing the right encoding method is crucial for your analysis. Here's a quick guide:
| Encoding Method | Good For | Limitations |
|---|---|---|
| Label Encoding | Ordinal variables, tree-based algorithms | Creates false ordering for nominal variables |
| One-Hot Encoding | Nominal variables, linear models | Increases dimensionality, problematic with high cardinality |
| Ordinal Encoding | Variables with clear ordering | Requires domain knowledge to establish order |
| Binary Encoding | High-cardinality features | More complex than simpler methods |
| Cyclical Encoding | Cyclical features (time, directions) | Only applicable to special cases |
Summary
Categorical encoding is a fundamental step in data preprocessing. In this tutorial, we've covered:
- Label encoding for converting categories to integers
- One-hot encoding for creating binary columns
- Ordinal encoding for ordered categorical variables
- Binary encoding for efficient representation of high-cardinality features
- Practical examples showing how to apply these techniques
Remember that the choice of encoding method should depend on:
- The nature of the categorical variable (nominal vs. ordinal)
- The machine learning algorithm you're using
- The cardinality (number of unique values) of the feature
Exercises
- Basic Encoding: Create a dataset with at least 3 categorical features and apply label encoding to all of them.
- Mixed Encoding: Create a dataset with both nominal and ordinal features. Apply the appropriate encoding method to each.
- Advanced Challenge: Take a public dataset (like Titanic from Kaggle) and prepare all categorical features using the most appropriate encoding techniques.
- Performance Comparison: Compare the performance of a machine learning model (like Random Forest) on the same dataset using different encoding techniques.
- Custom Encoder: Implement a frequency encoder that replaces categories with their frequency in the dataset.
Happy encoding!