
Pandas Categorical Encoding

Introduction

Categorical data is ubiquitous in real-world datasets: gender, country, color, product categories, education levels, and so on. However, most machine learning algorithms require numerical input. This is where categorical encoding comes in - the process of converting categorical variables into a numerical format that algorithms can work with.

In this tutorial, we'll explore different methods in pandas to encode categorical data and when to use each approach. We'll cover:

  • Understanding categorical data
  • Label encoding
  • One-hot encoding
  • Ordinal encoding
  • Binary encoding
  • Other advanced encoding techniques

Understanding Categorical Data

Before diving into encoding techniques, let's understand the types of categorical data:

  1. Nominal categories: Categories with no inherent order (e.g., colors, countries)
  2. Ordinal categories: Categories with a meaningful order (e.g., education levels, rating scales)
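
Pandas can make this distinction explicit through its Categorical dtype. As a quick illustration (a minimal sketch using made-up education levels), an ordered categorical supports meaningful comparisons, while an unordered one does not:

python
import pandas as pd

# An ordered categorical supports meaningful comparisons
education = pd.Categorical(
    ['Bachelor', 'PhD', 'HighSchool', 'Master'],
    categories=['HighSchool', 'Bachelor', 'Master', 'PhD'],
    ordered=True
)
print(education.min())       # HighSchool
print(education < 'Master')  # [ True False  True False]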

Let's create a simple dataset to work with:

python
import pandas as pd
import numpy as np

# Sample data
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Large', 'Medium', 'Small', 'Large', 'Medium'],
    'Rating': [4, 3, 5, 2, 3],
    'Price': [1200, 800, 500, 1300, 750]
}

df = pd.DataFrame(data)
print(df)

Output:

  Product  Color    Size  Rating  Price
0  Laptop    Red   Large       4   1200
1   Phone   Blue  Medium       3    800
2  Tablet  Green   Small       5    500
3  Laptop   Blue   Large       2   1300
4   Phone    Red  Medium       3    750

Label Encoding

Label encoding replaces each category with a unique integer. This is the simplest form of encoding.

Using pandas factorize()

python
# Label encoding with factorize()
df['Product_encoded'], _ = pd.factorize(df['Product'])
df['Color_encoded'], _ = pd.factorize(df['Color'])
print(df[['Product', 'Product_encoded', 'Color', 'Color_encoded']])

Output:

  Product  Product_encoded  Color  Color_encoded
0  Laptop                0    Red              0
1   Phone                1   Blue              1
2  Tablet                2  Green              2
3  Laptop                0   Blue              1
4   Phone                1    Red              0
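
The second value returned by factorize() (discarded as _ above) is the array of unique categories, which lets you map the codes back to the original labels:

python
# factorize() also returns the unique categories; uniques[codes] inverts the encoding
codes, uniques = pd.factorize(df['Product'])
print(uniques)         # Index(['Laptop', 'Phone', 'Tablet'], dtype='object')
print(uniques[codes])  # back to the original labels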

Using scikit-learn

scikit-learn's LabelEncoder stores the fitted mapping, so the same encoding can be applied consistently to test data. (Strictly speaking, scikit-learn intends LabelEncoder for target labels; for feature columns it recommends OrdinalEncoder, covered below.)

python
from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
le = LabelEncoder()

# Fit and transform the 'Size' column
df['Size_encoded'] = le.fit_transform(df['Size'])
print(df[['Size', 'Size_encoded']])

Output:

     Size  Size_encoded
0   Large             0
1  Medium             1
2   Small             2
3   Large             0
4  Medium             1
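
Note that LabelEncoder assigns codes in sorted order of the classes (Large=0, Medium=1, Small=2). Because the fitted encoder stores this mapping, it can be reused on new data and inverted, which is its main advantage over a one-off factorize(). A minimal sketch (the new_sizes list is hypothetical):

python
new_sizes = ['Small', 'Large', 'Large']  # hypothetical unseen data
print(le.transform(new_sizes))           # [2 0 0]
print(le.inverse_transform([0, 1, 2]))   # ['Large' 'Medium' 'Small']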

When to use Label Encoding:

  • When the categorical variable is ordinal (has a meaningful order)
  • When the algorithm is insensitive to the arbitrary numeric order (e.g., tree-based models such as decision trees and random forests)
  • When you have high-cardinality features (many unique categories)

Limitations:

  • Creates a false sense of ordering in nominal categories
  • The model might interpret higher numbers as "more important"

One-Hot Encoding

One-hot encoding creates binary columns for each category, which avoids the ordinal relationship issue in label encoding.

Using pandas get_dummies()

python
# One-hot encoding with get_dummies()
color_dummies = pd.get_dummies(df['Color'], prefix='Color')
print(color_dummies)

# Adding the encoded columns to the original dataframe
df_with_dummies = pd.concat([df, color_dummies], axis=1)
print(df_with_dummies.head())

Output:

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1

  Product  Color    Size  Rating  Price  Color_Blue  Color_Green  Color_Red
0  Laptop    Red   Large       4   1200           0            0          1
1   Phone   Blue  Medium       3    800           1            0          0
2  Tablet  Green   Small       5    500           0            1          0
3  Laptop   Blue   Large       2   1300           1            0          0
4   Phone    Red  Medium       3    750           0            0          1

One-Hot Encoding Multiple Columns

python
# One-hot encoding multiple columns at once
df_encoded = pd.get_dummies(df, columns=['Product', 'Color', 'Size'])
print(df_encoded.head())

Output (abbreviated):

   Rating  Price  Product_Laptop  Product_Phone  Product_Tablet  Color_Blue  ...
0       4   1200               1              0               0           0  ...
1       3    800               0              1               0           1  ...
...
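
One version-dependent detail: since pandas 2.0, get_dummies() returns boolean (True/False) columns by default. To get the 0/1 integer columns shown above, pass dtype=int:

python
# Request integer 0/1 dummies instead of the default boolean columns
df_encoded = pd.get_dummies(df, columns=['Product', 'Color', 'Size'], dtype=int)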

When to use One-Hot Encoding:

  • For nominal categorical features without ordinal relationships
  • When the algorithm doesn't natively handle categorical data (e.g., linear regression, SVM)
  • When the number of categories is relatively small

Limitations:

  • Creates many columns for high-cardinality features
  • Can lead to the "curse of dimensionality"
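
One way to trim the extra columns is get_dummies()'s drop_first option, which drops the first level of each encoded feature: a category is then implied when all of its remaining dummies are 0. This also removes the perfectly collinear column that can trouble linear models. A quick sketch on our Color feature:

python
# k categories -> k-1 dummy columns; 'Blue' is implied when both dummies are 0
df_compact = pd.get_dummies(df, columns=['Color'], drop_first=True)
print(df_compact[['Color_Green', 'Color_Red']].head())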

Ordinal Encoding

Ordinal encoding is appropriate when categories have a meaningful order.

python
# Define an ordering for the 'Size' column
size_ordering = {'Small': 1, 'Medium': 2, 'Large': 3}

# Apply the mapping
df['Size_ordinal'] = df['Size'].map(size_ordering)
print(df[['Size', 'Size_ordinal']])

Output:

     Size  Size_ordinal
0   Large             3
1  Medium             2
2   Small             1
3   Large             3
4  Medium             2

Using scikit-learn's OrdinalEncoder:

python
from sklearn.preprocessing import OrdinalEncoder

# Define the categories in order
size_categories = [['Small', 'Medium', 'Large']] # Note the nested list

# Initialize the encoder
ord_encoder = OrdinalEncoder(categories=size_categories)

# Select the column with double brackets: scikit-learn expects 2-D input
df['Size_ordinal_sklearn'] = ord_encoder.fit_transform(df[['Size']])
print(df[['Size', 'Size_ordinal_sklearn']])

Output:

     Size  Size_ordinal_sklearn
0   Large                   2.0
1  Medium                   1.0
2   Small                   0.0
3   Large                   2.0
4  Medium                   1.0

When to use Ordinal Encoding:

  • When the categorical variable has a clear, meaningful order
  • When the ordering relationship matters for the analysis or model
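
Pandas can also do ordinal encoding on its own: declare the column as an ordered categorical and read off .cat.codes. A minimal sketch equivalent to the mapping above (codes start at 0 rather than 1):

python
# Ordinal encoding via pandas' ordered categorical dtype
df['Size_cat'] = pd.Categorical(df['Size'],
                                categories=['Small', 'Medium', 'Large'],
                                ordered=True)
df['Size_codes'] = df['Size_cat'].cat.codes  # Small=0, Medium=1, Large=2
print(df[['Size', 'Size_codes']])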

Binary Encoding

Binary encoding is a more space-efficient alternative to one-hot encoding for high-cardinality features. It first label-encodes each category as an integer and then writes that integer in binary, one column per bit, so k categories need only about log2(k) columns instead of k.

python
# We'll need the category-encoders package
# !pip install category-encoders

import category_encoders as ce

# Initialize the encoder
binary_encoder = ce.BinaryEncoder(cols=['Product'])

# Apply binary encoding (the encoder replaces 'Product' with its bit columns)
df_binary = binary_encoder.fit_transform(df)
df_binary.insert(0, 'Product', df['Product'])  # keep the original for comparison
print(df_binary[['Product', 'Product_0', 'Product_1']].head())

Output:

  Product  Product_0  Product_1
0  Laptop          0          1
1   Phone          1          0
2  Tablet          1          1
3  Laptop          0          1
4   Phone          1          0

When to use Binary Encoding:

  • For high-cardinality nominal features
  • When you want a balance between one-hot encoding and label encoding
  • To reduce dimensionality compared to one-hot encoding
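
If you'd rather avoid the extra dependency, the same idea can be sketched in plain pandas/NumPy: factorize the column, then unpack each integer code into bit columns. A rough equivalent of the encoder above (assuming the same start-at-1, most-significant-bit-first convention):

python
# Hand-rolled binary encoding: factorize, then split the codes into bits
codes, uniques = pd.factorize(df['Product'])
codes = codes + 1  # start ordinals at 1, mirroring category_encoders
n_bits = int(np.ceil(np.log2(len(uniques) + 1)))
for bit in range(n_bits):
    # Product_bit0 is the most significant bit
    df[f'Product_bit{bit}'] = (codes >> (n_bits - 1 - bit)) & 1
print(df[['Product'] + [f'Product_bit{b}' for b in range(n_bits)]])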

Practical Examples

Example 1: Preparing Data for a Machine Learning Model

Let's prepare a dataset for a machine learning model to predict prices:

python
# More realistic dataset
housing_data = {
    'neighborhood': ['Downtown', 'Suburb', 'Rural', 'Downtown', 'Suburb', 'Rural'],
    'house_type': ['Apartment', 'House', 'House', 'Apartment', 'House', 'House'],
    'size_category': ['Small', 'Medium', 'Large', 'Small', 'Large', 'Medium'],
    'price': [250000, 320000, 180000, 230000, 380000, 210000]
}

housing_df = pd.DataFrame(housing_data)
print("Original dataset:")
print(housing_df)

# Encoding for machine learning
# One-hot encode neighborhood and house_type (nominal)
nominal_encoded = pd.get_dummies(housing_df, columns=['neighborhood', 'house_type'])

# Ordinal encode size_category
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
nominal_encoded['size_encoded'] = nominal_encoded['size_category'].map(size_order)

# Final dataframe ready for modeling
ml_ready_df = nominal_encoded.drop('size_category', axis=1)
print("\nDataset ready for modeling:")
print(ml_ready_df.head())

Output:

Original dataset:
  neighborhood house_type size_category   price
0     Downtown  Apartment         Small  250000
1       Suburb      House        Medium  320000
2        Rural      House         Large  180000
3     Downtown  Apartment         Small  230000
4       Suburb      House         Large  380000
5        Rural      House        Medium  210000

Dataset ready for modeling:
    price  neighborhood_Downtown  neighborhood_Rural  neighborhood_Suburb  \
0  250000                      1                   0                    0
1  320000                      0                   0                    1
2  180000                      0                   1                    0
3  230000                      1                   0                    0
4  380000                      0                   0                    1
5  210000                      0                   1                    0

   house_type_Apartment  house_type_House  size_encoded
0                     1                 0             1
1                     0                 1             2
2                     0                 1             3
3                     1                 0             1
4                     0                 1             3
5                     0                 1             2
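
With every feature numeric, the frame can go straight into a model. A minimal sketch using scikit-learn's LinearRegression (any regressor would do; the point is just the hand-off):

python
from sklearn.linear_model import LinearRegression

X = ml_ready_df.drop('price', axis=1)  # encoded features
y = ml_ready_df['price']               # target

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # in-sample R^2, just to confirm the pipeline runs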

Example 2: Handling Missing Values in Categorical Data

python
# Dataset with missing values
missing_data = {
    'product': ['Laptop', 'Phone', None, 'Tablet', 'Phone'],
    'category': ['Electronics', 'Electronics', 'Clothing', None, 'Electronics'],
    'price': [1200, 800, 50, 500, 750]
}

missing_df = pd.DataFrame(missing_data)
print("Dataset with missing values:")
print(missing_df)

# Fill missing values before encoding
missing_df['product'] = missing_df['product'].fillna('Unknown')
missing_df['category'] = missing_df['category'].fillna('Unknown')

# Encode with get_dummies; we already filled the NaNs, so dummy_na=False is fine
# (dummy_na=True would instead add an indicator column for missing values)
encoded_missing = pd.get_dummies(missing_df, columns=['product', 'category'], dummy_na=False)
print("\nEncoded dataset:")
print(encoded_missing)

Output:

Dataset with missing values:
  product     category  price
0  Laptop  Electronics   1200
1   Phone  Electronics    800
2    None     Clothing     50
3  Tablet         None    500
4   Phone  Electronics    750

Encoded dataset:
   price  product_Laptop  product_Phone  product_Tablet  product_Unknown  \
0   1200               1              0               0                0
1    800               0              1               0                0
2     50               0              0               0                1
3    500               0              0               1                0
4    750               0              1               0                0

   category_Clothing  category_Electronics  category_Unknown
0                  0                     1                 0
1                  0                     1                 0
2                  1                     0                 0
3                  0                     0                 1
4                  0                     1                 0
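
Alternatively, you can skip the fillna() step entirely and let get_dummies() flag missing values itself: dummy_na=True adds a NaN indicator column for each encoded feature.

python
# Without pre-filling: dummy_na=True adds product_nan / category_nan columns
raw_df = pd.DataFrame(missing_data)  # rebuild the frame with its original NaNs
encoded_na = pd.get_dummies(raw_df, columns=['product', 'category'], dummy_na=True)
print(encoded_na.columns.tolist())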

Example 3: Encoding for Time Series Data

python
# Time series data with day of week
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
ts_data = {
    'date': dates,
    'value': np.random.randn(10) * 10 + 100
}

ts_df = pd.DataFrame(ts_data)

# Extract day of week
ts_df['day_of_week'] = ts_df['date'].dt.day_name()
print("Time series data:")
print(ts_df)

# Cyclical encoding for day of week (especially useful for time features)
# Map days to numbers 0-6
days_mapping = {
    'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
    'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
ts_df['day_num'] = ts_df['day_of_week'].map(days_mapping)

# Create cyclical features using sine and cosine transformations
ts_df['day_sin'] = np.sin(ts_df['day_num'] * (2 * np.pi / 7))
ts_df['day_cos'] = np.cos(ts_df['day_num'] * (2 * np.pi / 7))

print("\nTime series with cyclical encoding:")
print(ts_df[['date', 'day_of_week', 'day_num', 'day_sin', 'day_cos']])

Output (the value column will differ from run to run, since no random seed is set):

Time series data:
        date       value day_of_week
0 2023-01-01  101.356352      Sunday
1 2023-01-02  109.154123      Monday
...

Time series with cyclical encoding:
        date day_of_week  day_num   day_sin   day_cos
0 2023-01-01      Sunday        6 -0.781831  0.623490
1 2023-01-02      Monday        0  0.000000  1.000000
...
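
The point of the sine/cosine pair is that it preserves cyclical adjacency: on the raw day_num scale, Sunday (6) and Monday (0) look maximally far apart, but in (sin, cos) space every pair of consecutive days is equally distant. A quick check:

python
# Consecutive days are equidistant in (sin, cos) space, including the Sunday -> Monday wrap
def day_point(d):
    angle = d * (2 * np.pi / 7)
    return np.array([np.sin(angle), np.cos(angle)])

print(np.linalg.norm(day_point(0) - day_point(1)))  # Monday -> Tuesday: ~0.868
print(np.linalg.norm(day_point(6) - day_point(0)))  # Sunday -> Monday: same ~0.868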

Choosing the Right Encoding Method

Choosing the right encoding method is crucial for your analysis. Here's a quick guide:

| Encoding Method   | Good For                                 | Limitations                                                  |
| ----------------- | ---------------------------------------- | ------------------------------------------------------------ |
| Label Encoding    | Ordinal variables, tree-based algorithms | Creates false ordering for nominal variables                 |
| One-Hot Encoding  | Nominal variables, linear models         | Increases dimensionality, problematic with high cardinality  |
| Ordinal Encoding  | Variables with clear ordering            | Requires domain knowledge to establish order                 |
| Binary Encoding   | High-cardinality features                | More complex than simpler methods                            |
| Cyclical Encoding | Cyclical features (time, directions)     | Only applicable to special cases                             |

Summary

Categorical encoding is a fundamental step in data preprocessing. In this tutorial, we've covered:

  • Label encoding for converting categories to integers
  • One-hot encoding for creating binary columns
  • Ordinal encoding for ordered categorical variables
  • Binary encoding for efficient representation of high-cardinality features
  • Practical examples showing how to apply these techniques

Remember that the choice of encoding method should depend on:

  1. The nature of the categorical variable (nominal vs. ordinal)
  2. The machine learning algorithm you're using
  3. The cardinality (number of unique values) of the feature

Exercises

  1. Basic Encoding: Create a dataset with at least 3 categorical features and apply label encoding to all of them.

  2. Mixed Encoding: Create a dataset with both nominal and ordinal features. Apply the appropriate encoding method to each.

  3. Advanced Challenge: Take a public dataset (like Titanic from Kaggle) and prepare all categorical features using the most appropriate encoding techniques.

  4. Performance Comparison: Compare the performance of a machine learning model (like Random Forest) on the same dataset using different encoding techniques.

  5. Custom Encoder: Implement a frequency encoder that replaces categories with their frequency in the dataset.

Happy encoding!


