Pandas Categorical Encoding
Introduction
Categorical data is ubiquitous in real-world datasets: gender, country, color, product categories, education levels, and so on. However, most machine learning algorithms require numerical input. This is where categorical encoding comes in: the process of converting categorical variables into a numerical format that algorithms can work with.
In this tutorial, we'll explore different methods in pandas to encode categorical data and when to use each approach. We'll cover:
- Understanding categorical data
- Label encoding
- One-hot encoding
- Ordinal encoding
- Binary encoding
- Other advanced encoding techniques
Understanding Categorical Data
Before diving into encoding techniques, let's understand the types of categorical data:
- Nominal categories: Categories with no inherent order (e.g., colors, countries)
- Ordinal categories: Categories with a meaningful order (e.g., education levels, rating scales)
Let's create a simple dataset to work with:
import pandas as pd
import numpy as np
# Sample data
data = {
'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Large', 'Medium', 'Small', 'Large', 'Medium'],
'Rating': [4, 3, 5, 2, 3],
'Price': [1200, 800, 500, 1300, 750]
}
df = pd.DataFrame(data)
print(df)
Output:
Product Color Size Rating Price
0 Laptop Red Large 4 1200
1 Phone Blue Medium 3 800
2 Tablet Green Small 5 500
3 Laptop Blue Large 2 1300
4 Phone Red Medium 3 750
Label Encoding
Label encoding replaces each category with a unique integer. This is the simplest form of encoding.
Using pandas factorize()
# Label encoding with factorize()
df['Product_encoded'], _ = pd.factorize(df['Product'])
df['Color_encoded'], _ = pd.factorize(df['Color'])
print(df[['Product', 'Product_encoded', 'Color', 'Color_encoded']])
Output:
Product Product_encoded Color Color_encoded
0 Laptop 0 Red 0
1 Phone 1 Blue 1
2 Tablet 2 Green 2
3 Laptop 0 Blue 1
4 Phone 1 Red 0
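pandas' categorical dtype offers a similar route via .cat.codes. One difference worth knowing: factorize() numbers categories in order of first appearance, while .cat.codes follows the sorted category order, so Color becomes Blue=0, Green=1, Red=2 here:
# Alternative: codes from the categorical dtype follow sorted category order
df['Color_cat_codes'] = df['Color'].astype('category').cat.codes
print(df[['Color', 'Color_encoded', 'Color_cat_codes']])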
Using scikit-learn
scikit-learn's LabelEncoder learns its mapping at fit time, so the same encoder can later be applied to test data. (Note that scikit-learn intends LabelEncoder for target labels; for feature columns, the OrdinalEncoder shown later is the feature-oriented counterpart.)
from sklearn.preprocessing import LabelEncoder
# Initialize the encoder
le = LabelEncoder()
# Fit and transform the 'Size' column
df['Size_encoded'] = le.fit_transform(df['Size'])
print(df[['Size', 'Size_encoded']])
Output:
Size Size_encoded
0 Large 0
1 Medium 1
2 Small 2
3 Large 0
4 Medium 1
Note that LabelEncoder assigns codes in alphabetical order of the labels (Large=0, Medium=1, Small=2), not in order of appearance.
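Because the fitted encoder stores its mapping (in le.classes_), you can decode integer codes back into labels, which is handy when interpreting model output:
# Decode integer codes back to the original labels
print(le.inverse_transform([0, 1, 2]))  # ['Large' 'Medium' 'Small']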
When to use Label Encoding:
- When the categorical variable is ordinal (has a meaningful order)
- When the algorithm can handle numerical relationships (like decision trees)
- When you have high-cardinality features (many unique categories)
Limitations:
- Creates a false sense of ordering in nominal categories
- The model might interpret higher numbers as "more important"
One-Hot Encoding
One-hot encoding creates binary columns for each category, which avoids the ordinal relationship issue in label encoding.
Using pandas get_dummies()
# One-hot encoding with get_dummies()
color_dummies = pd.get_dummies(df['Color'], prefix='Color')
print(color_dummies)
# Adding the encoded columns to the original dataframe
df_with_dummies = pd.concat([df, color_dummies], axis=1)
print(df_with_dummies.head())
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
Product Color Size Rating Price Color_Blue Color_Green Color_Red
0 Laptop Red Large 4 1200 0 0 1
1 Phone Blue Medium 3 800 1 0 0
2 Tablet Green Small 5 500 0 1 0
3 Laptop Blue Large 2 1300 1 0 0
4 Phone Red Medium 3 750 0 0 1
One-Hot Encoding Multiple Columns
# One-hot encoding multiple columns at once
df_encoded = pd.get_dummies(df, columns=['Product', 'Color', 'Size'])
print(df_encoded.head())
Output (abbreviated):
Rating Price Product_Laptop Product_Phone Product_Tablet Color_Blue ...
0 4 1200 1 0 0 0 ...
1 3 800 0 1 0 1 ...
...
When to use One-Hot Encoding:
- For nominal categorical features without ordinal relationships
- When the algorithm doesn't natively handle categorical data (like linear regression, SVM)
- When the number of categories is relatively small
Limitations:
- Creates many columns for high-cardinality features
- Can lead to the "curse of dimensionality"
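For linear models, a common refinement is to drop one dummy column per feature: the dropped category is implied whenever all remaining dummies are zero, which avoids the perfect multicollinearity known as the dummy variable trap:
# drop_first=True drops the first (alphabetical) category, here Blue;
# a row with Color_Green=0 and Color_Red=0 therefore means Blue
color_dummies_reduced = pd.get_dummies(df['Color'], prefix='Color', drop_first=True)
print(color_dummies_reduced)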
Ordinal Encoding
Ordinal encoding is appropriate when categories have a meaningful order.
# Define an ordering for the 'Size' column
size_ordering = {'Small': 1, 'Medium': 2, 'Large': 3}
# Apply the mapping
df['Size_ordinal'] = df['Size'].map(size_ordering)
print(df[['Size', 'Size_ordinal']])
Output:
Size Size_ordinal
0 Large 3
1 Medium 2
2 Small 1
3 Large 3
4 Medium 2
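pandas can also store the ordering in the dtype itself with an ordered categorical; its codes then respect the order you specify rather than alphabetical order:
# An ordered categorical carries the ordering in the dtype
df['Size_cat'] = pd.Categorical(df['Size'],
                                categories=['Small', 'Medium', 'Large'],
                                ordered=True)
print(df['Size_cat'].cat.codes)  # Small=0, Medium=1, Large=2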
Using scikit-learn's OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
# Define the categories in order
size_categories = [['Small', 'Medium', 'Large']] # Note the nested list
# Initialize the encoder
ord_encoder = OrdinalEncoder(categories=size_categories)
# OrdinalEncoder expects 2D input, so select the column with double brackets
df['Size_ordinal_sklearn'] = ord_encoder.fit_transform(df[['Size']])
print(df[['Size', 'Size_ordinal_sklearn']])
Output:
Size Size_ordinal_sklearn
0 Large 2.0
1 Medium 1.0
2 Small 0.0
3 Large 2.0
4 Medium 1.0
When to use Ordinal Encoding:
- When the categorical variable has a clear, meaningful order
- When the ordering relationship matters for the analysis or model
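One practical caveat: by default, a fitted OrdinalEncoder raises an error when it meets a category it never saw during fit. If unseen values are possible at prediction time, you can reserve a sentinel code for them (supported in scikit-learn 0.24 and later):
# Reserve -1 for categories not seen during fit
safe_encoder = OrdinalEncoder(categories=size_categories,
                              handle_unknown='use_encoded_value',
                              unknown_value=-1)
safe_encoder.fit(df[['Size']])
print(safe_encoder.transform(pd.DataFrame({'Size': ['Medium', 'Gigantic']})))
# [[ 1.]
#  [-1.]]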
Binary Encoding
Binary encoding is a more space-efficient alternative to one-hot encoding for high-cardinality features. It first assigns each category an integer and then spreads that integer's binary representation across columns, one column per bit.
# We'll need category-encoders package
# !pip install category-encoders
import category_encoders as ce
# Initialize the encoder
binary_encoder = ce.BinaryEncoder(cols=['Product'])
# Apply binary encoding (the original 'Product' column is replaced
# by the bit columns Product_0 and Product_1)
df_binary = binary_encoder.fit_transform(df)
print(df_binary[['Product_0', 'Product_1']].head())
Output:
Product_0 Product_1
0 0 1
1 1 0
2 1 1
3 0 1
4 1 0
BinaryEncoder numbers categories from 1 in order of appearance (Laptop=1, Phone=2, Tablet=3) and writes each number in binary, so three categories fit in two columns.
When to use Binary Encoding:
- For high-cardinality nominal features
- When you want a balance between one-hot encoding and label encoding
- To reduce dimensionality compared to one-hot encoding
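To get a feel for the space savings, compare column counts: one-hot encoding needs one column per category, while binary encoding needs roughly the base-2 logarithm of that (category-encoders numbers categories from 1, hence the +1 below):
import math
# Columns required by each scheme as cardinality grows
for n in [10, 100, 1000]:
    print(f"{n} categories: one-hot -> {n} cols, binary -> {math.ceil(math.log2(n + 1))} cols")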
Practical Examples
Example 1: Preparing Data for a Machine Learning Model
Let's prepare a dataset for a machine learning model to predict prices:
# More realistic dataset
housing_data = {
'neighborhood': ['Downtown', 'Suburb', 'Rural', 'Downtown', 'Suburb', 'Rural'],
'house_type': ['Apartment', 'House', 'House', 'Apartment', 'House', 'House'],
'size_category': ['Small', 'Medium', 'Large', 'Small', 'Large', 'Medium'],
'price': [250000, 320000, 180000, 230000, 380000, 210000]
}
housing_df = pd.DataFrame(housing_data)
print("Original dataset:")
print(housing_df)
# Encoding for machine learning
# One-hot encode neighborhood and house_type (nominal)
nominal_encoded = pd.get_dummies(housing_df, columns=['neighborhood', 'house_type'])
# Ordinal encode size_category
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
nominal_encoded['size_encoded'] = nominal_encoded['size_category'].map(size_order)
# Final dataframe ready for modeling
ml_ready_df = nominal_encoded.drop('size_category', axis=1)
print("\nDataset ready for modeling:")
print(ml_ready_df.head())
Output:
Original dataset:
neighborhood house_type size_category price
0 Downtown Apartment Small 250000
1 Suburb House Medium 320000
2 Rural House Large 180000
3 Downtown Apartment Small 230000
4 Suburb House Large 380000
5 Rural House Medium 210000
Dataset ready for modeling:
price neighborhood_Downtown neighborhood_Rural neighborhood_Suburb \
0 250000 1 0 0
1 320000 0 0 1
2 180000 0 1 0
3 230000 1 0 0
4 380000 0 0 1
5 210000 0 1 0
house_type_Apartment house_type_House size_encoded
0 1 0 1
1 0 1 2
2 0 1 3
3 1 0 1
4 0 1 3
5 0 1 2
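As a quick sanity check that the encoded frame really is model-ready, here is a minimal sketch fitting a scikit-learn regressor on it (illustrative only; six rows are far too few for a meaningful model):
from sklearn.linear_model import LinearRegression
# Fit on the fully numeric frame; this would fail on the raw string columns
X = ml_ready_df.drop('price', axis=1)
y = ml_ready_df['price']
model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 on the training data itself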
Example 2: Handling Missing Values in Categorical Data
# Dataset with missing values
missing_data = {
'product': ['Laptop', 'Phone', None, 'Tablet', 'Phone'],
'category': ['Electronics', 'Electronics', 'Clothing', None, 'Electronics'],
'price': [1200, 800, 50, 500, 750]
}
missing_df = pd.DataFrame(missing_data)
print("Dataset with missing values:")
print(missing_df)
# Fill missing values before encoding (avoid chained inplace fillna,
# which is deprecated in recent pandas)
missing_df['product'] = missing_df['product'].fillna('Unknown')
missing_df['category'] = missing_df['category'].fillna('Unknown')
# Encode with get_dummies; since the NaNs are already filled, the default
# dummy_na=False is fine (dummy_na=True would instead add a NaN indicator column)
encoded_missing = pd.get_dummies(missing_df, columns=['product', 'category'], dummy_na=False)
print("\nEncoded dataset:")
print(encoded_missing)
Output:
Dataset with missing values:
product category price
0 Laptop Electronics 1200
1 Phone Electronics 800
2 None Clothing 50
3 Tablet None 500
4 Phone Electronics 750
Encoded dataset:
price product_Laptop product_Phone product_Tablet product_Unknown \
0 1200 1 0 0 0
1 800 0 1 0 0
2 50 0 0 0 1
3 500 0 0 1 0
4 750 0 1 0 0
category_Clothing category_Electronics category_Unknown
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
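Alternatively, you can skip the fillna step and let get_dummies flag the gaps itself: dummy_na=True adds an indicator column for NaN (e.g. product_nan) to each encoded feature:
# Without pre-filling, dummy_na=True adds a NaN indicator column per feature
raw_df = pd.DataFrame(missing_data)
encoded_with_na = pd.get_dummies(raw_df, columns=['product', 'category'], dummy_na=True)
print(encoded_with_na.columns.tolist())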
Example 3: Encoding for Time Series Data
# Time series data with day of week
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
ts_data = {
'date': dates,
'value': np.random.randn(10) * 10 + 100
}
ts_df = pd.DataFrame(ts_data)
# Extract day of week
ts_df['day_of_week'] = ts_df['date'].dt.day_name()
print("Time series data:")
print(ts_df)
# Cyclical encoding for day of week (especially useful for time features)
# Map days to numbers 0-6
days_mapping = {
'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
ts_df['day_num'] = ts_df['day_of_week'].map(days_mapping)
# Create cyclical features using sine and cosine transformations
ts_df['day_sin'] = np.sin(ts_df['day_num'] * (2 * np.pi / 7))
ts_df['day_cos'] = np.cos(ts_df['day_num'] * (2 * np.pi / 7))
print("\nTime series with cyclical encoding:")
print(ts_df[['date', 'day_of_week', 'day_num', 'day_sin', 'day_cos']])
Output (the value column will vary, since the data is unseeded random noise):
Time series data:
date value day_of_week
0 2023-01-01 101.356352 Sunday
1 2023-01-02 109.154123 Monday
...
Time series with cyclical encoding:
date day_of_week day_num day_sin day_cos
0 2023-01-01 Sunday 6 -0.781831 0.623490
1 2023-01-02 Monday 0 0.000000 1.000000
...
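The payoff of the sine/cosine pair is that adjacency survives the week boundary: Sunday and Monday end up as close together as any other pair of neighboring days, whereas their raw day numbers (6 and 0) look maximally far apart:
# Distance in (sin, cos) space treats the week as a circle
sun = np.array([np.sin(6 * 2 * np.pi / 7), np.cos(6 * 2 * np.pi / 7)])
mon = np.array([np.sin(0.0), np.cos(0.0)])
tue = np.array([np.sin(1 * 2 * np.pi / 7), np.cos(1 * 2 * np.pi / 7)])
print(np.linalg.norm(sun - mon))  # ~0.868
print(np.linalg.norm(mon - tue))  # ~0.868, the same gap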
Choosing the Right Encoding Method
Choosing the right encoding method is crucial for your analysis. Here's a quick guide:
| Encoding Method | Good For | Limitations |
|---|---|---|
| Label Encoding | Ordinal variables, tree-based algorithms | Creates false ordering for nominal variables |
| One-Hot Encoding | Nominal variables, linear models | Increases dimensionality, problematic with high cardinality |
| Ordinal Encoding | Variables with clear ordering | Requires domain knowledge to establish order |
| Binary Encoding | High-cardinality features | More complex than simpler methods |
| Cyclical Encoding | Cyclical features (time, directions) | Only applicable to special cases |
Summary
Categorical encoding is a fundamental step in data preprocessing. In this tutorial, we've covered:
- Label encoding for converting categories to integers
- One-hot encoding for creating binary columns
- Ordinal encoding for ordered categorical variables
- Binary encoding for efficient representation of high-cardinality features
- Practical examples showing how to apply these techniques
Remember that the choice of encoding method should depend on:
- The nature of the categorical variable (nominal vs. ordinal)
- The machine learning algorithm you're using
- The cardinality (number of unique values) of the feature
Exercises
- Basic Encoding: Create a dataset with at least 3 categorical features and apply label encoding to all of them.
- Mixed Encoding: Create a dataset with both nominal and ordinal features. Apply the appropriate encoding method to each.
- Advanced Challenge: Take a public dataset (like Titanic from Kaggle) and prepare all categorical features using the most appropriate encoding techniques.
- Performance Comparison: Compare the performance of a machine learning model (like Random Forest) on the same dataset using different encoding techniques.
- Custom Encoder: Implement a frequency encoder that replaces categories with their frequency in the dataset.
Happy encoding!