
Pandas Feature Engineering

Introduction

Feature engineering is the process of transforming raw data into features that better represent the underlying patterns in your data, making it more suitable for machine learning algorithms. In data science workflows, feature engineering is often considered one of the most important steps for improving model performance.

With pandas, a powerful Python library for data manipulation and analysis, you can efficiently perform various feature engineering tasks. This tutorial will guide you through common feature engineering techniques using pandas, from basic transformations to more advanced patterns.

Why Feature Engineering Matters

Before diving into the techniques, let's understand why feature engineering is crucial:

  • It helps extract meaningful information from raw data
  • It can improve model performance significantly
  • It allows you to incorporate domain knowledge into your data
  • It can help reduce dimensionality and simplify models

Getting Started with Feature Engineering in Pandas

Let's begin by importing the necessary libraries and creating a sample dataset:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Create a sample dataset
data = {
    'customer_id': range(1, 11),
    'age': [25, 35, 45, 22, 38, 55, 41, 28, 33, 52],
    'income': [35000, 60000, 80000, 22000, 45000, 95000, 75000, 40000, 55000, 85000],
    'purchase_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05',
                      '2023-01-25', '2023-03-15', '2023-02-10', '2023-03-05', '2023-01-05'],
    'purchase_amount': [120.50, 300.25, 450.75, 75.30, 225.45, 550.80, 320.60, 150.20, 280.90, 500.15],
    'product_category': ['electronics', 'clothing', 'home', 'electronics', 'food',
                         'home', 'clothing', 'food', 'electronics', 'home']
}

df = pd.DataFrame(data)

# Convert purchase_date to datetime
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

# Display the first few rows
print(df.head())

This will output:

   customer_id  age  income purchase_date  purchase_amount product_category
0            1   25   35000    2023-01-15           120.50      electronics
1            2   35   60000    2023-02-20           300.25         clothing
2            3   45   80000    2023-01-30           450.75             home
3            4   22   22000    2023-03-10            75.30      electronics
4            5   38   45000    2023-02-05           225.45             food

Basic Feature Engineering Techniques

1. Numerical Transformations

Scaling Features

Scaling puts numeric features on comparable ranges, so that variables measured in large units (like income) don't dominate those measured in small units (like age) during model training.

python
# Min-Max scaling (normalize to 0-1 range)
df['income_normalized'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# Standard scaling (z-score normalization)
df['age_standardized'] = (df['age'] - df['age'].mean()) / df['age'].std()

print(df[['income', 'income_normalized', 'age', 'age_standardized']].head())

Output:

   income  income_normalized  age  age_standardized
0   35000           0.178082   25         -1.031379
1   60000           0.520548   35          0.047063
2   80000           0.794521   45          1.125505
3   22000           0.000000   22         -1.341760
4   45000           0.315068   38          0.357444
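If several columns need the same treatment, the transformations above can be wrapped in small helper functions and applied in one pass with `DataFrame.apply`. A minimal sketch (the helper names `min_max_scale` and `z_score` are illustrative, not pandas built-ins):

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale a Series to the 0-1 range."""
    return (s - s.min()) / (s.max() - s.min())

def z_score(s: pd.Series) -> pd.Series:
    """Standardize a Series to zero mean and unit (sample) std."""
    return (s - s.mean()) / s.std()

df = pd.DataFrame({'age': [25, 35, 45], 'income': [35000, 60000, 80000]})
scaled = df[['age', 'income']].apply(min_max_scale)
print(scaled['age'].tolist())  # [0.0, 0.5, 1.0]
```

Both scalers learn their statistics from the data they are given; with a train/test split you would compute the min/max or mean/std on the training set only and reuse them on the test set.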

Log Transformation

Useful for highly skewed data:

python
# Apply log transformation
df['log_purchase_amount'] = np.log(df['purchase_amount'])

# Compare distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['purchase_amount'])
ax1.set_title('Original Purchase Amount')
ax2.hist(df['log_purchase_amount'])
ax2.set_title('Log-transformed Purchase Amount')
plt.tight_layout()
# plt.show() # Uncomment this in your actual code

2. Binning (Discretization)

Binning transforms continuous variables into categorical ones:

python
# Create age groups
bins = [20, 30, 40, 50, 60]
labels = ['20s', '30s', '40s', '50s']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

# Create income categories
df['income_category'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])

print(df[['age', 'age_group', 'income', 'income_category']].head())

Output:

   age age_group  income income_category
0   25       20s   35000             Low
1   35       30s   60000          Medium
2   45       40s   80000            High
3   22       20s   22000             Low
4   38       30s   45000          Medium
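One practical detail with `pd.qcut`: the bin edges are derived from the data, so if you need to apply the same binning to new data later, capture the edges with `retbins=True` and reuse them via `pd.cut`. A short sketch:

```python
import pandas as pd

income = pd.Series([35000, 60000, 80000, 22000, 45000,
                    95000, 75000, 40000, 55000, 85000])

# qcut into terciles, also returning the computed bin edges
categories, edges = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'], retbins=True)
print(edges)  # four edges spanning the data's min to max

# Reuse the same edges on new data with pd.cut
new_income = pd.Series([30000, 70000])
print(pd.cut(new_income, bins=edges, labels=['Low', 'Medium', 'High']).tolist())  # ['Low', 'Medium']
```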

3. One-Hot Encoding

One-hot encoding converts a categorical variable into binary indicator columns, one per category, that machine learning algorithms can consume:

python
# One-hot encode product categories
one_hot = pd.get_dummies(df['product_category'], prefix='category', dtype=int)  # dtype=int gives 0/1 instead of booleans
df = pd.concat([df, one_hot], axis=1)

print(df[['product_category', 'category_clothing', 'category_electronics', 'category_food', 'category_home']].head())

Output:

  product_category  category_clothing  category_electronics  category_food  category_home
0      electronics                  0                     1              0              0
1         clothing                  1                     0              0              0
2             home                  0                     0              0              1
3      electronics                  0                     1              0              0
4             food                  0                     0              1              0
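A related option: with k categories, any one dummy column is fully determined by the other k-1, which can introduce multicollinearity in linear models. `pd.get_dummies` accepts `drop_first=True` to keep only k-1 columns (and `dtype=int` for 0/1 values instead of the boolean default in recent pandas versions). A sketch:

```python
import pandas as pd

categories = pd.Series(['electronics', 'clothing', 'home', 'electronics', 'food'])

# drop_first removes the alphabetically first category's column
dummies = pd.get_dummies(categories, prefix='category', drop_first=True, dtype=int)
print(dummies.columns.tolist())
# A row of all zeros now means 'clothing', the dropped category
```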

Advanced Feature Engineering Techniques

1. Feature Extraction from Datetime

Dates can be a rich source of features:

python
# Extract components from purchase date
df['purchase_year'] = df['purchase_date'].dt.year
df['purchase_month'] = df['purchase_date'].dt.month
df['purchase_day'] = df['purchase_date'].dt.day
df['purchase_dayofweek'] = df['purchase_date'].dt.dayofweek # 0 is Monday, 6 is Sunday
df['purchase_quarter'] = df['purchase_date'].dt.quarter

# Is weekend feature
df['is_weekend'] = df['purchase_dayofweek'].isin([5, 6]).astype(int)

print(df[['purchase_date', 'purchase_month', 'purchase_dayofweek', 'is_weekend']].head())

Output:

  purchase_date  purchase_month  purchase_dayofweek  is_weekend
0    2023-01-15               1                   6           1
1    2023-02-20               2                   0           0
2    2023-01-30               1                   0           0
3    2023-03-10               3                   4           0
4    2023-02-05               2                   6           1
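One caveat with raw month or day-of-week numbers: December (12) and January (1) are adjacent in time but far apart numerically, which can mislead distance-based models. A common remedy is cyclical sine/cosine encoding; a minimal sketch, separate from the tutorial dataset:

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])

# Map each month onto a point on the unit circle so that
# month 12 and month 1 encode to nearby points
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

print(month_cos.round(3).tolist())  # [0.866, -1.0, 1.0]
```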

2. Creating Interaction Features

Interaction features capture relationships between variables:

python
# Multiply age and income
df['age_income_interaction'] = df['age'] * df['income'] / 1000 # Scale down for readability

# Create a spending ratio (purchase amount relative to income)
df['spending_ratio'] = df['purchase_amount'] / df['income'] * 1000 # Multiply by 1000 for readability

print(df[['age', 'income', 'age_income_interaction', 'purchase_amount', 'spending_ratio']].head())

Output:

   age  income  age_income_interaction  purchase_amount  spending_ratio
0   25   35000                   875.0           120.50        3.442857
1   35   60000                  2100.0           300.25        5.004167
2   45   80000                  3600.0           450.75        5.634375
3   22   22000                   484.0            75.30        3.422727
4   38   45000                  1710.0           225.45        5.010000
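When there are more than a couple of numeric columns, pairwise interactions can also be generated programmatically instead of one by one. A sketch using `itertools.combinations` (the `tenure` column is a hypothetical extra, not part of the tutorial dataset):

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'age': [25, 35, 45],
    'income': [35000, 60000, 80000],
    'tenure': [1, 3, 5],  # hypothetical extra numeric column
})

# Multiply every pair of numeric columns into a new feature
for a, b in combinations(['age', 'income', 'tenure'], 2):
    df[f'{a}_x_{b}'] = df[a] * df[b]

print(list(df.columns[-3:]))  # ['age_x_income', 'age_x_tenure', 'income_x_tenure']
```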

3. Aggregating Group Statistics

Creating statistical aggregates for different groups:

python
# Group statistics by product category
category_stats = df.groupby('product_category')['purchase_amount'].agg(['mean', 'min', 'max'])
print("Purchase statistics by product category:")
print(category_stats)

# Map these statistics back to our main dataframe
df = df.merge(category_stats, left_on='product_category', right_index=True, how='left')
df.rename(columns={'mean': 'category_avg_purchase',
                   'min': 'category_min_purchase',
                   'max': 'category_max_purchase'}, inplace=True)

# Calculate how much each purchase deviates from the category average
df['purchase_vs_category_avg'] = df['purchase_amount'] - df['category_avg_purchase']

print("\nPurchase amount compared to category average:")
print(df[['product_category', 'purchase_amount', 'category_avg_purchase', 'purchase_vs_category_avg']].head())

Output:

Purchase statistics by product category:
                        mean     min     max
product_category
clothing          310.425000  300.25  320.60
electronics       158.900000   75.30  280.90
food              187.825000  150.20  225.45
home              500.566667  450.75  550.80

Purchase amount compared to category average:
  product_category  purchase_amount  category_avg_purchase  purchase_vs_category_avg
0      electronics           120.50             158.900000                -38.400000
1         clothing           300.25             310.425000                -10.175000
2             home           450.75             500.566667                -49.816667
3      electronics            75.30             158.900000                -83.600000
4             food           225.45             187.825000                 37.625000

4. Mathematical Transformations

Creating polynomial features and other mathematical transformations:

python
# Square and square root transformations
df['age_squared'] = df['age'] ** 2
df['income_sqrt'] = np.sqrt(df['income'])

# Logarithmic transformations (adding 1 to avoid log(0))
df['purchase_log'] = np.log1p(df['purchase_amount'])

print(df[['age', 'age_squared', 'income', 'income_sqrt', 'purchase_amount', 'purchase_log']].head())

Output:

   age  age_squared  income  income_sqrt  purchase_amount  purchase_log
0   25          625   35000   187.082869           120.50      4.800909
1   35         1225   60000   244.948974           300.25      5.705743
2   45         2025   80000   282.842712           450.75      6.112282
3   22          484   22000   148.323970            75.30      4.331331
4   38         1444   45000   212.132034           225.45      5.420535

Real-World Application: Building a Customer Spending Prediction Feature Set

Let's put these techniques together to create a comprehensive feature set for predicting customer spending:

python
# Start with essential features
feature_df = df[['customer_id', 'age', 'income', 'purchase_amount', 'product_category']].copy()

# Add datetime features
feature_df['purchase_month'] = df['purchase_date'].dt.month
feature_df['purchase_quarter'] = df['purchase_date'].dt.quarter
feature_df['is_weekend'] = df['purchase_dayofweek'].isin([5, 6]).astype(int)

# Add transformed features
feature_df['income_log'] = np.log(df['income'])
feature_df['age_group'] = df['age_group']

# Add interaction features
feature_df['income_per_age'] = df['income'] / df['age']
feature_df['spending_ratio'] = df['purchase_amount'] / df['income'] * 1000

# Add normalized values
feature_df['purchase_amount_norm'] = (df['purchase_amount'] - df['purchase_amount'].mean()) / df['purchase_amount'].std()

# Add one-hot encoding for categorical variables
feature_df = pd.concat([
    feature_df,
    pd.get_dummies(feature_df['product_category'], prefix='category', dtype=int)
], axis=1)

# Add the relative category features
feature_df['purchase_vs_category_avg'] = df['purchase_vs_category_avg']

# Display the final feature set
print(feature_df.head())

# Inspect linear correlation with the target (in a real scenario, you would derive feature importance from a trained model)
print("\nCorrelation with purchase amount:")
correlations = feature_df.drop(['customer_id', 'product_category'], axis=1).corr(numeric_only=True)['purchase_amount'].sort_values(ascending=False)  # numeric_only skips non-numeric columns like age_group
print(correlations)

Output:

   customer_id  age  income  purchase_amount product_category  purchase_month  purchase_quarter  is_weekend  income_log age_group  income_per_age  spending_ratio  purchase_amount_norm  category_clothing  category_electronics  category_food  category_home  purchase_vs_category_avg
0            1   25   35000           120.50      electronics               1                 1           1   10.462658       20s          1400.0        3.442857             -0.855103                  0                     1              0              0                -38.400000
1            2   35   60000           300.25         clothing               2                 1           0   11.002553       30s          1714.3        5.004167              0.059375                  1                     0              0              0                -10.175000
2            3   45   80000           450.75             home               1                 1           0   11.289794       40s          1777.8        5.634375              0.818373                  0                     0              0              1                -49.816667
3            4   22   22000            75.30      electronics               3                 1           0    9.998797       20s          1000.0        3.422727             -1.077900                  0                     1              0              0                -83.600000
4            5   38   45000           225.45             food               2                 1           1   10.714418       30s          1184.2        5.010000             -0.242056                  0                     0              1              0                 37.625000

Correlation with purchase amount:
purchase_amount             1.000000
purchase_amount_norm        1.000000
income                      0.902123
income_log                  0.848939
category_home               0.665663
income_per_age              0.514988
purchase_vs_category_avg    0.458211
age                         0.379046
spending_ratio              0.117562
...

Summary and Best Practices

Feature engineering with pandas allows you to:

  1. Improve model performance by creating features that better represent underlying patterns
  2. Transform data to be more suitable for machine learning algorithms
  3. Incorporate domain knowledge into your data preparation process
  4. Extract valuable information from complex data types like dates and categories

Key techniques we covered:

  • Numerical transformations (scaling, log transformations)
  • Binning continuous variables
  • One-hot encoding categorical variables
  • Feature extraction from datetime fields
  • Creating interaction features
  • Aggregating group statistics
  • Mathematical transformations

Best practices:

  • Keep feature creation aligned with your problem domain
  • Monitor for multicollinearity in created features
  • Document your feature engineering steps for reproducibility
  • Use visualization to understand feature distributions
  • Consider the impact of feature engineering on model interpretability
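On the multicollinearity point, a quick first pass is to scan the upper triangle of the absolute correlation matrix for highly correlated feature pairs. A minimal sketch (the 0.95 threshold is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({
    'income': [35000, 60000, 80000, 22000, 45000],
    'income_log': np.log([35000, 60000, 80000, 22000, 45000]),
    'age': [25, 35, 45, 22, 38],
})

# Keep only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag feature pairs above a (tunable) threshold
pairs = upper.stack()
high_corr = pairs[pairs > 0.95]
print(high_corr)  # income / income_log are near-duplicates
```

Redundant pairs flagged this way are candidates for dropping one member, especially before fitting linear models.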

Exercises

  1. Add a "high_spender" binary feature to the dataset that is 1 if a customer's purchase amount is above the average, 0 otherwise.
  2. Create a "days_since_quarter_start" feature that calculates how many days after the start of the quarter each purchase was made.
  3. Implement a feature that represents the spending as a percentage of the maximum spending in that customer's age group.
  4. Create polynomial features (degree 2) for age and income, and examine their correlation with purchase amount.
  5. Experiment with different binning strategies for purchase_amount and compare their effect on the correlation patterns.

Conclusion

By mastering feature engineering with pandas, you'll be able to significantly improve the performance of your machine learning models and gain deeper insights into your data.


