
Pandas Feature Engineering

Introduction

Feature engineering is the process of transforming raw data into features that better represent the underlying patterns in your data, making it more suitable for machine learning algorithms. In data science workflows, feature engineering is often considered one of the most important steps for improving model performance.

With pandas, a powerful Python library for data manipulation and analysis, you can efficiently perform various feature engineering tasks. This tutorial will guide you through common feature engineering techniques using pandas, from basic transformations to more advanced patterns.

Why Feature Engineering Matters

Before diving into the techniques, let's understand why feature engineering is crucial:

  • It helps extract meaningful information from raw data
  • It can improve model performance significantly
  • It allows you to incorporate domain knowledge into your data
  • It can help reduce dimensionality and simplify models

Getting Started with Feature Engineering in Pandas

Let's begin by importing the necessary libraries and creating a sample dataset:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Create a sample dataset
data = {
    'customer_id': range(1, 11),
    'age': [25, 35, 45, 22, 38, 55, 41, 28, 33, 52],
    'income': [35000, 60000, 80000, 22000, 45000, 95000, 75000, 40000, 55000, 85000],
    'purchase_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05',
                      '2023-01-25', '2023-03-15', '2023-02-10', '2023-03-05', '2023-01-05'],
    'purchase_amount': [120.50, 300.25, 450.75, 75.30, 225.45, 550.80, 320.60, 150.20, 280.90, 500.15],
    'product_category': ['electronics', 'clothing', 'home', 'electronics', 'food',
                         'home', 'clothing', 'food', 'electronics', 'home']
}

df = pd.DataFrame(data)

# Convert purchase_date to datetime
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

# Display the first few rows
print(df.head())

This will output:

   customer_id  age  income purchase_date  purchase_amount product_category
0            1   25   35000    2023-01-15           120.50      electronics
1            2   35   60000    2023-02-20           300.25         clothing
2            3   45   80000    2023-01-30           450.75             home
3            4   22   22000    2023-03-10            75.30      electronics
4            5   38   45000    2023-02-05           225.45             food

Basic Feature Engineering Techniques

1. Numerical Transformations

Scaling Features

Scaling puts numeric features on comparable ranges, so that variables measured in large units (like income) don't dominate those measured in small units (like age) during model training.

python
# Min-Max scaling (normalize to 0-1 range)
df['income_normalized'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# Standard scaling (z-score normalization)
df['age_standardized'] = (df['age'] - df['age'].mean()) / df['age'].std()

print(df[['income', 'income_normalized', 'age', 'age_standardized']].head())

Output:

   income  income_normalized  age  age_standardized
0   35000           0.178082   25         -1.031379
1   60000           0.520548   35          0.047063
2   80000           0.794521   45          1.125505
3   22000           0.000000   22         -1.341760
4   45000           0.315068   38          0.357444
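If several columns need the same treatment, the transformations above can be wrapped in small helper functions and applied in one pass with `DataFrame.apply`. A minimal sketch (the helper names `min_max_scale` and `z_score` are illustrative, not pandas built-ins):

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale a Series to the 0-1 range."""
    return (s - s.min()) / (s.max() - s.min())

def z_score(s: pd.Series) -> pd.Series:
    """Standardize a Series to zero mean and unit (sample) std."""
    return (s - s.mean()) / s.std()

df = pd.DataFrame({'age': [25, 35, 45], 'income': [35000, 60000, 80000]})
scaled = df[['age', 'income']].apply(min_max_scale)
print(scaled['age'].tolist())  # [0.0, 0.5, 1.0]
```

Both scalers learn their statistics from the data they are given; with a train/test split you would compute the min/max or mean/std on the training set only and reuse them on the test set.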

Log Transformation

Useful for highly skewed data:

python
# Apply log transformation
df['log_purchase_amount'] = np.log(df['purchase_amount'])

# Compare distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['purchase_amount'])
ax1.set_title('Original Purchase Amount')
ax2.hist(df['log_purchase_amount'])
ax2.set_title('Log-transformed Purchase Amount')
plt.tight_layout()
# plt.show() # Uncomment this in your actual code

2. Binning (Discretization)

Binning transforms continuous variables into categorical ones:

python
# Create age groups
bins = [20, 30, 40, 50, 60]
labels = ['20s', '30s', '40s', '50s']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

# Create income categories
df['income_category'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])

print(df[['age', 'age_group', 'income', 'income_category']].head())

Output:

   age age_group  income income_category
0   25       20s   35000             Low
1   35       30s   60000          Medium
2   45       40s   80000            High
3   22       20s   22000             Low
4   38       30s   45000          Medium
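One practical detail with `pd.qcut`: the bin edges are derived from the data, so if you need to apply the same binning to new data later, capture the edges with `retbins=True` and reuse them via `pd.cut`. A short sketch:

```python
import pandas as pd

income = pd.Series([35000, 60000, 80000, 22000, 45000,
                    95000, 75000, 40000, 55000, 85000])

# qcut into terciles, also returning the computed bin edges
categories, edges = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'], retbins=True)
print(edges)  # four edges spanning the data's min to max

# Reuse the same edges on new data with pd.cut
new_income = pd.Series([30000, 70000])
print(pd.cut(new_income, bins=edges, labels=['Low', 'Medium', 'High']).tolist())  # ['Low', 'Medium']
```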

3. One-Hot Encoding

One-hot encoding converts a categorical variable into binary indicator columns, one per category, that machine learning algorithms can consume:

python
# One-hot encode product categories
one_hot = pd.get_dummies(df['product_category'], prefix='category', dtype=int)  # dtype=int gives 0/1 instead of booleans
df = pd.concat([df, one_hot], axis=1)

print(df[['product_category', 'category_clothing', 'category_electronics', 'category_food', 'category_home']].head())

Output:

  product_category  category_clothing  category_electronics  category_food  category_home
0      electronics                  0                     1              0              0
1         clothing                  1                     0              0              0
2             home                  0                     0              0              1
3      electronics                  0                     1              0              0
4             food                  0                     0              1              0
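A related option: with k categories, any one dummy column is fully determined by the other k-1, which can introduce multicollinearity in linear models. `pd.get_dummies` accepts `drop_first=True` to keep only k-1 columns (and `dtype=int` for 0/1 values instead of the boolean default in recent pandas versions). A sketch:

```python
import pandas as pd

categories = pd.Series(['electronics', 'clothing', 'home', 'electronics', 'food'])

# drop_first removes the alphabetically first category's column
dummies = pd.get_dummies(categories, prefix='category', drop_first=True, dtype=int)
print(dummies.columns.tolist())
# A row of all zeros now means 'clothing', the dropped category
```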

Advanced Feature Engineering Techniques

1. Feature Extraction from Datetime

Dates can be a rich source of features:

python
# Extract components from purchase date
df['purchase_year'] = df['purchase_date'].dt.year
df['purchase_month'] = df['purchase_date'].dt.month
df['purchase_day'] = df['purchase_date'].dt.day
df['purchase_dayofweek'] = df['purchase_date'].dt.dayofweek # 0 is Monday, 6 is Sunday
df['purchase_quarter'] = df['purchase_date'].dt.quarter

# Is weekend feature
df['is_weekend'] = df['purchase_dayofweek'].isin([5, 6]).astype(int)

print(df[['purchase_date', 'purchase_month', 'purchase_dayofweek', 'is_weekend']].head())

Output:

  purchase_date  purchase_month  purchase_dayofweek  is_weekend
0    2023-01-15               1                   6           1
1    2023-02-20               2                   0           0
2    2023-01-30               1                   0           0
3    2023-03-10               3                   4           0
4    2023-02-05               2                   6           1
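One caveat with raw month or day-of-week numbers: December (12) and January (1) are adjacent in time but far apart numerically, which can mislead distance-based models. A common remedy is cyclical sine/cosine encoding; a minimal sketch, separate from the tutorial dataset:

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])

# Map each month onto a point on the unit circle so that
# month 12 and month 1 encode to nearby points
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

print(month_cos.round(3).tolist())  # [0.866, -1.0, 1.0]
```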

2. Creating Interaction Features

Interaction features capture relationships between variables:

python
# Multiply age and income
df['age_income_interaction'] = df['age'] * df['income'] / 1000 # Scale down for readability

# Create a spending ratio (purchase amount relative to income)
df['spending_ratio'] = df['purchase_amount'] / df['income'] * 1000 # Multiply by 1000 for readability

print(df[['age', 'income', 'age_income_interaction', 'purchase_amount', 'spending_ratio']].head())

Output:

   age  income  age_income_interaction  purchase_amount  spending_ratio
0   25   35000                   875.0           120.50        3.442857
1   35   60000                  2100.0           300.25        5.004167
2   45   80000                  3600.0           450.75        5.634375
3   22   22000                   484.0            75.30        3.422727
4   38   45000                  1710.0           225.45        5.010000
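When there are more than a couple of numeric columns, pairwise interactions can also be generated programmatically instead of one by one. A sketch using `itertools.combinations` (the `tenure` column is a hypothetical extra, not part of the tutorial dataset):

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'age': [25, 35, 45],
    'income': [35000, 60000, 80000],
    'tenure': [1, 3, 5],  # hypothetical extra numeric column
})

# Multiply every pair of numeric columns into a new feature
for a, b in combinations(['age', 'income', 'tenure'], 2):
    df[f'{a}_x_{b}'] = df[a] * df[b]

print(list(df.columns[-3:]))  # ['age_x_income', 'age_x_tenure', 'income_x_tenure']
```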

3. Aggregating Group Statistics

Creating statistical aggregates for different groups:

python
# Group statistics by product category
category_stats = df.groupby('product_category')['purchase_amount'].agg(['mean', 'min', 'max'])
print("Purchase statistics by product category:")
print(category_stats)

# Map these statistics back to our main dataframe
df = df.merge(category_stats, left_on='product_category', right_index=True, how='left')
df.rename(columns={'mean': 'category_avg_purchase',
                   'min': 'category_min_purchase',
                   'max': 'category_max_purchase'}, inplace=True)

# Calculate how much each purchase deviates from the category average
df['purchase_vs_category_avg'] = df['purchase_amount'] - df['category_avg_purchase']

print("\nPurchase amount compared to category average:")
print(df[['product_category', 'purchase_amount', 'category_avg_purchase', 'purchase_vs_category_avg']].head())

Output:

Purchase statistics by product category:
                        mean     min     max
product_category
clothing          310.425000  300.25  320.60
electronics       158.900000   75.30  280.90
food              187.825000  150.20  225.45
home              500.566667  450.75  550.80

Purchase amount compared to category average:
  product_category  purchase_amount  category_avg_purchase  purchase_vs_category_avg
0      electronics           120.50             158.900000                -38.400000
1         clothing           300.25             310.425000                -10.175000
2             home           450.75             500.566667                -49.816667
3      electronics            75.30             158.900000                -83.600000
4             food           225.45             187.825000                 37.625000

4. Mathematical Transformations

Creating polynomial features and other mathematical transformations:

python
# Square and square root transformations
df['age_squared'] = df['age'] ** 2
df['income_sqrt'] = np.sqrt(df['income'])

# Logarithmic transformations (adding 1 to avoid log(0))
df['purchase_log'] = np.log1p(df['purchase_amount'])

print(df[['age', 'age_squared', 'income', 'income_sqrt', 'purchase_amount', 'purchase_log']].head())

Output:

   age  age_squared  income  income_sqrt  purchase_amount  purchase_log
0   25          625   35000   187.082869           120.50      4.800909
1   35         1225   60000   244.948974           300.25      5.705743
2   45         2025   80000   282.842712           450.75      6.112282
3   22          484   22000   148.323970            75.30      4.331331
4   38         1444   45000   212.132034           225.45      5.420535

Real-World Application: Building a Customer Spending Prediction Feature Set

Let's put these techniques together to create a comprehensive feature set for predicting customer spending:

python
# Start with essential features
feature_df = df[['customer_id', 'age', 'income', 'purchase_amount', 'product_category']].copy()

# Add datetime features
feature_df['purchase_month'] = df['purchase_date'].dt.month
feature_df['purchase_quarter'] = df['purchase_date'].dt.quarter
feature_df['is_weekend'] = df['purchase_dayofweek'].isin([5, 6]).astype(int)

# Add transformed features
feature_df['income_log'] = np.log(df['income'])
feature_df['age_group'] = df['age_group']

# Add interaction features
feature_df['income_per_age'] = df['income'] / df['age']
feature_df['spending_ratio'] = df['purchase_amount'] / df['income'] * 1000

# Add normalized values
feature_df['purchase_amount_norm'] = (df['purchase_amount'] - df['purchase_amount'].mean()) / df['purchase_amount'].std()

# Add one-hot encoding for categorical variables
feature_df = pd.concat([
    feature_df,
    pd.get_dummies(feature_df['product_category'], prefix='category', dtype=int)
], axis=1)

# Add the relative category features
feature_df['purchase_vs_category_avg'] = df['purchase_vs_category_avg']

# Display the final feature set
print(feature_df.head())

# Inspect linear correlation with the target (in a real scenario, you would derive feature importance from a trained model)
print("\nCorrelation with purchase amount:")
correlations = feature_df.drop(['customer_id', 'product_category'], axis=1).corr(numeric_only=True)['purchase_amount'].sort_values(ascending=False)  # numeric_only skips non-numeric columns like age_group
print(correlations)

Output:

   customer_id  age  income  purchase_amount product_category  purchase_month  purchase_quarter  is_weekend  income_log age_group  income_per_age  spending_ratio  purchase_amount_norm  category_clothing  category_electronics  category_food  category_home  purchase_vs_category_avg
0            1   25   35000           120.50      electronics               1                 1           1   10.462658       20s          1400.0        3.442857             -0.855103                  0                     1              0              0                -38.400000
1            2   35   60000           300.25         clothing               2                 1           0   11.002553       30s          1714.3        5.004167              0.059375                  1                     0              0              0                -10.175000
2            3   45   80000           450.75             home               1                 1           0   11.289794       40s          1777.8        5.634375              0.818373                  0                     0              0              1                -49.816667
3            4   22   22000            75.30      electronics               3                 1           0    9.998797       20s          1000.0        3.422727             -1.077900                  0                     1              0              0                -83.600000
4            5   38   45000           225.45             food               2                 1           1   10.714418       30s          1184.2        5.010000             -0.242056                  0                     0              1              0                 37.625000

Correlation with purchase amount:
purchase_amount             1.000000
purchase_amount_norm        1.000000
income                      0.902123
income_log                  0.848939
category_home               0.665663
income_per_age              0.514988
purchase_vs_category_avg    0.458211
age                         0.379046
spending_ratio              0.117562
...

Summary and Best Practices

Feature engineering with pandas allows you to:

  1. Improve model performance by creating features that better represent underlying patterns
  2. Transform data to be more suitable for machine learning algorithms
  3. Incorporate domain knowledge into your data preparation process
  4. Extract valuable information from complex data types like dates and categories

Key techniques we covered:

  • Numerical transformations (scaling, log transformations)
  • Binning continuous variables
  • One-hot encoding categorical variables
  • Feature extraction from datetime fields
  • Creating interaction features
  • Aggregating group statistics
  • Mathematical transformations

Best practices:

  • Keep feature creation aligned with your problem domain
  • Monitor for multicollinearity in created features
  • Document your feature engineering steps for reproducibility
  • Use visualization to understand feature distributions
  • Consider the impact of feature engineering on model interpretability
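On the multicollinearity point, a quick first pass is to scan the upper triangle of the absolute correlation matrix for highly correlated feature pairs. A minimal sketch (the 0.95 threshold is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({
    'income': [35000, 60000, 80000, 22000, 45000],
    'income_log': np.log([35000, 60000, 80000, 22000, 45000]),
    'age': [25, 35, 45, 22, 38],
})

# Keep only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag feature pairs above a (tunable) threshold
pairs = upper.stack()
high_corr = pairs[pairs > 0.95]
print(high_corr)  # income / income_log are near-duplicates
```

Redundant pairs flagged this way are candidates for dropping one member, especially before fitting linear models.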

Exercises

  1. Add a "high_spender" binary feature to the dataset that is 1 if a customer's purchase amount is above the average, 0 otherwise.
  2. Create a "days_since_quarter_start" feature that calculates how many days after the start of the quarter each purchase was made.
  3. Implement a feature that represents the spending as a percentage of the maximum spending in that customer's age group.
  4. Create polynomial features (degree 2) for age and income, and examine their correlation with purchase amount.
  5. Experiment with different binning strategies for purchase_amount and compare their effect on the correlation patterns.

Conclusion

By mastering feature engineering with pandas, you'll be able to significantly improve the performance of your machine learning models and gain deeper insights into your data.


