
Pandas Data Normalization

Introduction

Data normalization is a crucial preprocessing step in data analysis and machine learning workflows. It involves transforming your data to a standard scale, which helps algorithms perform better and converge faster. In pandas, we have several methods to normalize data, each with its own strengths and use cases.

In this tutorial, you'll learn:

  • Why data normalization is important
  • Different techniques for normalizing data in pandas
  • How to implement these techniques with practical examples
  • When to use each normalization method

Why Normalize Data?

Before diving into techniques, let's understand why normalization matters:

  1. Different scales: Features in datasets often have different units and scales (e.g., age in years, income in thousands)
  2. Algorithm performance: Many machine learning algorithms perform better with normalized data
  3. Convergence: Optimization algorithms converge faster with normalized features
  4. Equal weight: Ensures all features contribute equally to the model

Common Normalization Techniques

Min-Max Scaling (Normalization)

Min-Max scaling transforms your data to a specific range, typically [0, 1].

Formula:

X_normalized = (X - X_min) / (X_max - X_min)

Let's implement this in pandas:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = {
    'height': [165, 180, 175, 160, 185],
    'weight': [65, 85, 75, 60, 90],
    'age': [25, 35, 28, 22, 40]
}

df = pd.DataFrame(data)
print("Original data:")
print(df)

# Min-max normalization
def min_max_normalize(df):
    return (df - df.min()) / (df.max() - df.min())

normalized_df = min_max_normalize(df)
print("\nMin-Max Normalized data:")
print(normalized_df)

Output:

Original data:
   height  weight  age
0     165      65   25
1     180      85   35
2     175      75   28
3     160      60   22
4     185      90   40

Min-Max Normalized data:
   height    weight       age
0     0.2  0.166667  0.166667
1     0.8  0.833333  0.722222
2     0.6  0.500000  0.333333
3     0.0  0.000000  0.000000
4     1.0  1.000000  1.000000

StandardScaler (Z-Score Normalization)

Z-score normalization transforms your data to have a mean of 0 and a standard deviation of 1.

Formula:

X_standardized = (X - μ) / σ

Where μ is the mean and σ is the standard deviation.

python
# Z-score normalization
def z_score_normalize(df):
    return (df - df.mean()) / df.std()

standardized_df = z_score_normalize(df)
print("\nZ-Score Normalized data:")
print(standardized_df)

Output:

Z-Score Normalized data:
     height    weight       age
0 -0.771589 -0.784465 -0.677285
1  0.675140  0.784465  0.677285
2  0.192897  0.000000 -0.270914
3 -1.253831 -1.176697 -1.083657
4  1.157383  1.176697  1.354571

Robust Scaling

Robust scaling uses statistics that are robust to outliers, like median and quartiles.

python
# Robust scaling (using quantiles)
def robust_scale(df):
    return (df - df.median()) / (df.quantile(0.75) - df.quantile(0.25))

robust_df = robust_scale(df)
print("\nRobust Scaled data:")
print(robust_df)

Output:

Robust Scaled data:
     height  weight  age
0 -0.666667   -0.50 -0.3
1  0.333333    0.50  0.7
2  0.000000    0.00  0.0
3 -1.000000   -0.75 -0.6
4  0.666667    0.75  1.2

Using Scikit-learn for Normalization

For more complex datasets, you might want to use scikit-learn's preprocessing tools:

python
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# MinMaxScaler
mm_scaler = MinMaxScaler()
mm_scaled_data = pd.DataFrame(mm_scaler.fit_transform(df), columns=df.columns)
print("\nMinMaxScaler from sklearn:")
print(mm_scaled_data)

# StandardScaler
std_scaler = StandardScaler()
std_scaled_data = pd.DataFrame(std_scaler.fit_transform(df), columns=df.columns)
print("\nStandardScaler from sklearn:")
print(std_scaled_data)

# RobustScaler
rob_scaler = RobustScaler()
rob_scaled_data = pd.DataFrame(rob_scaler.fit_transform(df), columns=df.columns)
print("\nRobustScaler from sklearn:")
print(rob_scaled_data)

Output:

MinMaxScaler from sklearn:
   height    weight       age
0     0.2  0.166667  0.166667
1     0.8  0.833333  0.722222
2     0.6  0.500000  0.333333
3     0.0  0.000000  0.000000
4     1.0  1.000000  1.000000

StandardScaler from sklearn:
     height    weight       age
0 -0.862662 -0.877058 -0.757228
1  0.754829  0.877058  0.757228
2  0.215666  0.000000 -0.302891
3 -1.401826 -1.315587 -1.211565
4  1.293993  1.315587  1.514456

RobustScaler from sklearn:
     height  weight  age
0 -0.666667   -0.50 -0.3
1  0.333333    0.50  0.7
2  0.000000    0.00  0.0
3 -1.000000   -0.75 -0.6
4  0.666667    0.75  1.2

Note that the StandardScaler values differ slightly from the manual pandas version above: df.std() uses the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The RobustScaler result matches the manual version because both use the median and the interquartile range computed the same way.
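
A practical benefit of the scikit-learn scalers is that each fitted scaler keeps its fit parameters, so you can reverse the transformation with inverse_transform. Here is a minimal sketch using the std_scaler fitted above; the recovered values should match the original df up to floating-point rounding:

python
# Map the standardized values back to the original scale
recovered = pd.DataFrame(std_scaler.inverse_transform(std_scaled_data), columns=df.columns)
print("\nRecovered original data:")
print(recovered)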

Normalizing Specific Columns

Often, you'll need to normalize only numeric columns or specific features:

python
# Sample dataset with mixed types
mixed_data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'height': [165, 180, 175, 160, 185],
    'weight': [65, 85, 75, 60, 90],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin']
}

mixed_df = pd.DataFrame(mixed_data)
print("Mixed data:")
print(mixed_df)

# Get numeric columns
numeric_cols = mixed_df.select_dtypes(include=[np.number]).columns

# Normalize only numeric columns
mixed_df[numeric_cols] = min_max_normalize(mixed_df[numeric_cols])
print("\nMixed data with normalized numeric columns:")
print(mixed_df)

Output:

Mixed data:
      name  height  weight      city
0    Alice     165      65  New York
1      Bob     180      85    London
2  Charlie     175      75     Paris
3    David     160      60     Tokyo
4      Eve     185      90    Berlin

Mixed data with normalized numeric columns:
      name  height    weight      city
0    Alice     0.2  0.166667  New York
1      Bob     0.8  0.833333    London
2  Charlie     0.6  0.500000     Paris
3    David     0.0  0.000000     Tokyo
4      Eve     1.0  1.000000    Berlin
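
If you prefer to keep this step inside scikit-learn, ColumnTransformer can apply a scaler to selected columns while passing the others through unchanged. A minimal sketch, reusing the mixed_data dictionary from above (note that the result is a NumPy array with the scaled columns listed first):

python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Scale only the numeric columns; 'name' and 'city' are passed through untouched
preprocessor = ColumnTransformer(
    transformers=[('scale', MinMaxScaler(), ['height', 'weight'])],
    remainder='passthrough'
)

scaled_array = preprocessor.fit_transform(pd.DataFrame(mixed_data))
print(scaled_array)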

Real-World Application: Preprocessing for Machine Learning

Let's see a complete example using the famous iris dataset:

python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("First few rows of the iris dataset:")
print(X.head())

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without normalization
knn_without_scaling = KNeighborsClassifier(n_neighbors=3)
knn_without_scaling.fit(X_train, y_train)
y_pred_without_scaling = knn_without_scaling.predict(X_test)
accuracy_without_scaling = accuracy_score(y_test, y_pred_without_scaling)

# With normalization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scaling = KNeighborsClassifier(n_neighbors=3)
knn_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = knn_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

print(f"\nAccuracy without scaling: {accuracy_without_scaling:.4f}")
print(f"Accuracy with scaling: {accuracy_with_scaling:.4f}")

Output:

First few rows of the iris dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Accuracy without scaling: 0.9778
Accuracy with scaling: 0.9778

In this case the iris features are all measured in centimetres and span similar ranges, so scaling makes little difference. On real-world datasets whose features sit on vastly different scales, the improvement is usually much more noticeable.
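
As a side note, scikit-learn's Pipeline bundles the scaler and the model into a single estimator, so the scaler is always fitted on whatever data you pass to fit() and never on the test set. A short sketch of the same KNN workflow:

python
from sklearn.pipeline import make_pipeline

# Scaling and classification in one estimator; fit() fits the scaler
# on the training data only, avoiding any leakage from the test set
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipeline.fit(X_train, y_train)
print(f"Pipeline accuracy: {pipeline.score(X_test, y_test):.4f}")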

Visualizing the Effect of Normalization

Visualization can help understand the impact of normalization:

python
# Create a dataset with different scales
different_scales = pd.DataFrame({
    'small_range': np.random.uniform(0, 1, 100),
    'medium_range': np.random.uniform(0, 100, 100),
    'large_range': np.random.uniform(0, 10000, 100)
})

# Plot before normalization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for column in different_scales.columns:
    plt.hist(different_scales[column], alpha=0.5, label=column)
plt.legend()
plt.title('Before Normalization')
plt.yscale('log') # Using log scale to see all histograms

# Apply Min-Max scaling
normalized_scales = min_max_normalize(different_scales)

# Plot after normalization
plt.subplot(1, 2, 2)
for column in normalized_scales.columns:
    plt.hist(normalized_scales[column], alpha=0.5, label=column)
plt.legend()
plt.title('After Min-Max Normalization')

plt.tight_layout()
plt.savefig('normalization_effect.png') # Save for documentation
plt.show()

When to Use Each Normalization Method

  • Min-Max Scaling: Use when you need values in a bounded interval and the distribution isn't necessarily Gaussian.

  • Z-score Normalization: Use when your data approximates a normal distribution and when outliers are legitimate data points.

  • Robust Scaling: Use when your data contains outliers that you don't want to influence the scaling (see the short demo after this list).
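
To make the outlier point concrete, here is a short demo with made-up numbers that compares Min-Max and robust scaling on a column containing one extreme value. The outlier squeezes the Min-Max result into a narrow band near 0, while robust scaling keeps the regular values usefully spread out:

python
# One extreme outlier among otherwise similar values (illustrative data)
with_outlier = pd.DataFrame({'income': [30, 32, 35, 38, 40, 500]})

print("Min-Max scaled:")
print(min_max_normalize(with_outlier))  # regular values end up squeezed near 0

print("\nRobust scaled:")
print(robust_scale(with_outlier))  # regular values keep a useful spread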

Tips for Data Normalization

  1. Always normalize features, not targets (unless required by the algorithm)
  2. Fit the scaler on the training data only, then apply the same fitted transformation to the test data; fitting on the combined data leaks information from the test set
  3. Store your scaler objects so you can apply the exact same transformation to new data (see the sketch after this list)
  4. Consider domain knowledge when choosing a normalization technique
  5. Check if normalization improves your model — it's not always necessary
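
For tip 3, one common approach is to persist the fitted scaler with joblib so that exactly the same transformation can be applied to data that arrives later. A minimal sketch, assuming the X_train and X_test splits from the iris example above (the file name is arbitrary):

python
import joblib
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data and save it to disk
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, 'standard_scaler.joblib')  # file name chosen for this example

# Later, or in another process: load the scaler and transform new data
loaded_scaler = joblib.load('standard_scaler.joblib')
X_new_scaled = loaded_scaler.transform(X_test)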

Summary

In this tutorial, you learned:

  • Why data normalization is important for machine learning and data analysis
  • Different normalization techniques (Min-Max, Z-score, Robust scaling)
  • How to implement these techniques in pandas and scikit-learn
  • When to use each normalization method
  • How to normalize specific columns in a mixed dataset
  • A real-world application of normalization in a machine learning workflow

Normalization is a fundamental step in data preprocessing that can significantly improve the performance of your models. By understanding when and how to apply different normalization techniques, you'll be better equipped to handle various data analysis challenges.

Exercises

  1. Load the California Housing dataset from scikit-learn (fetch_california_housing; the older Boston Housing dataset has been removed from recent scikit-learn releases) and normalize its features using Min-Max scaling.
  2. Compare the performance of a linear regression model on normalized vs. non-normalized data.
  3. Create a custom normalizer function that maps values to a range of [-1, 1] instead of [0, 1].
  4. Apply different normalizers to the Diabetes dataset and visualize how each method affects the distribution.
  5. Experiment with column-wise and row-wise normalization and discuss the differences.

Happy normalizing!


