Pandas Data Cleaning Pipeline
Data cleaning is rarely a one-step process. In real-world scenarios, you'll often need to perform multiple operations on your data to get it into a usable state. Building an effective data cleaning pipeline allows you to transform raw, messy data into clean, analysis-ready data in a systematic and repeatable way.
Understanding Data Cleaning Pipelines
A data cleaning pipeline is a sequence of operations applied to raw data to prepare it for analysis. With pandas, we can create powerful and flexible pipelines that handle common data issues like missing values, duplicates, incorrect data types, and more.
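Conceptually, a pipeline is just a chain of functions, each taking a DataFrame and returning a cleaner one. As a minimal sketch, assuming every step follows that convention (the step names in the usage comment are placeholders):
def run_pipeline(df, steps):
    """Apply each cleaning function to the data in sequence."""
    for step in steps:
        df = step(df)
    return df
# Usage (placeholder step names):
# cleaned = run_pipeline(raw_df, [drop_bad_rows, fix_types, remove_duplicates])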
Let's learn how to build effective data cleaning pipelines with pandas.
Setting Up Our Environment
First, let's import the necessary libraries:
import pandas as pd
import numpy as np
Creating a Basic Data Cleaning Pipeline
Let's create a sample messy dataset to work with:
# Create a sample messy dataset
data = {
    'customer_id': ['001', '002', np.nan, '004', '005', '006', '006'],
    'age': [25, 'unknown', 35, 40, np.nan, 45, 45],
    'income': ['50,000', '75000', '60000', np.nan, '90,000', '65,000', '65,000'],
    'signup_date': ['2020-01-15', '2020/02/20', '20-03-2020', '2020.04.05', np.nan, '2020-06-10', '2020-06-10']
}
df = pd.DataFrame(data)
print("Original dataset:")
print(df)
This will output:
Original dataset:
  customer_id      age  income  signup_date
0         001       25  50,000   2020-01-15
1         002  unknown   75000   2020/02/20
2         NaN       35   60000   20-03-2020
3         004       40     NaN   2020.04.05
4         005      NaN  90,000          NaN
5         006       45  65,000   2020-06-10
6         006       45  65,000   2020-06-10
Now, let's build a data cleaning pipeline step by step:
Step 1: Handle Missing Values
def handle_missing_values(df):
    """Handle missing values in the dataset."""
    # Work on a copy so we never mutate the caller's dataframe
    df = df.copy()
    # Compute the median age from all numeric entries before dropping any rows
    numeric_ages = pd.to_numeric(df['age'], errors='coerce')
    median_age = numeric_ages.median()
    # Drop rows where customer_id is missing (key identifier)
    df = df.dropna(subset=['customer_id'])
    # Coerce non-numeric entries such as 'unknown' to NaN, then fill with the median
    df['age'] = pd.to_numeric(df['age'], errors='coerce').fillna(median_age)
    # For demonstration, we leave missing income and signup_date as is
    return df
# Apply the function
df_cleaned = handle_missing_values(df)
print("\nAfter handling missing values:")
print(df_cleaned)
Output:
After handling missing values:
  customer_id   age  income  signup_date
0         001  25.0  50,000   2020-01-15
1         002  40.0   75000   2020/02/20
3         004  40.0     NaN   2020.04.05
4         005  40.0  90,000          NaN
5         006  45.0  65,000   2020-06-10
6         006  45.0  65,000   2020-06-10
Step 2: Standardize Formats
def standardize_formats(df):
    """Standardize formats for income and dates."""
    df = df.copy()
    # Clean income: remove commas and convert to numeric
    df['income'] = df['income'].str.replace(',', '', regex=False).astype(float)
    # Standardize date separators; regex=False makes '.' a literal dot,
    # not the regex wildcard (the default changed across pandas versions)
    df['signup_date'] = (
        df['signup_date']
        .str.replace('/', '-', regex=False)
        .str.replace('.', '-', regex=False)
    )
    # Convert all dates to yyyy-mm-dd format
    def standardize_date(date_str):
        if pd.isna(date_str):
            return np.nan
        parts = date_str.split('-')
        if len(parts) != 3:
            return np.nan
        # Check whether the year comes first or last
        if len(parts[0]) == 4:  # yyyy-mm-dd
            return date_str
        elif len(parts[2]) == 4:  # dd-mm-yyyy
            return f"{parts[2]}-{parts[1]}-{parts[0]}"
        else:
            return np.nan
    df['signup_date'] = df['signup_date'].apply(standardize_date)
    return df
# Apply the function
df_cleaned = standardize_formats(df_cleaned)
print("\nAfter standardizing formats:")
print(df_cleaned)
Output:
After standardizing formats:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002  40.0  75000.0   2020-02-20
3         004  40.0      NaN   2020-04-05
4         005  40.0  90000.0          NaN
5         006  45.0  65000.0   2020-06-10
6         006  45.0  65000.0   2020-06-10
Step 3: Remove Duplicates
def remove_duplicates(df):
    """Remove duplicate entries based on customer_id."""
    df = df.drop_duplicates(subset=['customer_id'], keep='first')
    return df
# Apply the function
df_cleaned = remove_duplicates(df_cleaned)
print("\nAfter removing duplicates:")
print(df_cleaned)
Output:
After removing duplicates:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002  40.0  75000.0   2020-02-20
3         004  40.0      NaN   2020-04-05
4         005  40.0  90000.0          NaN
5         006  45.0  65000.0   2020-06-10
Step 4: Validate Data
def validate_data(df):
    """Validate data and create new features if needed."""
    df = df.copy()
    # Ensure age is within a reasonable range
    df.loc[df['age'] < 18, 'age'] = np.nan
    df.loc[df['age'] > 100, 'age'] = np.nan
    # Create a boolean column indicating whether income data is available
    df['has_income'] = df['income'].notna()
    # Convert signup_date to datetime
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    return df
# Apply the function
df_cleaned = validate_data(df_cleaned)
print("\nAfter validating data:")
print(df_cleaned)
Output:
After validating data:
  customer_id   age   income  signup_date  has_income
0         001  25.0  50000.0   2020-01-15        True
1         002  40.0  75000.0   2020-02-20        True
3         004  40.0      NaN   2020-04-05       False
4         005  40.0  90000.0          NaT        True
5         006  45.0  65000.0   2020-06-10        True
Creating a Complete Data Cleaning Pipeline
Now, let's put all these steps together into a complete data cleaning pipeline:
def data_cleaning_pipeline(df):
    """Complete data cleaning pipeline."""
    # Create a copy to avoid modifying the original dataframe
    df_cleaned = df.copy()
    # Apply each cleaning step
    df_cleaned = handle_missing_values(df_cleaned)
    df_cleaned = standardize_formats(df_cleaned)
    df_cleaned = remove_duplicates(df_cleaned)
    df_cleaned = validate_data(df_cleaned)
    # Reset index for clean output
    df_cleaned = df_cleaned.reset_index(drop=True)
    return df_cleaned
# Apply the complete pipeline to the original dataframe
final_df = data_cleaning_pipeline(df)
print("\nFinal cleaned dataset:")
print(final_df)
Output:
Final cleaned dataset:
  customer_id   age   income  signup_date  has_income
0         001  25.0  50000.0   2020-01-15        True
1         002  40.0  75000.0   2020-02-20        True
2         004  40.0      NaN   2020-04-05       False
3         005  40.0  90000.0          NaT        True
4         006  45.0  65000.0   2020-06-10        True
Real-World Application: Cleaning E-commerce Data
Let's apply our pipeline concept to a more realistic scenario. Imagine we have e-commerce transaction data with various issues:
# Create a sample e-commerce dataset
ecommerce_data = {
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005', 'T004', 'T007'],
    'customer_id': ['C123', 'C456', 'C789', 'C101', np.nan, 'C101', 'C202'],
    'product': ['Laptop', 'Phone', 'Headphones', 'Tablet', 'Monitor', 'Tablet', np.nan],
    'quantity': [1, 2, '3pcs', 1, np.nan, 1, 5],
    'price': ['$999.99', '$599.5', '129.99', '$349', '249.50', '$349', '$59.99'],
    'date': ['2023-01-15', '2023/01/20', '01-25-2023', '2023.01.30', np.nan, '2023-01-30', '2023-02-05']
}
ecom_df = pd.DataFrame(ecommerce_data)
print("Original e-commerce dataset:")
print(ecom_df)
Output:
Original e-commerce dataset:
  transaction_id  customer_id     product  quantity    price        date
0            T001         C123      Laptop         1  $999.99  2023-01-15
1            T002         C456       Phone         2   $599.5  2023/01/20
2            T003         C789  Headphones      3pcs   129.99  01-25-2023
3            T004         C101      Tablet         1     $349  2023.01.30
4            T005          NaN     Monitor       NaN   249.50         NaN
5            T004         C101      Tablet         1     $349  2023-01-30
6            T007         C202         NaN         5   $59.99  2023-02-05
Let's build a specialized pipeline for this e-commerce data:
def clean_ecommerce_data(df):
    """Pipeline for cleaning e-commerce data."""
    df_cleaned = df.copy()
    # Step 1: Handle missing values in the key identifier columns
    df_cleaned = df_cleaned.dropna(subset=['transaction_id', 'customer_id'])
    # Step 2: Remove duplicates based on transaction_id
    df_cleaned = df_cleaned.drop_duplicates(subset=['transaction_id'], keep='first')
    # Step 3: Standardize quantity (extract the digits, convert to numeric)
    df_cleaned['quantity'] = (
        df_cleaned['quantity'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)
    )
    # Step 4: Clean price column (remove the literal $ and convert to numeric)
    df_cleaned['price'] = (
        df_cleaned['price'].astype(str).str.replace('$', '', regex=False).astype(float)
    )
    # Step 5: Standardize date format: normalize separators, then try
    # yyyy-mm-dd first and fall back to mm-dd-yyyy for the rest
    dates = (
        df_cleaned['date']
        .str.replace('/', '-', regex=False)
        .str.replace('.', '-', regex=False)
    )
    parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
    df_cleaned['date'] = parsed.fillna(pd.to_datetime(dates, format='%m-%d-%Y', errors='coerce'))
    # Step 6: Create total_amount column
    df_cleaned['total_amount'] = df_cleaned['quantity'] * df_cleaned['price']
    return df_cleaned
# Apply the pipeline
ecom_cleaned = clean_ecommerce_data(ecom_df)
print("\nCleaned e-commerce dataset:")
print(ecom_cleaned)
Output:
Cleaned e-commerce dataset:
  transaction_id  customer_id     product  quantity   price        date  total_amount
0            T001         C123      Laptop       1.0  999.99  2023-01-15        999.99
1            T002         C456       Phone       2.0  599.50  2023-01-20       1199.00
2            T003         C789  Headphones       3.0  129.99  2023-01-25        389.97
3            T004         C101      Tablet       1.0  349.00  2023-01-30        349.00
6            T007         C202         NaN       5.0   59.99  2023-02-05        299.95
Best Practices for Data Cleaning Pipelines
When building data cleaning pipelines with pandas, keep these best practices in mind:
- Document each step: Make sure each function has clear docstrings explaining what it does
- Make pipelines modular: Break down complex cleaning tasks into smaller functions
- Handle errors gracefully: Use try/except blocks for operations that might fail (see the sketch after this list)
- Validate results: Check that your data makes sense after cleaning
- Create logs: Keep track of what changes your pipeline makes to the data
- Preserve raw data: Always work with a copy of your original data
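As a sketch of what graceful error handling and logging can look like, here is an illustrative wrapper (not part of the pipeline above) that logs row counts for each step and catches failures:
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('cleaning_pipeline')
def logged_step(step, df):
    """Run one cleaning step, logging row counts and catching failures."""
    rows_before = len(df)
    try:
        df = step(df)
    except Exception:
        # Log the failure and return the data unchanged rather than crashing
        logger.exception("Step %s failed; returning data unchanged", step.__name__)
        return df
    logger.info("%s: %d rows -> %d rows", step.__name__, rows_before, len(df))
    return df
# Example: df_cleaned = logged_step(remove_duplicates, df_cleaned)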
Creating a More Advanced Pipeline Using Method Chaining
Pandas supports method chaining, which can make your pipelines more concise:
def advanced_pipeline(df):
    """Advanced pipeline using method chaining."""
    return (df
        .dropna(subset=['customer_id'])
        .assign(
            age=lambda x: pd.to_numeric(x['age'], errors='coerce'),
            income=lambda x: pd.to_numeric(x['income'].str.replace(',', '', regex=False), errors='coerce'),
            signup_date=lambda x: pd.to_datetime(x['signup_date'].str.replace('[/.]', '-', regex=True), errors='coerce')
        )
        .drop_duplicates(subset=['customer_id'])
        .reset_index(drop=True)
    )
# Apply the advanced pipeline to our original dataset
advanced_clean_df = advanced_pipeline(df)
print("\nCleaned with advanced pipeline:")
print(advanced_clean_df)
Output:
Cleaned with advanced pipeline:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002   NaN  75000.0   2020-02-20
2         004  40.0      NaN   2020-04-05
3         005   NaN  90000.0          NaT
4         006  45.0  65000.0   2020-06-10
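Method chaining also composes with your own step functions via DataFrame.pipe, which passes the dataframe to each function in turn. The following is equivalent to calling data_cleaning_pipeline(df) from earlier, written as one fluent expression:
final_df = (df
    .copy()
    .pipe(handle_missing_values)
    .pipe(standardize_formats)
    .pipe(remove_duplicates)
    .pipe(validate_data)
    .reset_index(drop=True)
)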
Summary
Data cleaning pipelines are essential for transforming raw, messy data into clean, analysis-ready formats. In this tutorial, we've learned how to:
- Build modular data cleaning functions for specific tasks
- Chain these functions together to create complete pipelines
- Apply our pipelines to real-world-like data
- Use method chaining for more concise pipeline definition
By organizing your data cleaning code into structured pipelines, you make your work:
- More maintainable
- Easier to debug
- Reusable across projects
- Clearer to understand
Remember that data cleaning is often iterative. You might need to run your pipeline, check the results, and refine your cleaning steps as you better understand your data's quirks and issues.
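For example, a few lightweight sanity checks after a pipeline run can catch problems early (a minimal sketch, assuming the column names used throughout this tutorial):
# The key identifier should be present and unique after cleaning
assert final_df['customer_id'].notna().all()
assert final_df['customer_id'].is_unique
# Ages, where present, should fall in the validated range
assert final_df['age'].dropna().between(18, 100).all()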
Exercises
To solidify your understanding, try these exercises:
- Modify the e-commerce cleaning pipeline to handle additional edge cases (e.g., products with "pack of X" in the name)
- Create a pipeline that cleans and standardizes address data (separating street, city, state, zip)
- Build a pipeline for cleaning text data (removing special characters, standardizing case, etc.)
- Implement logging in your pipeline to record what changes were made to each row
Additional Resources
- Pandas Documentation on Data Cleaning
- Pandas Method Chaining
- Real Python: Data Cleaning with Python
- Towards Data Science: Building Robust Data Pipelines
With these techniques, you're now equipped to handle various data cleaning challenges efficiently and systematically!