Pandas Data Cleaning Pipeline
Data cleaning is rarely a one-step process. In real-world scenarios, you'll often need to perform multiple operations on your data to get it into a usable state. Building an effective data cleaning pipeline allows you to transform raw, messy data into clean, analysis-ready data in a systematic and repeatable way.
Understanding Data Cleaning Pipelines
A data cleaning pipeline is a sequence of operations applied to raw data to prepare it for analysis. With pandas, we can create powerful and flexible pipelines that handle common data issues like missing values, duplicates, incorrect data types, and more.
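Conceptually, a pipeline is just a chain of functions, each taking a DataFrame and returning a cleaner one. As a minimal sketch, assuming every step follows that convention (the step names in the usage comment are placeholders):
def run_pipeline(df, steps):
    """Apply each cleaning function to the data in sequence."""
    for step in steps:
        df = step(df)
    return df
# Usage (placeholder step names):
# cleaned = run_pipeline(raw_df, [drop_bad_rows, fix_types, remove_duplicates])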
Let's learn how to build effective data cleaning pipelines with pandas.
Setting Up Our Environment
First, let's import the necessary libraries:
import pandas as pd
import numpy as np
Creating a Basic Data Cleaning Pipeline
Let's create a sample messy dataset to work with:
# Create a sample messy dataset
data = {
    'customer_id': ['001', '002', np.nan, '004', '005', '006', '006'],
    'age': [25, 'unknown', 35, 40, np.nan, 45, 45],
    'income': ['50,000', '75000', '60000', np.nan, '90,000', '65,000', '65,000'],
    'signup_date': ['2020-01-15', '2020/02/20', '20-03-2020', '2020.04.05', np.nan, '2020-06-10', '2020-06-10']
}
df = pd.DataFrame(data)
print("Original dataset:")
print(df)
This will output:
Original dataset:
  customer_id      age  income  signup_date
0         001       25  50,000   2020-01-15
1         002  unknown   75000   2020/02/20
2         NaN       35   60000   20-03-2020
3         004       40     NaN   2020.04.05
4         005      NaN  90,000          NaN
5         006       45  65,000   2020-06-10
6         006       45  65,000   2020-06-10
Now, let's build a data cleaning pipeline step by step:
Step 1: Handle Missing Values
def handle_missing_values(df):
    """Handle missing values in the dataset."""
    # Work on a copy so we never mutate the caller's dataframe
    df = df.copy()
    # Compute the median age from all numeric entries before dropping any rows
    numeric_ages = pd.to_numeric(df['age'], errors='coerce')
    median_age = numeric_ages.median()
    # Drop rows where customer_id is missing (key identifier)
    df = df.dropna(subset=['customer_id'])
    # Coerce non-numeric entries such as 'unknown' to NaN, then fill with the median
    df['age'] = pd.to_numeric(df['age'], errors='coerce').fillna(median_age)
    # For demonstration, we leave missing income and signup_date as is
    return df
# Apply the function
df_cleaned = handle_missing_values(df)
print("\nAfter handling missing values:")
print(df_cleaned)
Output:
After handling missing values:
  customer_id   age  income  signup_date
0         001  25.0  50,000   2020-01-15
1         002  40.0   75000   2020/02/20
3         004  40.0     NaN   2020.04.05
4         005  40.0  90,000          NaN
5         006  45.0  65,000   2020-06-10
6         006  45.0  65,000   2020-06-10
Step 2: Standardize Formats
def standardize_formats(df):
    """Standardize formats for income and dates."""
    df = df.copy()
    # Clean income: remove commas and convert to numeric
    df['income'] = df['income'].str.replace(',', '', regex=False).astype(float)
    # Standardize date separators; regex=False makes '.' a literal dot,
    # not the regex wildcard (the default changed across pandas versions)
    df['signup_date'] = (
        df['signup_date']
        .str.replace('/', '-', regex=False)
        .str.replace('.', '-', regex=False)
    )
    # Convert all dates to yyyy-mm-dd format
    def standardize_date(date_str):
        if pd.isna(date_str):
            return np.nan
        parts = date_str.split('-')
        if len(parts) != 3:
            return np.nan
        # Check whether the year comes first or last
        if len(parts[0]) == 4:  # yyyy-mm-dd
            return date_str
        elif len(parts[2]) == 4:  # dd-mm-yyyy
            return f"{parts[2]}-{parts[1]}-{parts[0]}"
        else:
            return np.nan
    df['signup_date'] = df['signup_date'].apply(standardize_date)
    return df
# Apply the function
df_cleaned = standardize_formats(df_cleaned)
print("\nAfter standardizing formats:")
print(df_cleaned)
Output:
After standardizing formats:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002  40.0  75000.0   2020-02-20
3         004  40.0      NaN   2020-04-05
4         005  40.0  90000.0          NaN
5         006  45.0  65000.0   2020-06-10
6         006  45.0  65000.0   2020-06-10
Step 3: Remove Duplicates
def remove_duplicates(df):
    """Remove duplicate entries based on customer_id."""
    df = df.drop_duplicates(subset=['customer_id'], keep='first')
    return df
# Apply the function
df_cleaned = remove_duplicates(df_cleaned)
print("\nAfter removing duplicates:")
print(df_cleaned)
Output:
After removing duplicates:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002  40.0  75000.0   2020-02-20
3         004  40.0      NaN   2020-04-05
4         005  40.0  90000.0          NaN
5         006  45.0  65000.0   2020-06-10
Step 4: Validate Data
def validate_data(df):
    """Validate data and create new features if needed."""
    df = df.copy()
    # Ensure age is within a reasonable range
    df.loc[df['age'] < 18, 'age'] = np.nan
    df.loc[df['age'] > 100, 'age'] = np.nan
    # Create a boolean column indicating whether income data is available
    df['has_income'] = df['income'].notna()
    # Convert signup_date to datetime
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    return df
# Apply the function
df_cleaned = validate_data(df_cleaned)
print("\nAfter validating data:")
print(df_cleaned)
Output:
After validating data:
  customer_id   age   income  signup_date  has_income
0         001  25.0  50000.0   2020-01-15        True
1         002  40.0  75000.0   2020-02-20        True
3         004  40.0      NaN   2020-04-05       False
4         005  40.0  90000.0          NaT        True
5         006  45.0  65000.0   2020-06-10        True
Creating a Complete Data Cleaning Pipeline
Now, let's put all these steps together into a complete data cleaning pipeline:
def data_cleaning_pipeline(df):
    """Complete data cleaning pipeline."""
    # Create a copy to avoid modifying the original dataframe
    df_cleaned = df.copy()
    # Apply each cleaning step
    df_cleaned = handle_missing_values(df_cleaned)
    df_cleaned = standardize_formats(df_cleaned)
    df_cleaned = remove_duplicates(df_cleaned)
    df_cleaned = validate_data(df_cleaned)
    # Reset index for clean output
    df_cleaned = df_cleaned.reset_index(drop=True)
    return df_cleaned
# Apply the complete pipeline to the original dataframe
final_df = data_cleaning_pipeline(df)
print("\nFinal cleaned dataset:")
print(final_df)
Output:
Final cleaned dataset:
  customer_id   age   income  signup_date  has_income
0         001  25.0  50000.0   2020-01-15        True
1         002  40.0  75000.0   2020-02-20        True
2         004  40.0      NaN   2020-04-05       False
3         005  40.0  90000.0          NaT        True
4         006  45.0  65000.0   2020-06-10        True
Real-World Application: Cleaning E-commerce Data
Let's apply our pipeline concept to a more realistic scenario. Imagine we have e-commerce transaction data with various issues:
# Create a sample e-commerce dataset
ecommerce_data = {
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005', 'T004', 'T007'],
    'customer_id': ['C123', 'C456', 'C789', 'C101', np.nan, 'C101', 'C202'],
    'product': ['Laptop', 'Phone', 'Headphones', 'Tablet', 'Monitor', 'Tablet', np.nan],
    'quantity': [1, 2, '3pcs', 1, np.nan, 1, 5],
    'price': ['$999.99', '$599.5', '129.99', '$349', '249.50', '$349', '$59.99'],
    'date': ['2023-01-15', '2023/01/20', '01-25-2023', '2023.01.30', np.nan, '2023-01-30', '2023-02-05']
}
ecom_df = pd.DataFrame(ecommerce_data)
print("Original e-commerce dataset:")
print(ecom_df)
Output:
Original e-commerce dataset:
  transaction_id  customer_id     product  quantity    price        date
0            T001         C123      Laptop         1  $999.99  2023-01-15
1            T002         C456       Phone         2   $599.5  2023/01/20
2            T003         C789  Headphones      3pcs   129.99  01-25-2023
3            T004         C101      Tablet         1     $349  2023.01.30
4            T005          NaN     Monitor       NaN   249.50         NaN
5            T004         C101      Tablet         1     $349  2023-01-30
6            T007         C202         NaN         5   $59.99  2023-02-05
Let's build a specialized pipeline for this e-commerce data:
def clean_ecommerce_data(df):
    """Pipeline for cleaning e-commerce data."""
    df_cleaned = df.copy()
    # Step 1: Handle missing values in the key identifier columns
    df_cleaned = df_cleaned.dropna(subset=['transaction_id', 'customer_id'])
    # Step 2: Remove duplicates based on transaction_id
    df_cleaned = df_cleaned.drop_duplicates(subset=['transaction_id'], keep='first')
    # Step 3: Standardize quantity (extract the digits, convert to numeric)
    df_cleaned['quantity'] = (
        df_cleaned['quantity'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)
    )
    # Step 4: Clean price column (remove the literal $ and convert to numeric)
    df_cleaned['price'] = (
        df_cleaned['price'].astype(str).str.replace('$', '', regex=False).astype(float)
    )
    # Step 5: Standardize date format: normalize separators, then try
    # yyyy-mm-dd first and fall back to mm-dd-yyyy for the rest
    dates = (
        df_cleaned['date']
        .str.replace('/', '-', regex=False)
        .str.replace('.', '-', regex=False)
    )
    parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
    df_cleaned['date'] = parsed.fillna(pd.to_datetime(dates, format='%m-%d-%Y', errors='coerce'))
    # Step 6: Create total_amount column
    df_cleaned['total_amount'] = df_cleaned['quantity'] * df_cleaned['price']
    return df_cleaned
# Apply the pipeline
ecom_cleaned = clean_ecommerce_data(ecom_df)
print("\nCleaned e-commerce dataset:")
print(ecom_cleaned)
Output:
Cleaned e-commerce dataset:
  transaction_id  customer_id     product  quantity   price        date  total_amount
0            T001         C123      Laptop       1.0  999.99  2023-01-15        999.99
1            T002         C456       Phone       2.0  599.50  2023-01-20       1199.00
2            T003         C789  Headphones       3.0  129.99  2023-01-25        389.97
3            T004         C101      Tablet       1.0  349.00  2023-01-30        349.00
6            T007         C202         NaN       5.0   59.99  2023-02-05        299.95
Best Practices for Data Cleaning Pipelines
When building data cleaning pipelines with pandas, keep these best practices in mind:
- Document each step: Make sure each function has clear docstrings explaining what it does
- Make pipelines modular: Break down complex cleaning tasks into smaller functions
- Handle errors gracefully: Use try/except blocks for operations that might fail (see the sketch after this list)
- Validate results: Check that your data makes sense after cleaning
- Create logs: Keep track of what changes your pipeline makes to the data
- Preserve raw data: Always work with a copy of your original data
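As a sketch of what graceful error handling and logging can look like, here is an illustrative wrapper (not part of the pipeline above) that logs row counts for each step and catches failures:
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('cleaning_pipeline')
def logged_step(step, df):
    """Run one cleaning step, logging row counts and catching failures."""
    rows_before = len(df)
    try:
        df = step(df)
    except Exception:
        # Log the failure and return the data unchanged rather than crashing
        logger.exception("Step %s failed; returning data unchanged", step.__name__)
        return df
    logger.info("%s: %d rows -> %d rows", step.__name__, rows_before, len(df))
    return df
# Example: df_cleaned = logged_step(remove_duplicates, df_cleaned)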
Creating a More Advanced Pipeline Using Method Chaining
Pandas supports method chaining, which can make your pipelines more concise:
def advanced_pipeline(df):
    """Advanced pipeline using method chaining."""
    return (df
        .dropna(subset=['customer_id'])
        .assign(
            age=lambda x: pd.to_numeric(x['age'], errors='coerce'),
            income=lambda x: pd.to_numeric(x['income'].str.replace(',', '', regex=False), errors='coerce'),
            signup_date=lambda x: pd.to_datetime(x['signup_date'].str.replace('[/.]', '-', regex=True), errors='coerce')
        )
        .drop_duplicates(subset=['customer_id'])
        .reset_index(drop=True)
    )
# Apply the advanced pipeline to our original dataset
advanced_clean_df = advanced_pipeline(df)
print("\nCleaned with advanced pipeline:")
print(advanced_clean_df)
Output:
Cleaned with advanced pipeline:
  customer_id   age   income  signup_date
0         001  25.0  50000.0   2020-01-15
1         002   NaN  75000.0   2020-02-20
2         004  40.0      NaN   2020-04-05
3         005   NaN  90000.0          NaT
4         006  45.0  65000.0   2020-06-10
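Method chaining also composes with your own step functions via DataFrame.pipe, which passes the dataframe to each function in turn. The following is equivalent to calling data_cleaning_pipeline(df) from earlier, written as one fluent expression:
final_df = (df
    .copy()
    .pipe(handle_missing_values)
    .pipe(standardize_formats)
    .pipe(remove_duplicates)
    .pipe(validate_data)
    .reset_index(drop=True)
)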
Summary
Data cleaning pipelines are essential for transforming raw, messy data into clean, analysis-ready formats. In this tutorial, we've learned how to:
- Build modular data cleaning functions for specific tasks
- Chain these functions together to create complete pipelines
- Apply our pipelines to real-world-like data
- Use method chaining for more concise pipeline definition
By organizing your data cleaning code into structured pipelines, you make your work:
- More maintainable
- Easier to debug
- Reusable across projects
- Clearer to understand
Remember that data cleaning is often iterative. You might need to run your pipeline, check the results, and refine your cleaning steps as you better understand your data's quirks and issues.
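For example, a few lightweight sanity checks after a pipeline run can catch problems early (a minimal sketch, assuming the column names used throughout this tutorial):
# The key identifier should be present and unique after cleaning
assert final_df['customer_id'].notna().all()
assert final_df['customer_id'].is_unique
# Ages, where present, should fall in the validated range
assert final_df['age'].dropna().between(18, 100).all()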
Exercises
To solidify your understanding, try these exercises:
- Modify the e-commerce cleaning pipeline to handle additional edge cases (e.g., products with "pack of X" in the name)
- Create a pipeline that cleans and standardizes address data (separating street, city, state, zip)
- Build a pipeline for cleaning text data (removing special characters, standardizing case, etc.)
- Implement logging in your pipeline to record what changes were made to each row
Additional Resources
- Pandas Documentation on Data Cleaning
- Pandas Method Chaining
- Real Python: Data Cleaning with Python
- Towards Data Science: Building Robust Data Pipelines
With these techniques, you're now equipped to handle various data cleaning challenges efficiently and systematically!