
Pandas Data Quality Checks

When working with data in Python, ensuring your data is high quality is a critical step before any meaningful analysis can begin. In this tutorial, you'll learn how to implement essential data quality checks using the Pandas library to identify and fix common issues in your datasets.

Introduction to Data Quality

Data quality refers to how well your data serves its intended purpose. High-quality data should be:

  • Complete: No missing values where values are expected
  • Accurate: Free of errors and correctly representing the real world
  • Consistent: Following the same format and rules throughout the dataset
  • Unique: Free of duplicates (unless duplicates are expected)
  • Valid: Conforming to defined business rules or constraints

Let's explore how to check and ensure these qualities using Pandas.
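
Before we set anything up, here is a quick, hedged preview of the kind of one-line Pandas check each dimension maps to, using a small throwaway DataFrame. We'll build a richer example and walk through each check in detail below.

python
import pandas as pd

# A minimal preview on a throwaway DataFrame; each line maps roughly to
# one quality dimension. The real checks are covered in detail below.
preview = pd.DataFrame({'age': [34, 28, 28, None, 1200]})

print(preview.isnull().sum())           # Complete: how many values are missing?
print(preview['age'].between(0, 120))   # Accurate/Valid: are values in a plausible range?
print(preview.dtypes)                   # Consistent: is each column a single, sensible type?
print(preview.duplicated().sum())       # Unique: how many rows are exact duplicates?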

Setting Up Your Environment

Before we begin, make sure you have Pandas installed:

bash
pip install pandas

Let's import the libraries we'll use and create a sample dataset:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For reproducibility
np.random.seed(42)

Creating a Sample Dataset

We'll work with a simple customer dataset that deliberately contains several quality issues:

python
# Create a sample dataset with some quality issues
data = {
    'customer_id': [1001, 1002, 1002, 1003, 1004, 1005, None, 1007],
    'name': ['John Smith', 'Jane Doe', 'Jane Doe', 'Bob Johnson', 'Alice Brown', None, 'Chris Lee', 'Diana Wang'],
    'age': [34, 28, 28, -5, 47, 31, 29, 1200],
    'email': ['john.smith@example.com', 'jane@example', 'jane@example', 'bob.johnson@example.com', 'alice.brown@example.com', 'customer1005@example.com', None, 'diana.wang@example.com'],
    'purchase_amount': [125.50, 240.00, 240.00, 550.75, None, 125.45, 350.00, 550.75],
    'purchase_date': ['2023-01-15', '2023-01-16', '2023-01-16', '2023/01/17', '2023-01-20', None, '2023-01-25', '2023-01-30']
}

df = pd.DataFrame(data)
print("Sample dataset:")
print(df)

This dataset has several quality issues:

  • Missing values
  • Duplicate records
  • Invalid values (negative age, unrealistic age)
  • Inconsistent date formats
  • Invalid email format

Let's implement data quality checks to identify and fix these issues.

1. Checking for Missing Values

Missing values can significantly impact your analysis. Let's identify them:

python
# Count missing values per column
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values in Dataset')
plt.tight_layout()
plt.show()

# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of missing values:")
print(missing_percentage)

Handling Missing Values

Based on your findings, you can decide how to handle missing values:

python
# Fill missing numeric values with mean
df['purchase_amount'] = df['purchase_amount'].fillna(df['purchase_amount'].mean())

# Fill missing categorical values with a placeholder
df['name'] = df['name'].fillna('Unknown')
df['purchase_date'] = df['purchase_date'].fillna('Unknown')

# Drop rows with missing critical information (like customer_id)
df = df.dropna(subset=['customer_id'])

# Alternatively, drop columns with too many missing values
# threshold = int(len(df) * 0.5)  # 50% threshold (thresh must be an integer)
# df = df.dropna(axis=1, thresh=threshold)

print("\nDataset after handling missing values:")
print(df)
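
As an aside, the mean is only one option. Here is a hedged sketch of two alternatives, shown on a fresh copy of the raw data so they don't interfere with the cleaning above: the median (more robust to outliers) and a forward fill (sometimes useful for ordered data).

python
# Alternative imputation strategies on a fresh copy of the raw data
df_alt = pd.DataFrame(data)

# Median is more robust to outliers than the mean
df_alt['purchase_amount'] = df_alt['purchase_amount'].fillna(df_alt['purchase_amount'].median())

# Forward fill can make sense when rows are ordered (e.g., by time)
df_alt['purchase_date'] = df_alt['purchase_date'].ffill()

print(df_alt[['purchase_amount', 'purchase_date']])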

2. Checking for Duplicates

Duplicate records can skew your analysis by giving more weight to certain data points:

python
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

# Identify which rows are duplicates
print("Duplicate rows:")
print(df[df.duplicated(keep='first')])

# Check for duplicates based on specific columns
duplicate_customers = df.duplicated(subset=['customer_id', 'name'], keep='first').sum()
print(f"Number of duplicate customers: {duplicate_customers}")

# Remove duplicates
df_no_dupes = df.drop_duplicates(keep='first')
print("\nDataset after removing duplicates:")
print(df_no_dupes)
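
One more option worth knowing: `keep=False` marks every copy of a duplicated row, which is handy when you want to review all of them before deciding which one to keep. A small sketch:

python
# Show every copy of rows that share a customer_id, not just the extra ones
all_copies = df[df.duplicated(subset=['customer_id'], keep=False)]
print("\nAll copies of duplicated customers:")
print(all_copies)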

3. Checking for Invalid Values

Invalid values can be numeric (out of range) or categorical (wrong format):

python
# Check for invalid age values (negative or unrealistically high)
invalid_ages = df_no_dupes[(df_no_dupes['age'] < 0) | (df_no_dupes['age'] > 120)]
print("\nRows with invalid ages:")
print(invalid_ages)

# Correct invalid ages
df_no_dupes.loc[df_no_dupes['age'] < 0, 'age'] = np.nan # Set negative ages to NaN
df_no_dupes.loc[df_no_dupes['age'] > 120, 'age'] = np.nan # Set unrealistic ages to NaN

# Fill NaN ages with median age
median_age = df_no_dupes['age'].median()
df_no_dupes['age'] = df_no_dupes['age'].fillna(median_age)

print("\nAges after correction:")
print(df_no_dupes[['name', 'age']])
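
If you prefer a more compact form, the two `.loc` assignments above can optionally be collapsed into a single `between` check:

python
# Optional, equivalent one-liner: anything outside the 0-120 range becomes NaN
# df_no_dupes.loc[~df_no_dupes['age'].between(0, 120), 'age'] = np.nan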

4. Checking Data Types and Formats

Ensuring consistent data types and formats is crucial for proper analysis:

python
# Check data types
print("\nCurrent data types:")
print(df_no_dupes.dtypes)

# Check email format (simple check for '@' and '.')
def is_valid_email(email):
    if pd.isnull(email):
        return False
    return '@' in email and '.' in email

# Apply email validation
df_no_dupes['valid_email'] = df_no_dupes['email'].apply(is_valid_email)
print("\nInvalid emails:")
print(df_no_dupes[~df_no_dupes['valid_email']])

# Standardize date format
def standardize_date(date):
    if pd.isnull(date) or date == 'Unknown':
        return None
    # Replace slash with dash
    return date.replace('/', '-')

df_no_dupes['purchase_date'] = df_no_dupes['purchase_date'].apply(standardize_date)

# Convert to datetime
df_no_dupes['purchase_date'] = pd.to_datetime(df_no_dupes['purchase_date'], errors='coerce')

print("\nDates after standardization:")
print(df_no_dupes[['name', 'purchase_date']])
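
The '@ and .' check above is deliberately simple. If you want something a bit stricter, here is a hedged regex-based variant (illustrative only, not a full RFC 5322 validator); the `valid_email_strict` column name is just an example.

python
import re

# Stricter (but still illustrative) email check: one '@', no whitespace,
# and at least one '.' after the '@'
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

def is_valid_email_strict(email):
    if pd.isnull(email):
        return False
    return bool(EMAIL_PATTERN.match(email))

df_no_dupes['valid_email_strict'] = df_no_dupes['email'].apply(is_valid_email_strict)
print(df_no_dupes[['email', 'valid_email_strict']])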

5. Statistical Outlier Detection

Outliers can significantly impact statistical analyses:

python
# Check for outliers in purchase amount using boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df_no_dupes['purchase_amount'])
plt.title('Purchase Amount Distribution')
plt.tight_layout()
plt.show()

# Detect outliers using IQR method
Q1 = df_no_dupes['purchase_amount'].quantile(0.25)
Q3 = df_no_dupes['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df_no_dupes[(df_no_dupes['purchase_amount'] < lower_bound) |
                       (df_no_dupes['purchase_amount'] > upper_bound)]

print("\nOutliers in purchase amount:")
print(outliers[['name', 'purchase_amount']])
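
The IQR rule is one common convention; another is the z-score: flag values more than about three standard deviations from the mean. A quick sketch (the 3-sigma cutoff is a rule of thumb, not a hard rule):

python
# Z-score based outlier check
amount = df_no_dupes['purchase_amount']
z_scores = (amount - amount.mean()) / amount.std()

print("\nOutliers by z-score:")
print(df_no_dupes.loc[z_scores.abs() > 3, ['name', 'purchase_amount']])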

6. Creating a Data Quality Report

It's helpful to create a summary report of your data quality checks:

python
def generate_data_quality_report(df):
    """Generate a basic data quality report for a dataframe"""
    quality_report = {}

    # Record count
    quality_report['total_records'] = len(df)

    # Missing values
    quality_report['missing_values'] = df.isnull().sum().to_dict()
    quality_report['missing_percentage'] = ((df.isnull().sum() / len(df)) * 100).to_dict()

    # Duplicate records
    quality_report['duplicate_records'] = df.duplicated().sum()

    # Data types
    quality_report['data_types'] = df.dtypes.astype(str).to_dict()

    # Unique values for categorical columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    quality_report['unique_values'] = {col: df[col].nunique() for col in categorical_columns}

    # Basic statistics for numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    quality_report['numeric_stats'] = df[numeric_columns].describe().to_dict()

    return quality_report

# Generate and print the report
quality_report = generate_data_quality_report(df_no_dupes)
print("\nData Quality Report:")
for section, content in quality_report.items():
    print(f"\n{section.upper()}:")
    if isinstance(content, dict):
        for key, value in content.items():
            print(f" {key}: {value}")
    else:
        print(f" {content}")

7. Automating Data Quality Checks

For ongoing data quality monitoring, you can create a function that performs standard checks:

python
def run_data_quality_checks(df, rules=None):
    """
    Run automated data quality checks based on defined rules

    Parameters:
        df (DataFrame): The pandas dataframe to check
        rules (dict): Dictionary of rules to check

    Returns:
        dict: Results of the data quality checks
    """
    if rules is None:
        rules = {}

    results = {"passed": [], "failed": []}

    # Check for missing values
    for column in df.columns:
        missing_pct = (df[column].isnull().sum() / len(df)) * 100
        threshold = rules.get(f"{column}_missing_threshold", 5)  # Default 5% threshold

        check_name = f"Missing values check: {column} (threshold: {threshold}%)"
        if missing_pct <= threshold:
            results["passed"].append(f"{check_name} - {missing_pct:.2f}%")
        else:
            results["failed"].append(f"{check_name} - {missing_pct:.2f}%")

    # Check for duplicates
    duplicate_pct = (df.duplicated().sum() / len(df)) * 100
    threshold = rules.get("duplicate_threshold", 1)  # Default 1% threshold

    check_name = f"Duplicate records check (threshold: {threshold}%)"
    if duplicate_pct <= threshold:
        results["passed"].append(f"{check_name} - {duplicate_pct:.2f}%")
    else:
        results["failed"].append(f"{check_name} - {duplicate_pct:.2f}%")

    # Add more checks as needed for your specific requirements

    return results

# Define rules
my_rules = {
    "customer_id_missing_threshold": 0,  # No missing customer IDs allowed
    "name_missing_threshold": 10,        # Up to 10% missing names allowed
    "duplicate_threshold": 0             # No duplicates allowed
}

# Run the checks
check_results = run_data_quality_checks(df_no_dupes, my_rules)

print("\nAutomated Data Quality Check Results:")
print("PASSED CHECKS:")
for check in check_results["passed"]:
    print(f"✅ {check}")

print("\nFAILED CHECKS:")
for check in check_results["failed"]:
    print(f"❌ {check}")

8. Real-world Application: Customer Data Validation

Let's put everything together in a real-world scenario where we validate customer data before importing it into a CRM system:

python
def clean_customer_data(df):
    """
    Comprehensive function to clean and validate customer data
    """
    # Create a copy to avoid modifying the original
    clean_df = df.copy()

    # 1. Remove duplicates based on customer_id
    clean_df = clean_df.drop_duplicates(subset=['customer_id'], keep='first')

    # 2. Handle missing values
    # Drop rows with missing customer_id (critical field)
    clean_df = clean_df.dropna(subset=['customer_id'])

    # Fill other missing values appropriately
    clean_df['name'] = clean_df['name'].fillna('Unknown')
    clean_df['email'] = clean_df['email'].fillna('unknown@example.com')
    clean_df['purchase_amount'] = clean_df['purchase_amount'].fillna(clean_df['purchase_amount'].mean())

    # 3. Validate and correct data types
    # Convert customer_id to integer
    clean_df['customer_id'] = clean_df['customer_id'].astype(int)

    # 4. Validate age ranges
    clean_df.loc[clean_df['age'] < 0, 'age'] = np.nan
    clean_df.loc[clean_df['age'] > 120, 'age'] = np.nan
    clean_df['age'] = clean_df['age'].fillna(clean_df['age'].median())
    clean_df['age'] = clean_df['age'].astype(int)

    # 5. Validate email format
    clean_df['email_valid'] = clean_df['email'].apply(is_valid_email)

    # 6. Standardize dates
    clean_df['purchase_date'] = clean_df['purchase_date'].apply(standardize_date)
    clean_df['purchase_date'] = pd.to_datetime(clean_df['purchase_date'], errors='coerce')

    # Flag records with quality issues
    clean_df['has_quality_issues'] = (
        (~clean_df['email_valid']) |
        (clean_df['name'] == 'Unknown')
    )

    return clean_df

# Apply the cleaning function
cleaned_df = clean_customer_data(df)

print("\nCleaned Customer Data:")
print(cleaned_df)

print("\nRecords with quality issues that need review:")
print(cleaned_df[cleaned_df['has_quality_issues']])
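
In a real CRM import, a common hand-off step (an assumed workflow, not something the tutorial prescribes) is to export the flagged records so someone can review them before loading. The file name below is just an example.

python
# Write records that need a human look to a CSV for manual review
cleaned_df[cleaned_df['has_quality_issues']].to_csv('records_for_review.csv', index=False)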

Summary

In this tutorial, you've learned how to implement essential data quality checks using Pandas:

  1. Identifying missing values and applying appropriate handling strategies
  2. Detecting and removing duplicates to ensure data integrity
  3. Validating data to ensure it meets business rules and constraints
  4. Standardizing formats for consistency
  5. Detecting outliers that might skew your analysis
  6. Creating data quality reports to summarize dataset health
  7. Automating checks for ongoing monitoring
  8. Applying these techniques to a real-world customer data scenario

By implementing these data quality checks, you'll ensure your data is clean, consistent, and ready for analysis, modeling, or reporting.

Practice Exercises

To reinforce your learning, try these exercises:

  1. Create a data quality check function for a dataset of your choice
  2. Implement a validation rule for phone numbers in the customer dataset
  3. Create a visualization that shows the distribution of data quality issues
  4. Implement a function that scores data quality on a scale from 0-100
  5. Create a data validation pipeline that reads data from a CSV, performs quality checks, and outputs a cleaned CSV and a quality report

Happy data cleaning!


