
Pandas Data Quality Checks

When working with data in Python, ensuring your data is high quality is a critical step before any meaningful analysis can begin. In this tutorial, you'll learn how to implement essential data quality checks using the Pandas library to identify and fix common issues in your datasets.

Introduction to Data Quality

Data quality refers to how well your data serves its intended purpose. High-quality data should be:

  • Complete: No missing values where values are expected
  • Accurate: Free of errors and correctly representing the real world
  • Consistent: Following the same format and rules throughout the dataset
  • Unique: Free of duplicates (unless duplicates are expected)
  • Valid: Conforming to defined business rules or constraints

Let's explore how to check and ensure these qualities using Pandas.
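
Before we set anything up, here is a quick, hedged preview of the kind of one-line Pandas check each dimension maps to, using a small throwaway DataFrame. We'll build a richer example and walk through each check in detail below.

python
import pandas as pd

# A minimal preview on a throwaway DataFrame; each line maps roughly to
# one quality dimension. The real checks are covered in detail below.
preview = pd.DataFrame({'age': [34, 28, 28, None, 1200]})

print(preview.isnull().sum())           # Complete: how many values are missing?
print(preview['age'].between(0, 120))   # Accurate/Valid: are values in a plausible range?
print(preview.dtypes)                   # Consistent: is each column a single, sensible type?
print(preview.duplicated().sum())       # Unique: how many rows are exact duplicates?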

Setting Up Your Environment

Before we begin, make sure you have Pandas installed:

bash
pip install pandas

Let's import the libraries we'll use and create a sample dataset:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For reproducibility
np.random.seed(42)

Creating a Sample Dataset

We'll work with a simple customer dataset that deliberately contains several quality issues:

python
# Create a sample dataset with some quality issues
data = {
    'customer_id': [1001, 1002, 1002, 1003, 1004, 1005, None, 1007],
    'name': ['John Smith', 'Jane Doe', 'Jane Doe', 'Bob Johnson', 'Alice Brown', None, 'Chris Lee', 'Diana Wang'],
    'age': [34, 28, 28, -5, 47, 31, 29, 1200],
    'email': ['john.smith@example.com', 'jane@example', 'jane@example', 'bob.johnson@example.com', 'alice.brown@example.com', 'customer1005@example.com', None, 'diana.wang@example.com'],
    'purchase_amount': [125.50, 240.00, 240.00, 550.75, None, 125.45, 350.00, 550.75],
    'purchase_date': ['2023-01-15', '2023-01-16', '2023-01-16', '2023/01/17', '2023-01-20', None, '2023-01-25', '2023-01-30']
}

df = pd.DataFrame(data)
print("Sample dataset:")
print(df)

This dataset has several quality issues:

  • Missing values
  • Duplicate records
  • Invalid values (negative age, unrealistic age)
  • Inconsistent date formats
  • Invalid email format

Let's implement data quality checks to identify and fix these issues.

1. Checking for Missing Values

Missing values can significantly impact your analysis. Let's identify them:

python
# Count missing values per column
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values in Dataset')
plt.tight_layout()
plt.show()

# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of missing values:")
print(missing_percentage)

Handling Missing Values

Based on your findings, you can decide how to handle missing values:

python
# Fill missing numeric values with mean
df['purchase_amount'] = df['purchase_amount'].fillna(df['purchase_amount'].mean())

# Fill missing categorical values with a placeholder
df['name'] = df['name'].fillna('Unknown')
df['purchase_date'] = df['purchase_date'].fillna('Unknown')

# Drop rows with missing critical information (like customer_id)
df = df.dropna(subset=['customer_id'])

# Alternatively, drop columns with too many missing values
# threshold = int(len(df) * 0.5)  # 50% threshold (thresh must be an integer)
# df = df.dropna(axis=1, thresh=threshold)

print("\nDataset after handling missing values:")
print(df)
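
As an aside, the mean is only one option. Here is a hedged sketch of two alternatives, shown on a fresh copy of the raw data so they don't interfere with the cleaning above: the median (more robust to outliers) and a forward fill (sometimes useful for ordered data).

python
# Alternative imputation strategies on a fresh copy of the raw data
df_alt = pd.DataFrame(data)

# Median is more robust to outliers than the mean
df_alt['purchase_amount'] = df_alt['purchase_amount'].fillna(df_alt['purchase_amount'].median())

# Forward fill can make sense when rows are ordered (e.g., by time)
df_alt['purchase_date'] = df_alt['purchase_date'].ffill()

print(df_alt[['purchase_amount', 'purchase_date']])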

2. Checking for Duplicates

Duplicate records can skew your analysis by giving more weight to certain data points:

python
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

# Identify which rows are duplicates
print("Duplicate rows:")
print(df[df.duplicated(keep='first')])

# Check for duplicates based on specific columns
duplicate_customers = df.duplicated(subset=['customer_id', 'name'], keep='first').sum()
print(f"Number of duplicate customers: {duplicate_customers}")

# Remove duplicates
df_no_dupes = df.drop_duplicates(keep='first')
print("\nDataset after removing duplicates:")
print(df_no_dupes)
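
One more option worth knowing: `keep=False` marks every copy of a duplicated row, which is handy when you want to review all of them before deciding which one to keep. A small sketch:

python
# Show every copy of rows that share a customer_id, not just the extra ones
all_copies = df[df.duplicated(subset=['customer_id'], keep=False)]
print("\nAll copies of duplicated customers:")
print(all_copies)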

3. Checking for Invalid Values

Invalid values can be numeric (out of range) or categorical (wrong format):

python
# Check for invalid age values (negative or unrealistically high)
invalid_ages = df_no_dupes[(df_no_dupes['age'] < 0) | (df_no_dupes['age'] > 120)]
print("\nRows with invalid ages:")
print(invalid_ages)

# Correct invalid ages
df_no_dupes.loc[df_no_dupes['age'] < 0, 'age'] = np.nan # Set negative ages to NaN
df_no_dupes.loc[df_no_dupes['age'] > 120, 'age'] = np.nan # Set unrealistic ages to NaN

# Fill NaN ages with median age
median_age = df_no_dupes['age'].median()
df_no_dupes['age'] = df_no_dupes['age'].fillna(median_age)

print("\nAges after correction:")
print(df_no_dupes[['name', 'age']])
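
If you prefer a more compact form, the two `.loc` assignments above can optionally be collapsed into a single `between` check:

python
# Optional, equivalent one-liner: anything outside the 0-120 range becomes NaN
# df_no_dupes.loc[~df_no_dupes['age'].between(0, 120), 'age'] = np.nan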

4. Checking Data Types and Formats

Ensuring consistent data types and formats is crucial for proper analysis:

python
# Check data types
print("\nCurrent data types:")
print(df_no_dupes.dtypes)

# Check email format (simple check for '@' and '.')
def is_valid_email(email):
    if pd.isnull(email):
        return False
    return '@' in email and '.' in email

# Apply email validation
df_no_dupes['valid_email'] = df_no_dupes['email'].apply(is_valid_email)
print("\nInvalid emails:")
print(df_no_dupes[~df_no_dupes['valid_email']])

# Standardize date format
def standardize_date(date):
    if pd.isnull(date) or date == 'Unknown':
        return None
    # Replace slash with dash
    return date.replace('/', '-')

df_no_dupes['purchase_date'] = df_no_dupes['purchase_date'].apply(standardize_date)

# Convert to datetime
df_no_dupes['purchase_date'] = pd.to_datetime(df_no_dupes['purchase_date'], errors='coerce')

print("\nDates after standardization:")
print(df_no_dupes[['name', 'purchase_date']])
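
The '@ and .' check above is deliberately simple. If you want something a bit stricter, here is a hedged regex-based variant (illustrative only, not a full RFC 5322 validator); the `valid_email_strict` column name is just an example.

python
import re

# Stricter (but still illustrative) email check: one '@', no whitespace,
# and at least one '.' after the '@'
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

def is_valid_email_strict(email):
    if pd.isnull(email):
        return False
    return bool(EMAIL_PATTERN.match(email))

df_no_dupes['valid_email_strict'] = df_no_dupes['email'].apply(is_valid_email_strict)
print(df_no_dupes[['email', 'valid_email_strict']])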

5. Statistical Outlier Detection

Outliers can significantly impact statistical analyses:

python
# Check for outliers in purchase amount using boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df_no_dupes['purchase_amount'])
plt.title('Purchase Amount Distribution')
plt.tight_layout()
plt.show()

# Detect outliers using IQR method
Q1 = df_no_dupes['purchase_amount'].quantile(0.25)
Q3 = df_no_dupes['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df_no_dupes[(df_no_dupes['purchase_amount'] < lower_bound) |
                       (df_no_dupes['purchase_amount'] > upper_bound)]

print("\nOutliers in purchase amount:")
print(outliers[['name', 'purchase_amount']])
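
The IQR rule is one common convention; another is the z-score: flag values more than about three standard deviations from the mean. A quick sketch (the 3-sigma cutoff is a rule of thumb, not a hard rule):

python
# Z-score based outlier check
amount = df_no_dupes['purchase_amount']
z_scores = (amount - amount.mean()) / amount.std()

print("\nOutliers by z-score:")
print(df_no_dupes.loc[z_scores.abs() > 3, ['name', 'purchase_amount']])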

6. Creating a Data Quality Report

It's helpful to create a summary report of your data quality checks:

python
def generate_data_quality_report(df):
    """Generate a basic data quality report for a dataframe"""
    quality_report = {}

    # Record count
    quality_report['total_records'] = len(df)

    # Missing values
    quality_report['missing_values'] = df.isnull().sum().to_dict()
    quality_report['missing_percentage'] = ((df.isnull().sum() / len(df)) * 100).to_dict()

    # Duplicate records
    quality_report['duplicate_records'] = df.duplicated().sum()

    # Data types
    quality_report['data_types'] = df.dtypes.astype(str).to_dict()

    # Unique values for categorical columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    quality_report['unique_values'] = {col: df[col].nunique() for col in categorical_columns}

    # Basic statistics for numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    quality_report['numeric_stats'] = df[numeric_columns].describe().to_dict()

    return quality_report

# Generate and print the report
quality_report = generate_data_quality_report(df_no_dupes)
print("\nData Quality Report:")
for section, content in quality_report.items():
    print(f"\n{section.upper()}:")
    if isinstance(content, dict):
        for key, value in content.items():
            print(f" {key}: {value}")
    else:
        print(f" {content}")

7. Automating Data Quality Checks

For ongoing data quality monitoring, you can create a function that performs standard checks:

python
def run_data_quality_checks(df, rules=None):
    """
    Run automated data quality checks based on defined rules

    Parameters:
        df (DataFrame): The pandas dataframe to check
        rules (dict): Dictionary of rules to check

    Returns:
        dict: Results of the data quality checks
    """
    if rules is None:
        rules = {}

    results = {"passed": [], "failed": []}

    # Check for missing values
    for column in df.columns:
        missing_pct = (df[column].isnull().sum() / len(df)) * 100
        threshold = rules.get(f"{column}_missing_threshold", 5)  # Default 5% threshold

        check_name = f"Missing values check: {column} (threshold: {threshold}%)"
        if missing_pct <= threshold:
            results["passed"].append(f"{check_name} - {missing_pct:.2f}%")
        else:
            results["failed"].append(f"{check_name} - {missing_pct:.2f}%")

    # Check for duplicates
    duplicate_pct = (df.duplicated().sum() / len(df)) * 100
    threshold = rules.get("duplicate_threshold", 1)  # Default 1% threshold

    check_name = f"Duplicate records check (threshold: {threshold}%)"
    if duplicate_pct <= threshold:
        results["passed"].append(f"{check_name} - {duplicate_pct:.2f}%")
    else:
        results["failed"].append(f"{check_name} - {duplicate_pct:.2f}%")

    # Add more checks as needed for your specific requirements

    return results

# Define rules
my_rules = {
    "customer_id_missing_threshold": 0,  # No missing customer IDs allowed
    "name_missing_threshold": 10,        # Up to 10% missing names allowed
    "duplicate_threshold": 0             # No duplicates allowed
}

# Run the checks
check_results = run_data_quality_checks(df_no_dupes, my_rules)

print("\nAutomated Data Quality Check Results:")
print("PASSED CHECKS:")
for check in check_results["passed"]:
    print(f"✅ {check}")

print("\nFAILED CHECKS:")
for check in check_results["failed"]:
    print(f"❌ {check}")

8. Real-world Application: Customer Data Validation

Let's put everything together in a real-world scenario where we validate customer data before importing it into a CRM system:

python
def clean_customer_data(df):
    """
    Comprehensive function to clean and validate customer data
    """
    # Create a copy to avoid modifying the original
    clean_df = df.copy()

    # 1. Remove duplicates based on customer_id
    clean_df = clean_df.drop_duplicates(subset=['customer_id'], keep='first')

    # 2. Handle missing values
    # Drop rows with missing customer_id (critical field)
    clean_df = clean_df.dropna(subset=['customer_id'])

    # Fill other missing values appropriately
    clean_df['name'] = clean_df['name'].fillna('Unknown')
    clean_df['email'] = clean_df['email'].fillna('unknown@example.com')
    clean_df['purchase_amount'] = clean_df['purchase_amount'].fillna(clean_df['purchase_amount'].mean())

    # 3. Validate and correct data types
    # Convert customer_id to integer
    clean_df['customer_id'] = clean_df['customer_id'].astype(int)

    # 4. Validate age ranges
    clean_df.loc[clean_df['age'] < 0, 'age'] = np.nan
    clean_df.loc[clean_df['age'] > 120, 'age'] = np.nan
    clean_df['age'] = clean_df['age'].fillna(clean_df['age'].median())
    clean_df['age'] = clean_df['age'].astype(int)

    # 5. Validate email format
    clean_df['email_valid'] = clean_df['email'].apply(is_valid_email)

    # 6. Standardize dates
    clean_df['purchase_date'] = clean_df['purchase_date'].apply(standardize_date)
    clean_df['purchase_date'] = pd.to_datetime(clean_df['purchase_date'], errors='coerce')

    # Flag records with quality issues
    clean_df['has_quality_issues'] = (
        (~clean_df['email_valid']) |
        (clean_df['name'] == 'Unknown')
    )

    return clean_df

# Apply the cleaning function
cleaned_df = clean_customer_data(df)

print("\nCleaned Customer Data:")
print(cleaned_df)

print("\nRecords with quality issues that need review:")
print(cleaned_df[cleaned_df['has_quality_issues']])
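
In a real CRM import, a common hand-off step (an assumed workflow, not something the tutorial prescribes) is to export the flagged records so someone can review them before loading. The file name below is just an example.

python
# Write records that need a human look to a CSV for manual review
cleaned_df[cleaned_df['has_quality_issues']].to_csv('records_for_review.csv', index=False)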

Summary

In this tutorial, you've learned how to implement essential data quality checks using Pandas:

  1. Identifying missing values and applying appropriate handling strategies
  2. Detecting and removing duplicates to ensure data integrity
  3. Validating data to ensure it meets business rules and constraints
  4. Standardizing formats for consistency
  5. Detecting outliers that might skew your analysis
  6. Creating data quality reports to summarize dataset health
  7. Automating checks for ongoing monitoring
  8. Applying these techniques to a real-world customer data scenario

By implementing these data quality checks, you'll ensure your data is clean, consistent, and ready for analysis, modeling, or reporting.

Practice Exercises

To reinforce your learning, try these exercises:

  1. Create a data quality check function for a dataset of your choice
  2. Implement a validation rule for phone numbers in the customer dataset
  3. Create a visualization that shows the distribution of data quality issues
  4. Implement a function that scores data quality on a scale from 0-100
  5. Create a data validation pipeline that reads data from a CSV, performs quality checks, and outputs a cleaned CSV and a quality report

Happy data cleaning!


