Pandas Data Validation
Data validation is a crucial step in any data analysis workflow. Before you can trust your analysis results, you must ensure that your data is clean, accurate, and meets your expectations. In this tutorial, we'll explore different techniques for validating data in pandas.
Introduction to Data Validation
Data validation is the process of ensuring your data meets certain quality standards before analysis. It helps identify issues like:
- Missing values
- Incorrect data types
- Values outside expected ranges
- Inconsistent formatting
- Duplicate records
- Logical inconsistencies
By implementing proper validation, you can prevent errors in your analysis and ensure your conclusions are reliable.
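Before applying any specific technique, it is worth running a quick first-pass profile of your data. As a minimal sketch (assuming a DataFrame named df is already loaded), a few built-in methods cover most of the issues listed above:
# Quick first-pass profile of a DataFrame
print(df.dtypes)               # data types per column
print(df.isna().sum())         # missing values per column
print(df.duplicated().sum())   # count of fully duplicated rows
print(df.describe())           # value ranges for numeric columns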
Basic Data Validation Techniques
1. Checking Data Types
One of the first validation steps is ensuring columns have the correct data types.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [25, 'thirty', 35, 40, 45],  # Note the string in a numeric column
    'email': ['alice@example.com', 'bob@example', 'charlie@example.com',
              'dave@example.com', 'eve@example.com']
})
# Check data types
print(df.dtypes)
Output:
id        int64
name     object
age      object   # Should be numeric, but contains a string
email    object
dtype: object
Notice that age is of type object (pandas' way of handling strings) instead of numeric, because of the non-numeric value 'thirty'.
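If the data comes from a file, type problems can often be caught at load time. The sketch below assumes a hypothetical file data.csv; requesting dtypes up front makes pd.read_csv raise an error on a bad value instead of silently producing an object column:
# Hypothetical file name; requesting dtypes at load time makes a value
# like 'thirty' in the age column raise a ValueError instead of being
# silently loaded into an object column
df_typed = pd.read_csv('data.csv', dtype={'id': 'int64', 'age': 'float64'})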
2. Identifying Invalid Values
You can identify invalid numeric values using pandas functions:
# Try to convert age to numeric, with errors='coerce' to set invalid values to NaN
df['age_numeric'] = pd.to_numeric(df['age'], errors='coerce')
print(df)
# Now we can see which rows have invalid age values
print("Rows with invalid ages:")
print(df[df['age_numeric'].isna()])
Output:
   id     name     age                email  age_numeric
0   1    Alice      25    alice@example.com         25.0
1   2      Bob  thirty          bob@example          NaN
2   3  Charlie      35  charlie@example.com         35.0
3   4     Dave      40     dave@example.com         40.0
4   5      Eve      45      eve@example.com         45.0
Rows with invalid ages:
   id name     age        email  age_numeric
1   2  Bob  thirty  bob@example          NaN
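Once invalid values are flagged, you must decide how to handle them. Two common options, sketched below, are dropping the affected rows or imputing a replacement value:
# Option 1: drop rows whose age could not be parsed
df_clean = df.dropna(subset=['age_numeric'])
# Option 2: replace unparseable ages with the median of the valid ones
df_filled = df.copy()
df_filled['age_numeric'] = df_filled['age_numeric'].fillna(df_filled['age_numeric'].median())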
Advanced Validation Techniques
1. Value Range Validation
Ensure numeric values fall within expected ranges:
# Check if ages are within a reasonable range (e.g., 0-120)
valid_age_mask = (df['age_numeric'] >= 0) & (df['age_numeric'] <= 120)
invalid_ages = df[~valid_age_mask & df['age_numeric'].notna()]
print("Ages outside valid range:")
print(invalid_ages)
# Prints only the header and an empty DataFrame, because every parsed age is within range
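The same range check can be written more compactly with Series.between, whose bounds are inclusive by default:
# Equivalent check using Series.between (both bounds inclusive by default)
valid_age_mask = df['age_numeric'].between(0, 120)
print(df[~valid_age_mask & df['age_numeric'].notna()])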
2. Pattern Matching for Strings
Validate string columns using regular expressions:
import re
# Define a simple email validation pattern
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Create a function to validate emails
def is_valid_email(email):
    if pd.isna(email):
        return False
    return bool(re.match(email_pattern, email))
# Apply the validation function
df['valid_email'] = df['email'].apply(is_valid_email)
print(df[['email', 'valid_email']])
Output:
                 email  valid_email
0    alice@example.com         True
1          bob@example        False
2  charlie@example.com         True
3     dave@example.com         True
4      eve@example.com         True
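Applying a Python function element by element works fine here, but on larger DataFrames the vectorized string methods are usually faster. The same check can be expressed with Series.str.match; passing na=False treats missing emails as invalid:
# Vectorized version of the same email check
df['valid_email'] = df['email'].str.match(email_pattern, na=False)
print(df[['email', 'valid_email']])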
3. Uniqueness Constraints
Check for duplicate values in columns that should be unique:
# Check if IDs are unique
duplicate_ids = df[df['id'].duplicated()]
print(f"Number of duplicate IDs: {len(duplicate_ids)}")
# Check if there are completely duplicate rows
duplicate_rows = df[df.duplicated()]
print(f"Number of completely duplicate rows: {len(duplicate_rows)}")
Output:
Number of duplicate IDs: 0
Number of completely duplicate rows: 0
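When duplicates do turn up, duplicated(keep=False) marks every occurrence instead of all but the first, which makes the full duplicate groups easy to inspect:
# Show every row that shares a duplicated ID, not just the repeats
all_dupes = df[df['id'].duplicated(keep=False)]
print(all_dupes.sort_values('id'))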
Creating a Validation Framework
For more complex datasets, it's helpful to create a systematic validation framework:
def validate_dataframe(df, rules):
    """
    Validate a DataFrame against a set of rules.

    Parameters:
        df (DataFrame): The DataFrame to validate
        rules (dict): Dictionary of column names and validation rules

    Returns:
        dict: Dictionary of validation results
    """
    results = {}
    for column, column_rules in rules.items():
        if column not in df.columns:
            results[column] = {"present": False}
            continue
        column_results = {"present": True}
        # Check data type
        if "dtype" in column_rules:
            expected_type = column_rules["dtype"]
            actual_type = df[column].dtype
            column_results["correct_type"] = (actual_type == expected_type)
        # Check for nulls
        if "allow_nulls" in column_rules:
            allow_nulls = column_rules["allow_nulls"]
            has_nulls = df[column].isna().any()
            column_results["nulls_valid"] = (not has_nulls) if not allow_nulls else True
        # Check for unique values
        if "unique" in column_rules:
            should_be_unique = column_rules["unique"]
            is_unique = df[column].is_unique
            column_results["uniqueness_valid"] = is_unique if should_be_unique else True
        # Custom validation function
        if "custom_validation" in column_rules:
            validation_func = column_rules["custom_validation"]
            column_results["custom_valid"] = validation_func(df[column])
        results[column] = column_results
    return results
# Example usage
validation_rules = {
    "id": {
        "dtype": "int64",
        "allow_nulls": False,
        "unique": True
    },
    "name": {
        "dtype": "object",
        "allow_nulls": False
    },
    "age_numeric": {
        "dtype": "float64",
        "allow_nulls": True,
        "custom_validation": lambda x: (x[x.notna()] >= 0).all() and (x[x.notna()] <= 120).all()
    },
    "email": {
        "dtype": "object",
        "custom_validation": lambda x: x.str.contains('@').all() and x.str.contains(r'\.').all()
    }
}
validation_results = validate_dataframe(df, validation_rules)
# Print the results in a readable format
for column, results in validation_results.items():
    print(f"\nColumn: {column}")
    for check, result in results.items():
        print(f"  {check}: {result}")
Output:
Column: id
  present: True
  correct_type: True
  nulls_valid: True
  uniqueness_valid: True

Column: name
  present: True
  correct_type: True
  nulls_valid: True

Column: age_numeric
  present: True
  correct_type: True
  nulls_valid: True
  custom_valid: True

Column: email
  present: True
  correct_type: True
  custom_valid: False
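Because the framework returns a plain nested dictionary, the results can also be loaded into a DataFrame for a compact tabular overview (a small convenience sketch, not part of the framework itself):
# View the nested results dictionary as a table; checks that were not
# configured for a column simply appear as NaN
summary = pd.DataFrame(validation_results).T
print(summary)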
Real-World Example: Validating Customer Data
Let's apply validation to a more realistic dataset:
# Create a more complex customer dataset
customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['John Smith', 'Jane Doe', np.nan, 'Robert Johnson', 'Lisa Brown'],
    'age': [35, 42, 28, -5, 31],  # Note the negative age
    'email': ['john@example.com', 'jane@example.com', 'david@sample',
              'robert@example.com', 'lisa@example.com'],
    'phone': ['555-1234', '555-5678', '555-9012', '5551234', '555-7890'],
    'signup_date': ['2022-01-15', '2022-02-20', '2022-03-10', '2022/04/05', '2022-05-12']
})
print(customers)
Output:
   customer_id            name  age               email     phone signup_date
0         1001      John Smith   35    john@example.com  555-1234  2022-01-15
1         1002        Jane Doe   42    jane@example.com  555-5678  2022-02-20
2         1003             NaN   28        david@sample  555-9012  2022-03-10
3         1004  Robert Johnson   -5  robert@example.com   5551234  2022/04/05
4         1005      Lisa Brown   31    lisa@example.com  555-7890  2022-05-12
Now let's validate this data:
# Define validation rules for customer data
customer_rules = {
    'customer_id': {
        'dtype': 'int64',
        'allow_nulls': False,
        'unique': True,
        'custom_validation': lambda x: (x > 1000).all()  # IDs must be > 1000
    },
    'name': {
        'dtype': 'object',
        'allow_nulls': False
    },
    'age': {
        'dtype': 'int64',
        'allow_nulls': False,
        'custom_validation': lambda x: (x >= 0).all() and (x <= 120).all()  # Reasonable age range
    },
    'email': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('@').all() and x.str.endswith('.com').all()
    },
    'phone': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('-').all()  # Simplified validation
    },
    'signup_date': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('-').all()  # Checking consistent format
    }
}
# Validate the customer data
customer_validation = validate_dataframe(customers, customer_rules)
# Print the validation results
issues_found = False
for column, results in customer_validation.items():
    if not all(results.values()):
        issues_found = True
        print(f"\nIssues found in column: {column}")
        for check, result in results.items():
            if not result:
                print(f"  Failed check: {check}")
if not issues_found:
    print("All validation checks passed!")
Output:
Issues found in column: name
  Failed check: nulls_valid

Issues found in column: age
  Failed check: custom_valid

Issues found in column: email
  Failed check: custom_valid

Issues found in column: phone
  Failed check: custom_valid

Issues found in column: signup_date
  Failed check: custom_valid
Based on the validation results, we need to address several issues:
- Missing name for customer 1003
- Negative age value for customer 1004
- Invalid email format for customer 1003
- Inconsistent phone number formatting
- Inconsistent date format
Let's fix these issues:
# Fix the issues
fixed_customers = customers.copy()
# Fix missing name
fixed_customers.loc[2, 'name'] = 'David Wilson' # Assuming we found the correct name
# Fix negative age
fixed_customers.loc[3, 'age'] = 45 # Correcting to a positive value
# Fix emails
fixed_customers.loc[2, 'email'] = 'david@sample.com'  # Appending the missing .com domain
# Fix phone formatting
fixed_customers['phone'] = fixed_customers['phone'].str.replace(r'^(\d{3})(\d{4})$', r'\1-\2', regex=True)
# Fix date formatting
fixed_customers['signup_date'] = fixed_customers['signup_date'].str.replace('/', '-')
print("Fixed customer data:")
print(fixed_customers)
# Re-validate
fixed_validation = validate_dataframe(fixed_customers, customer_rules)
# Check if we have any remaining issues
remaining_issues = False
for column, results in fixed_validation.items():
    if not all(results.values()):
        remaining_issues = True
        print(f"\nRemaining issues in column: {column}")
        for check, result in results.items():
            if not result:
                print(f"  Failed check: {check}")
if not remaining_issues:
    print("\nAll validation checks now pass!")
Output:
Fixed customer data:
   customer_id            name  age               email     phone signup_date
0         1001      John Smith   35    john@example.com  555-1234  2022-01-15
1         1002        Jane Doe   42    jane@example.com  555-5678  2022-02-20
2         1003    David Wilson   28    david@sample.com  555-9012  2022-03-10
3         1004  Robert Johnson   45  robert@example.com  555-1234  2022-04-05
4         1005      Lisa Brown   31    lisa@example.com  555-7890  2022-05-12
All validation checks now pass!
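Note that the contains('-') rule for signup_date only catches the most obvious formatting problem. A more robust approach, sketched below, is to parse the column with pd.to_datetime using a fixed format and errors='coerce', so anything that does not parse becomes NaT:
# Parse dates strictly; rows that do not match the format become NaT
parsed = pd.to_datetime(fixed_customers['signup_date'], format='%Y-%m-%d', errors='coerce')
print("Unparseable signup dates:")
print(fixed_customers[parsed.isna()])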
Using pandas-schema for Validation
For more complex validation needs, you can use the pandas-schema library:
# pip install pandas-schema
from pandas_schema import Schema, Column
from pandas_schema.validation import CustomElementValidation, MatchesPatternValidation
from pandas_schema.validation import InRangeValidation, IsDtypeValidation
# Define a validation schema
schema = Schema([
    Column('customer_id', [
        IsDtypeValidation(np.int64),
        InRangeValidation(1000, 9999)
    ]),
    Column('name', [
        CustomElementValidation(lambda x: pd.notna(x) and x != '', 'name cannot be empty')
    ]),
    Column('age', [
        IsDtypeValidation(np.int64),
        InRangeValidation(0, 120)
    ]),
    Column('email', [
        MatchesPatternValidation(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
    ]),
    Column('phone', [
        MatchesPatternValidation(r'^\d{3}-\d{4}$')
    ])
])
# Validate the fixed data; pandas-schema expects the DataFrame's columns
# to match the schema, so select only the columns the schema covers
errors = schema.validate(fixed_customers[['customer_id', 'name', 'age', 'email', 'phone']])
if errors:
    print("Validation errors:")
    for error in errors:
        print(error)
else:
    print("No validation errors found with pandas-schema!")
Summary
Data validation is a critical step in the data cleaning process. In this tutorial, we learned:
- How to check data types and identify invalid values
- How to validate value ranges, patterns, and uniqueness
- How to create a custom validation framework
- How to apply validation to real-world datasets
- How to use specialized libraries like pandas-schema
By implementing these validation techniques, you'll catch data quality issues early, ensuring your analyses are built on solid foundations.
Exercises
- Create a validation framework for a dataset containing product information (product_id, name, price, category, in_stock).
- Write custom validation functions for validating:
  - Email addresses with specific domain requirements
  - Phone numbers in international format
  - ZIP/postal codes for your country
- Apply the validation techniques you've learned to your own dataset and document the issues you find.
- Extend the validation framework to generate a report with the percentage of valid data for each column.