Pandas Data Validation
Data validation is a crucial step in any data analysis workflow. Before you can trust your analysis results, you must ensure that your data is clean, accurate, and meets your expectations. In this tutorial, we'll explore different techniques for validating data in pandas.
Introduction to Data Validation
Data validation is the process of ensuring your data meets certain quality standards before analysis. It helps identify issues like:
- Missing values
- Incorrect data types
- Values outside expected ranges
- Inconsistent formatting
- Duplicate records
- Logical inconsistencies
By implementing proper validation, you can prevent errors in your analysis and ensure your conclusions are reliable.
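Before applying any specific technique, it is worth running a quick first-pass profile of your data. As a minimal sketch (assuming a DataFrame named df is already loaded), a few built-in methods cover most of the issues listed above:
# Quick first-pass profile of a DataFrame
print(df.dtypes)               # data types per column
print(df.isna().sum())         # missing values per column
print(df.duplicated().sum())   # count of fully duplicated rows
print(df.describe())           # value ranges for numeric columns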
Basic Data Validation Techniques
1. Checking Data Types
One of the first validation steps is ensuring columns have the correct data types.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'age': [25, 'thirty', 35, 40, 45],  # Note the string in a numeric column
    'email': ['alice@example.com', 'bob@example', 'charlie@example.com',
              'dave@example.com', 'eve@example.com']
})
# Check data types
print(df.dtypes)
Output:
id        int64
name     object
age      object   # Should be numeric, but contains a string
email    object
dtype: object
Notice that age is of type object (pandas' way of handling strings) instead of numeric, because of the non-numeric value 'thirty'.
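If the data comes from a file, type problems can often be caught at load time. The sketch below assumes a hypothetical file data.csv; requesting dtypes up front makes pd.read_csv raise an error on a bad value instead of silently producing an object column:
# Hypothetical file name; requesting dtypes at load time makes a value
# like 'thirty' in the age column raise a ValueError instead of being
# silently loaded into an object column
df_typed = pd.read_csv('data.csv', dtype={'id': 'int64', 'age': 'float64'})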
2. Identifying Invalid Values
You can identify invalid numeric values using pandas functions:
# Try to convert age to numeric, with errors='coerce' to set invalid values to NaN
df['age_numeric'] = pd.to_numeric(df['age'], errors='coerce')
print(df)
# Now we can see which rows have invalid age values
print("Rows with invalid ages:")
print(df[df['age_numeric'].isna()])
Output:
   id     name     age                email  age_numeric
0   1    Alice      25    alice@example.com         25.0
1   2      Bob  thirty          bob@example          NaN
2   3  Charlie      35  charlie@example.com         35.0
3   4     Dave      40     dave@example.com         40.0
4   5      Eve      45      eve@example.com         45.0
Rows with invalid ages:
   id name     age        email  age_numeric
1   2  Bob  thirty  bob@example          NaN
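Once invalid values are flagged, you must decide how to handle them. Two common options, sketched below, are dropping the affected rows or imputing a replacement value:
# Option 1: drop rows whose age could not be parsed
df_clean = df.dropna(subset=['age_numeric'])
# Option 2: replace unparseable ages with the median of the valid ones
df_filled = df.copy()
df_filled['age_numeric'] = df_filled['age_numeric'].fillna(df_filled['age_numeric'].median())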
Advanced Validation Techniques
1. Value Range Validation
Ensure numeric values fall within expected ranges:
# Check if ages are within a reasonable range (e.g., 0-120)
valid_age_mask = (df['age_numeric'] >= 0) & (df['age_numeric'] <= 120)
invalid_ages = df[~valid_age_mask & df['age_numeric'].notna()]
print("Ages outside valid range:")
print(invalid_ages)
# Prints only the header and an empty DataFrame, because every parsed age is within range
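The same range check can be written more compactly with Series.between, whose bounds are inclusive by default:
# Equivalent check using Series.between (both bounds inclusive by default)
valid_age_mask = df['age_numeric'].between(0, 120)
print(df[~valid_age_mask & df['age_numeric'].notna()])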
2. Pattern Matching for Strings
Validate string columns using regular expressions:
import re
# Define a simple email validation pattern
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Create a function to validate emails
def is_valid_email(email):
    if pd.isna(email):
        return False
    return bool(re.match(email_pattern, email))
# Apply the validation function
df['valid_email'] = df['email'].apply(is_valid_email)
print(df[['email', 'valid_email']])
Output:
                 email  valid_email
0    alice@example.com         True
1          bob@example        False
2  charlie@example.com         True
3     dave@example.com         True
4      eve@example.com         True
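Applying a Python function element by element works fine here, but on larger DataFrames the vectorized string methods are usually faster. The same check can be expressed with Series.str.match; passing na=False treats missing emails as invalid:
# Vectorized version of the same email check
df['valid_email'] = df['email'].str.match(email_pattern, na=False)
print(df[['email', 'valid_email']])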
3. Uniqueness Constraints
Check for duplicate values in columns that should be unique:
# Check if IDs are unique
duplicate_ids = df[df['id'].duplicated()]
print(f"Number of duplicate IDs: {len(duplicate_ids)}")
# Check if there are completely duplicate rows
duplicate_rows = df[df.duplicated()]
print(f"Number of completely duplicate rows: {len(duplicate_rows)}")
Output:
Number of duplicate IDs: 0
Number of completely duplicate rows: 0
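When duplicates do turn up, duplicated(keep=False) marks every occurrence instead of all but the first, which makes the full duplicate groups easy to inspect:
# Show every row that shares a duplicated ID, not just the repeats
all_dupes = df[df['id'].duplicated(keep=False)]
print(all_dupes.sort_values('id'))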
Creating a Validation Framework
For more complex datasets, it's helpful to create a systematic validation framework:
def validate_dataframe(df, rules):
    """
    Validate a DataFrame against a set of rules.

    Parameters:
        df (DataFrame): The DataFrame to validate
        rules (dict): Dictionary of column names and validation rules

    Returns:
        dict: Dictionary of validation results
    """
    results = {}
    for column, column_rules in rules.items():
        if column not in df.columns:
            results[column] = {"present": False}
            continue
        column_results = {"present": True}
        # Check data type
        if "dtype" in column_rules:
            expected_type = column_rules["dtype"]
            actual_type = df[column].dtype
            column_results["correct_type"] = (actual_type == expected_type)
        # Check for nulls
        if "allow_nulls" in column_rules:
            allow_nulls = column_rules["allow_nulls"]
            has_nulls = df[column].isna().any()
            column_results["nulls_valid"] = (not has_nulls) if not allow_nulls else True
        # Check for unique values
        if "unique" in column_rules:
            should_be_unique = column_rules["unique"]
            is_unique = df[column].is_unique
            column_results["uniqueness_valid"] = is_unique if should_be_unique else True
        # Custom validation function
        if "custom_validation" in column_rules:
            validation_func = column_rules["custom_validation"]
            column_results["custom_valid"] = validation_func(df[column])
        results[column] = column_results
    return results
# Example usage
validation_rules = {
    "id": {
        "dtype": "int64",
        "allow_nulls": False,
        "unique": True
    },
    "name": {
        "dtype": "object",
        "allow_nulls": False
    },
    "age_numeric": {
        "dtype": "float64",
        "allow_nulls": True,
        "custom_validation": lambda x: (x[x.notna()] >= 0).all() and (x[x.notna()] <= 120).all()
    },
    "email": {
        "dtype": "object",
        "custom_validation": lambda x: x.str.contains('@').all() and x.str.contains(r'\.').all()
    }
}
validation_results = validate_dataframe(df, validation_rules)
# Print the results in a readable format
for column, results in validation_results.items():
    print(f"\nColumn: {column}")
    for check, result in results.items():
        print(f"  {check}: {result}")
Output:
Column: id
  present: True
  correct_type: True
  nulls_valid: True
  uniqueness_valid: True

Column: name
  present: True
  correct_type: True
  nulls_valid: True

Column: age_numeric
  present: True
  correct_type: True
  nulls_valid: True
  custom_valid: True

Column: email
  present: True
  correct_type: True
  custom_valid: False
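Because the framework returns a plain nested dictionary, the results can also be loaded into a DataFrame for a compact tabular overview (a small convenience sketch, not part of the framework itself):
# View the nested results dictionary as a table; checks that were not
# configured for a column simply appear as NaN
summary = pd.DataFrame(validation_results).T
print(summary)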
Real-World Example: Validating Customer Data
Let's apply validation to a more realistic dataset:
# Create a more complex customer dataset
customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['John Smith', 'Jane Doe', np.nan, 'Robert Johnson', 'Lisa Brown'],
    'age': [35, 42, 28, -5, 31],  # Note the negative age
    'email': ['john@example.com', 'jane@example.com', 'david@sample',
              'robert@example.com', 'lisa@example.com'],
    'phone': ['555-1234', '555-5678', '555-9012', '5551234', '555-7890'],
    'signup_date': ['2022-01-15', '2022-02-20', '2022-03-10', '2022/04/05', '2022-05-12']
})
print(customers)
Output:
   customer_id            name  age               email     phone signup_date
0         1001      John Smith   35    john@example.com  555-1234  2022-01-15
1         1002        Jane Doe   42    jane@example.com  555-5678  2022-02-20
2         1003             NaN   28        david@sample  555-9012  2022-03-10
3         1004  Robert Johnson   -5  robert@example.com   5551234  2022/04/05
4         1005      Lisa Brown   31    lisa@example.com  555-7890  2022-05-12
Now let's validate this data:
# Define validation rules for customer data
customer_rules = {
    'customer_id': {
        'dtype': 'int64',
        'allow_nulls': False,
        'unique': True,
        'custom_validation': lambda x: (x > 1000).all()  # IDs must be > 1000
    },
    'name': {
        'dtype': 'object',
        'allow_nulls': False
    },
    'age': {
        'dtype': 'int64',
        'allow_nulls': False,
        'custom_validation': lambda x: (x >= 0).all() and (x <= 120).all()  # Reasonable age range
    },
    'email': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('@').all() and x.str.endswith('.com').all()
    },
    'phone': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('-').all()  # Simplified validation
    },
    'signup_date': {
        'dtype': 'object',
        'allow_nulls': False,
        'custom_validation': lambda x: x.str.contains('-').all()  # Checking consistent format
    }
}
# Validate the customer data
customer_validation = validate_dataframe(customers, customer_rules)
# Print the validation results
issues_found = False
for column, results in customer_validation.items():
    if not all(results.values()):
        issues_found = True
        print(f"\nIssues found in column: {column}")
        for check, result in results.items():
            if not result:
                print(f"  Failed check: {check}")
if not issues_found:
    print("All validation checks passed!")
Output:
Issues found in column: name
  Failed check: nulls_valid

Issues found in column: age
  Failed check: custom_valid

Issues found in column: email
  Failed check: custom_valid

Issues found in column: phone
  Failed check: custom_valid

Issues found in column: signup_date
  Failed check: custom_valid
Based on the validation results, we need to address several issues:
- Missing name for customer 1003
- Negative age value for customer 1004
- Invalid email format for customer 1003
- Inconsistent phone number formatting
- Inconsistent date format
Let's fix these issues:
# Fix the issues
fixed_customers = customers.copy()
# Fix missing name
fixed_customers.loc[2, 'name'] = 'David Wilson' # Assuming we found the correct name
# Fix negative age
fixed_customers.loc[3, 'age'] = 45 # Correcting to a positive value
# Fix emails
fixed_customers.loc[2, 'email'] = 'david@sample.com'  # Appending the missing .com domain
# Fix phone formatting
fixed_customers['phone'] = fixed_customers['phone'].str.replace(r'^(\d{3})(\d{4})$', r'\1-\2', regex=True)
# Fix date formatting
fixed_customers['signup_date'] = fixed_customers['signup_date'].str.replace('/', '-')
print("Fixed customer data:")
print(fixed_customers)
# Re-validate
fixed_validation = validate_dataframe(fixed_customers, customer_rules)
# Check if we have any remaining issues
remaining_issues = False
for column, results in fixed_validation.items():
    if not all(results.values()):
        remaining_issues = True
        print(f"\nRemaining issues in column: {column}")
        for check, result in results.items():
            if not result:
                print(f"  Failed check: {check}")
if not remaining_issues:
    print("\nAll validation checks now pass!")
Output:
Fixed customer data:
   customer_id            name  age               email     phone signup_date
0         1001      John Smith   35    john@example.com  555-1234  2022-01-15
1         1002        Jane Doe   42    jane@example.com  555-5678  2022-02-20
2         1003    David Wilson   28    david@sample.com  555-9012  2022-03-10
3         1004  Robert Johnson   45  robert@example.com  555-1234  2022-04-05
4         1005      Lisa Brown   31    lisa@example.com  555-7890  2022-05-12
All validation checks now pass!
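Note that the contains('-') rule for signup_date only catches the most obvious formatting problem. A more robust approach, sketched below, is to parse the column with pd.to_datetime using a fixed format and errors='coerce', so anything that does not parse becomes NaT:
# Parse dates strictly; rows that do not match the format become NaT
parsed = pd.to_datetime(fixed_customers['signup_date'], format='%Y-%m-%d', errors='coerce')
print("Unparseable signup dates:")
print(fixed_customers[parsed.isna()])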
Using pandas-schema for Validation
For more complex validation needs, you can use the pandas-schema library:
# pip install pandas-schema
from pandas_schema import Schema, Column
from pandas_schema.validation import CustomElementValidation, MatchesPatternValidation
from pandas_schema.validation import InRangeValidation, IsDtypeValidation
# Define a validation schema
schema = Schema([
    Column('customer_id', [
        IsDtypeValidation(np.int64),
        InRangeValidation(1000, 9999)
    ]),
    Column('name', [
        CustomElementValidation(lambda x: pd.notna(x) and x != '', 'name cannot be empty')
    ]),
    Column('age', [
        IsDtypeValidation(np.int64),
        InRangeValidation(0, 120)
    ]),
    Column('email', [
        MatchesPatternValidation(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
    ]),
    Column('phone', [
        MatchesPatternValidation(r'^\d{3}-\d{4}$')
    ])
])
# Validate the fixed data; pandas-schema expects the DataFrame's columns
# to match the schema, so select only the columns the schema covers
errors = schema.validate(fixed_customers[['customer_id', 'name', 'age', 'email', 'phone']])
if errors:
    print("Validation errors:")
    for error in errors:
        print(error)
else:
    print("No validation errors found with pandas-schema!")
Summary
Data validation is a critical step in the data cleaning process. In this tutorial, we learned:
- How to check data types and identify invalid values
- How to validate value ranges, patterns, and uniqueness
- How to create a custom validation framework
- How to apply validation to real-world datasets
- How to use specialized libraries like pandas-schema
By implementing these validation techniques, you'll catch data quality issues early, ensuring your analyses are built on solid foundations.
Exercises
- Create a validation framework for a dataset containing product information (product_id, name, price, category, in_stock).
- Write custom validation functions for validating:
  - Email addresses with specific domain requirements
  - Phone numbers in international format
  - ZIP/postal codes for your country
- Apply the validation techniques you've learned to your own dataset and document the issues you find.
- Extend the validation framework to generate a report with the percentage of valid data for each column.