Python Data Cleaning
Data cleaning is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies from datasets to improve their quality. In the data science workflow, data cleaning is a crucial step that often consumes a significant portion of a data scientist's time.
Why Data Cleaning Matters
Raw data is often messy:
- It may contain missing values
- It may have duplicate entries
- Values might be in incorrect formats
- There could be outliers that skew analysis
- Text data might need normalization
Clean data leads to more accurate analyses, more reliable models, and ultimately better decisions. Let's explore how to clean data effectively using Python.
Setting Up Your Environment
First, let's import the libraries we'll need:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading a Sample Dataset
For our examples, we'll use a sample dataset with common data quality issues:
# Create a sample messy dataset
data = {
    'customer_id': [1001, 1002, 1002, None, 1004, 1005, 1006],
    'name': ['John Smith', 'Jane Doe', 'Jane Doe', 'Bob Johnson', None, 'Sarah Wilson', 'MIKE BROWN'],
    'email': ['john@example.com', 'jane@example.com', 'jane@example.com', None, 'alice@example', 'sarah@example.com', 'mike@example.com'],
    'age': [34, 28, 28, 45, 32, -999, 200],
    'signup_date': ['2021-01-15', '2021/02/20', '2021/02/20', 'January 30, 2021', '2021-04-05', '2021-05-12', '2021-06-01'],
    'last_purchase': ['$123.45', '$67.80', '$67.80', '$890.0', None, '$45.20', '$200.5']
}
df = pd.DataFrame(data)
print(df)
Output:
   customer_id          name              email  age       signup_date last_purchase
0       1001.0    John Smith   john@example.com   34        2021-01-15       $123.45
1       1002.0      Jane Doe   jane@example.com   28        2021/02/20        $67.80
2       1002.0      Jane Doe   jane@example.com   28        2021/02/20        $67.80
3          NaN   Bob Johnson               None   45  January 30, 2021        $890.0
4       1004.0          None      alice@example   32        2021-04-05          None
5       1005.0  Sarah Wilson  sarah@example.com -999        2021-05-12        $45.20
6       1006.0    MIKE BROWN   mike@example.com  200        2021-06-01        $200.5
Exploring the Dataset
Before cleaning, it's important to understand what issues exist in your data:
# Get basic information about the dataset
print("Dataset information:")
print(df.info())
print("\nDescriptive statistics:")
print(df.describe())
print("\nMissing values:")
print(df.isna().sum())
Output:
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    6 non-null      float64
 1   name           6 non-null      object 
 2   email          6 non-null      object 
 3   age            7 non-null      int64  
 4   signup_date    7 non-null      object 
 5   last_purchase  6 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 464.0+ bytes
None
Descriptive statistics:
       customer_id         age
count     6.000000    7.000000
mean   1003.333333  -90.285714
std       1.966384  405.527112
min    1001.000000 -999.000000
25%    1002.000000   28.000000
50%    1003.000000   32.000000
75%    1004.750000   39.500000
max    1006.000000  200.000000
Missing values:
customer_id    1
name           1
email          1
age            0
signup_date    0
last_purchase  1
dtype: int64
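We imported matplotlib and seaborn earlier, and a quick visual check complements these summaries. Here is a minimal sketch that plots the missing-value pattern as a heatmap (the resulting figure is illustrative and not shown here):
# Visualize which cells are missing: each column becomes a strip of
# True/False values, so gaps stand out at a glance
sns.heatmap(df.isna(), cbar=False, yticklabels=False)
plt.title('Missing values by column')
plt.tight_layout()
plt.show()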
Common Data Cleaning Tasks
Let's address the main data cleaning tasks one by one:
1. Handling Missing Values
First, let's identify and deal with missing values:
# Check for missing values
print("Missing values before cleaning:")
print(df.isna().sum())
# Fill missing customer_id with a placeholder
df['customer_id'] = df['customer_id'].fillna(-1)
# Fill missing names with "Unknown"
df['name'] = df['name'].fillna("Unknown")
# Drop rows with missing email (if email is critical)
df_email_cleaned = df.dropna(subset=['email'])
print("\nDataset after dropping rows with missing emails:")
print(df_email_cleaned.shape)
# Alternative: Fill missing emails with a placeholder
df['email'] = df['email'].fillna("unknown@example.com")
# Fill missing last_purchase with 0
df['last_purchase'] = df['last_purchase'].fillna("$0")
print("\nMissing values after cleaning:")
print(df.isna().sum())
Output:
Missing values before cleaning:
customer_id    1
name           1
email          1
age            0
signup_date    0
last_purchase  1
dtype: int64
Dataset after dropping rows with missing emails:
(6, 6)
Missing values after cleaning:
customer_id    0
name           0
email          0
age            0
signup_date    0
last_purchase  0
dtype: int64
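Placeholders are only one strategy. Depending on the column, a statistic or a neighboring value is often a better fill; here are a couple of alternatives, sketched as comments so they don't change our walkthrough:
# Numeric columns: fill with a central statistic such as the median
# df['age'] = df['age'].fillna(df['age'].median())
# Time-ordered data: carry the previous observation forward
# df['signup_date'] = df['signup_date'].ffill()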
2. Removing Duplicates
Now let's identify and remove duplicate records:
# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")
# Check for duplicates based on specific columns
print(f"Duplicate customer_id entries: {df.duplicated(subset=['customer_id']).sum()}")
# Remove duplicate rows (keep first occurrence)
df_no_dupes = df.drop_duplicates()
print(f"\nDataset shape before removing duplicates: {df.shape}")
print(f"Dataset shape after removing duplicates: {df_no_dupes.shape}")
# Remove duplicates based on specific columns (e.g., customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'])
print(f"Dataset after removing duplicate customers: {df_unique_customers.shape}")
Output:
Duplicate rows: 1
Duplicate customer_id entries: 1
Dataset shape before removing duplicates: (7, 6)
Dataset shape after removing duplicates: (6, 6)
Dataset after removing duplicate customers: (6, 6)
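By default, drop_duplicates keeps the first occurrence of each group. If your data is sorted so that the newest record comes last, keep='last' retains it instead, and keep=False is handy for inspecting every copy; a quick sketch:
# Keep the most recent record per customer (assumes rows are in time order)
# df.drop_duplicates(subset=['customer_id'], keep='last')
# View all rows that belong to a duplicate group
# df[df.duplicated(subset=['customer_id'], keep=False)]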
3. Fixing Data Types
Let's correct data types for better analysis:
# Convert customer_id to integer
df['customer_id'] = df['customer_id'].astype('Int64')  # Int64 allows for NaN values
# Clean up and convert signup_date to datetime; converting per element
# lets pandas infer each row's format, and errors='coerce' turns
# anything unparseable into NaT instead of raising
df['signup_date'] = df['signup_date'].apply(lambda d: pd.to_datetime(d, errors='coerce'))
# Convert last_purchase from string to float; regex=False makes
# str.replace treat '$' as a literal character, not a regex anchor
df['last_purchase'] = df['last_purchase'].str.replace('$', '', regex=False).astype(float)
print(df.dtypes)
Output:
customer_id             Int64
name                   object
email                  object
age                     int64
signup_date    datetime64[ns]
last_purchase         float64
dtype: object
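When a numeric column may still contain stray strings, pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN; a small illustrative sketch:
# Coerce a messy numeric column: unparseable entries become NaN
# df['age'] = pd.to_numeric(df['age'], errors='coerce')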
4. Handling Outliers and Invalid Values
Let's detect and handle outliers and invalid values:
# Identify outliers in age using IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Age outliers bounds: lower = {lower_bound}, upper = {upper_bound}")
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print("Outliers in age:")
print(outliers[['customer_id', 'name', 'age']])
# Replace invalid ages with median
median_age = df['age'].median()
df.loc[df['age'] < 0, 'age'] = median_age  # Replace negative ages
df.loc[df['age'] > 120, 'age'] = median_age  # Replace unrealistically high ages
print("\nAge statistics after fixing outliers:")
print(df['age'].describe())
Output:
Age outlier bounds: lower = 10.75, upper = 56.75
Outliers in age:
  customer_id          name  age
5        1005  Sarah Wilson -999
6        1006    MIKE BROWN  200
Age statistics after fixing outliers:
count     7.000000
mean     33.000000
std       5.744563
min      28.000000
25%      30.000000
50%      32.000000
75%      33.000000
max      45.000000
Name: age, dtype: float64
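The IQR method is robust because quartiles are barely affected by extreme values. A common alternative is the Z-score method, which flags points more than about three standard deviations from the mean; it assumes roughly normal data, so it works best after the grossest errors are fixed. A minimal sketch:
# Z-score method: standardize the column, then flag |z| > 3 as outliers
z_scores = (df['age'] - df['age'].mean()) / df['age'].std()
print(df[z_scores.abs() > 3][['customer_id', 'name', 'age']])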
5. Standardizing Text Data
Text data often needs standardization:
# Standardize case in name (convert to title case)
df['name'] = df['name'].str.title()
# Check for invalid email formats
import re
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    if pd.isna(email):
        return False
    return bool(re.match(pattern, email))
df['valid_email'] = df['email'].apply(is_valid_email)
print("Invalid emails:")
print(df[~df['valid_email']][['name', 'email']])
# Clean up email addresses (for this example, just add domain if missing)
def fix_email(email):
    if pd.isna(email) or '@' not in email:
        return "[email protected]"
    if not email.endswith('.com'):
        return email + '.com'
    return email
df['email'] = df['email'].apply(fix_email)
Output:
Invalid emails:
      name          email
4  Unknown  alice@example
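A few other normalizations come up constantly with text columns; these one-liners are illustrative:
# Trim stray whitespace around names
# df['name'] = df['name'].str.strip()
# Email addresses are case-insensitive, so lowercase them for matching
# df['email'] = df['email'].str.lower()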
Real-World Data Cleaning Example
Let's apply these techniques to a more realistic scenario using the Titanic dataset:
# Load the Titanic dataset
titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(titanic.head())
# Get an overview
print("\nDataset info:")
print(titanic.info())
print("\nMissing values:")
print(titanic.isna().sum())
# Data cleaning steps
titanic_clean = titanic.copy()
# 1. Handle missing values
# Fill missing Age with median
titanic_clean['Age'] = titanic_clean['Age'].fillna(titanic_clean['Age'].median())
# Fill missing Embarked with most frequent value
most_common_embarked = titanic_clean['Embarked'].mode()[0]
titanic_clean['Embarked'] = titanic_clean['Embarked'].fillna(most_common_embarked)
# Drop Cabin column (too many missing values)
titanic_clean = titanic_clean.drop('Cabin', axis=1)
# 2. Feature engineering
# Extract titles from names
titanic_clean['Title'] = titanic_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Consolidate titles
title_mapping = {
    "Mr": "Mr",
    "Miss": "Miss",
    "Mrs": "Mrs",
    "Master": "Master",
    "Dr": "Other",
    "Rev": "Other",
    "Col": "Other",
    "Major": "Other",
    "Mlle": "Miss",
    "Ms": "Miss",
    "Countess": "Mrs",
    "Lady": "Mrs",
    "Jonkheer": "Other",
    "Don": "Mr",
    "Dona": "Mrs",
    "Mme": "Mrs",
    "Capt": "Other",
    "Sir": "Other"
}
titanic_clean['Title'] = titanic_clean['Title'].map(title_mapping)
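# Note: any title not covered by the mapping becomes NaN after .map();
# bucketing those as "Other" keeps the column complete (a judgment call)
titanic_clean['Title'] = titanic_clean['Title'].fillna('Other')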
# 3. Convert categorical features
titanic_clean['Sex'] = titanic_clean['Sex'].map({'female': 1, 'male': 0})
# One-hot encode Embarked
embarked_dummies = pd.get_dummies(titanic_clean['Embarked'], prefix='Embarked')
titanic_clean = pd.concat([titanic_clean, embarked_dummies], axis=1)
# Check results
print("\nCleaned dataset info:")
print(titanic_clean.info())
print("\nCleaned dataset sample:")
print(titanic_clean.head())
Best Practices for Data Cleaning
- Always make a copy of your original data before cleaning
- Document all transformations applied to your data
- Validate your cleaning by checking data ranges and distributions (see the sketch after this list)
- Automate repetitive cleaning tasks with functions
- Handle missing data carefully - dropping rows may introduce bias
- Check for outliers before and after cleaning
- Standardize text data, dates, and categorical variables
- Use consistent naming conventions for columns and values
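One lightweight way to encode the validation step is a handful of assertions. The exact rules depend on your data; this sketch assumes the customer dataset from earlier:
def validate_customer_data(df):
    """Raise an AssertionError if the cleaned data violates basic expectations."""
    assert df['customer_id'].notna().all(), "customer_id must not be null"
    assert df['age'].between(0, 120).all(), "age must be within [0, 120]"
    assert not df.duplicated(subset=['customer_id']).any(), "duplicate customer_id values"
    return True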
Creating a Data Cleaning Pipeline
For repetitive cleaning tasks, create a pipeline:
def clean_customer_data(df):
    """
    A function to clean customer data
    """
    df = df.copy()  # Work with a copy
    
    # Handle missing values
    df['customer_id'] = df['customer_id'].fillna(-1).astype('Int64')
    df['name'] = df['name'].fillna("Unknown").str.title()
    df['email'] = df['email'].fillna("[email protected]")
    
    # Fix data types
    df['signup_date'] = df['signup_date'].apply(lambda x: pd.to_datetime(x, errors='coerce'))
    
    # Remove duplicates
    df = df.drop_duplicates(subset=['customer_id'])
    
    # Handle outliers in age
    median_age = df['age'].median()
    df.loc[df['age'] < 0, 'age'] = median_age
    df.loc[df['age'] > 120, 'age'] = median_age
    
    return df
# Use the pipeline
clean_df = clean_customer_data(df)
print("Cleaned dataframe:")
print(clean_df.head())
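If you also adopt the validation function from the best-practices section, it slots naturally onto the end of the pipeline:
# Validate the output before handing it downstream
# validate_customer_data(clean_df)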
Summary
Data cleaning is a critical step in any data science project. In this guide, we've covered:
- Identifying and handling missing values
- Removing duplicate data
- Fixing data types and formats
- Handling outliers and invalid values
- Standardizing text data
- Creating data cleaning pipelines
Remember that data cleaning is often iterative - as you analyze your data, you may discover new issues that need addressing. The goal is to create a dataset that is accurate, consistent, and ready for analysis.
Exercises
- Download a messy dataset of your choice and apply the techniques from this guide.
- Create a function to detect and remove outliers using the Z-score method.
- Develop a data validation function that checks if your cleaned data meets specific criteria.
- Apply regular expressions to clean a dataset with messy text fields.
- Create a complete data cleaning pipeline for a specific type of dataset (e.g., financial data, customer data, or medical data).
Happy cleaning!