Python Data Cleaning

Data cleaning is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies from datasets to improve their quality. In the data science workflow, data cleaning is a crucial step that often consumes a significant portion of a data scientist's time.

Why Data Cleaning Matters

Raw data is often messy:

  • It may contain missing values
  • It may have duplicate entries
  • Values might be in incorrect formats
  • There could be outliers that skew analysis
  • Text data might need normalization

Clean data leads to more accurate analyses, more reliable models, and ultimately better decisions. Let's explore how to clean data effectively using Python.

Setting Up Your Environment

First, let's import the libraries we'll need:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading a Sample Dataset

For our examples, we'll use a sample dataset with common data quality issues:

python
# Create a sample messy dataset
data = {
    'customer_id': [1001, 1002, 1002, None, 1004, 1005, 1006],
    'name': ['John Smith', 'Jane Doe', 'Jane Doe', 'Bob Johnson', None, 'Sarah Wilson', 'MIKE BROWN'],
    'email': ['john@example.com', 'jane@example.com', 'jane@example.com', None, 'alice@example', 'sarah@example.com', 'mike@example.com'],
    'age': [34, 28, 28, 45, 32, -999, 200],
    'signup_date': ['2021-01-15', '2021/02/20', '2021/02/20', 'January 30, 2021', '2021-04-05', '2021-05-12', '2021-06-01'],
    'last_purchase': ['$123.45', '$67.80', '$67.80', '$890.0', None, '$45.20', '$200.5']
}

df = pd.DataFrame(data)
print(df)

Output:

   customer_id          name              email   age       signup_date last_purchase
0       1001.0    John Smith   john@example.com    34        2021-01-15       $123.45
1       1002.0      Jane Doe   jane@example.com    28        2021/02/20        $67.80
2       1002.0      Jane Doe   jane@example.com    28        2021/02/20        $67.80
3          NaN   Bob Johnson               None    45  January 30, 2021        $890.0
4       1004.0          None      alice@example    32        2021-04-05          None
5       1005.0  Sarah Wilson  sarah@example.com  -999        2021-05-12        $45.20
6       1006.0    MIKE BROWN   mike@example.com   200        2021-06-01        $200.5

Exploring the Dataset

Before cleaning, it's important to understand what issues exist in your data:

python
# Get basic information about the dataset
print("Dataset information:")
print(df.info())
print("\nDescriptive statistics:")
print(df.describe())
print("\nMissing values:")
print(df.isna().sum())

Output:

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    6 non-null      float64
 1   name           6 non-null      object 
 2   email          5 non-null      object 
 3   age            7 non-null      int64  
 4   signup_date    7 non-null      object 
 5   last_purchase  6 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 464.0+ bytes
None

Descriptive statistics:
       customer_id         age
count     6.000000    7.000000
mean   1003.333333  -90.285714
std       1.966384  405.527111
min    1001.000000 -999.000000
25%    1002.000000   28.000000
50%    1003.000000   32.000000
75%    1004.750000   39.500000
max    1006.000000  200.000000

Missing values:
customer_id 1
name 1
email 2
age 0
signup_date 0
last_purchase 1
dtype: int64

Common Data Cleaning Tasks

Let's address the main data cleaning tasks one by one:

1. Handling Missing Values

First, let's identify and deal with missing values:

python
# Check for missing values
print("Missing values before cleaning:")
print(df.isna().sum())

# Fill missing customer_id with a placeholder
df['customer_id'] = df['customer_id'].fillna(-1)

# Fill missing names with "Unknown"
df['name'] = df['name'].fillna("Unknown")

# Drop rows with missing email (if email is critical)
df_email_cleaned = df.dropna(subset=['email'])
print("\nDataset after dropping rows with missing emails:")
print(df_email_cleaned.shape)

# Alternative: Fill missing emails with a placeholder
df['email'] = df['email'].fillna("no_email@example.com")

# Fill missing last_purchase with 0
df['last_purchase'] = df['last_purchase'].fillna("$0")

print("\nMissing values after cleaning:")
print(df.isna().sum())

Output:

Missing values before cleaning:
customer_id 1
name 1
email 2
age 0
signup_date 0
last_purchase 1
dtype: int64

Dataset after dropping rows with missing emails:
(5, 6)

Missing values after cleaning:
customer_id 0
name 0
email 0
age 0
signup_date 0
last_purchase 0
dtype: int64

2. Removing Duplicates

Now let's identify and remove duplicate records:

python
# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check for duplicates based on specific columns
print(f"Duplicate customer_id entries: {df.duplicated(subset=['customer_id']).sum()}")

# Remove duplicate rows (keep first occurrence)
df_no_dupes = df.drop_duplicates()
print(f"\nDataset shape before removing duplicates: {df.shape}")
print(f"Dataset shape after removing duplicates: {df_no_dupes.shape}")

# Remove duplicates based on specific columns (e.g., customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'])
print(f"Dataset after removing duplicate customers: {df_unique_customers.shape}")

Output:

Duplicate rows: 1
Duplicate customer_id entries: 1

Dataset shape before removing duplicates: (7, 6)
Dataset shape after removing duplicates: (6, 6)
Dataset after removing duplicate customers: (6, 6)

3. Fixing Data Types

Let's correct data types for better analysis:

python
# Convert customer_id to integer
df['customer_id'] = df['customer_id'].astype('Int64') # Int64 allows for NaN values

# Clean up and convert the signup_date to datetime
def parse_date(date_str):
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return pd.NaT

df['signup_date'] = df['signup_date'].apply(parse_date)

# Convert last_purchase from string to float (removing '$')
df['last_purchase'] = df['last_purchase'].str.replace('$', '', regex=False).astype(float)

print(df.dtypes)

Output:

customer_id             Int64
name                   object
email                  object
age                     int64
signup_date    datetime64[ns]
last_purchase         float64
dtype: object

4. Handling Outliers and Invalid Values

Let's detect and handle outliers and invalid values:

python
# Identify outliers in age using IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Age outliers bounds: lower = {lower_bound}, upper = {upper_bound}")
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print("Outliers in age:")
print(outliers[['customer_id', 'name', 'age']])

# Replace invalid ages with median
median_age = df['age'].median()
df.loc[df['age'] < 0, 'age'] = median_age # Replace negative ages
df.loc[df['age'] > 120, 'age'] = median_age # Replace unrealistically high ages

print("\nAge statistics after fixing outliers:")
print(df['age'].describe())

Output:

Age outliers bounds: lower = 10.75, upper = 56.75
Outliers in age:
  customer_id          name   age
5        1005  Sarah Wilson  -999
6        1006    MIKE BROWN   200

Age statistics after fixing outliers:
count     7.000000
mean     33.000000
std       5.744563
min      28.000000
25%      30.000000
50%      32.000000
75%      33.000000
max      45.000000
Name: age, dtype: float64

5. Standardizing Text Data

Text data often needs standardization:

python
# Standardize case in name (convert to title case)
df['name'] = df['name'].str.title()

# Check for invalid email formats
import re

def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    if pd.isna(email):
        return False
    return bool(re.match(pattern, email))

df['valid_email'] = df['email'].apply(is_valid_email)
print("Invalid emails:")
print(df[~df['valid_email']][['name', 'email']])

# Clean up email addresses (for this example, just add domain if missing)
def fix_email(email):
    if pd.isna(email) or '@' not in email:
        return "no_email@example.com"
    if not email.endswith('.com'):
        return email + '.com'
    return email

df['email'] = df['email'].apply(fix_email)

Output:

Invalid emails:
      name          email
4  Unknown  alice@example

Real-World Data Cleaning Example

Let's apply these techniques to a more realistic scenario using the Titanic dataset:

python
# Load the Titanic dataset
titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(titanic.head())

# Get an overview
print("\nDataset info:")
print(titanic.info())
print("\nMissing values:")
print(titanic.isna().sum())

# Data cleaning steps
titanic_clean = titanic.copy()

# 1. Handle missing values
# Fill missing Age with median
titanic_clean['Age'] = titanic_clean['Age'].fillna(titanic_clean['Age'].median())

# Fill missing Embarked with most frequent value
most_common_embarked = titanic_clean['Embarked'].mode()[0]
titanic_clean['Embarked'] = titanic_clean['Embarked'].fillna(most_common_embarked)

# Drop Cabin column (too many missing values)
titanic_clean = titanic_clean.drop('Cabin', axis=1)

# 2. Feature engineering
# Extract titles from names
titanic_clean['Title'] = titanic_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Consolidate titles
title_mapping = {
"Mr": "Mr",
"Miss": "Miss",
"Mrs": "Mrs",
"Master": "Master",
"Dr": "Other",
"Rev": "Other",
"Col": "Other",
"Major": "Other",
"Mlle": "Miss",
"Ms": "Miss",
"Countess": "Mrs",
"Lady": "Mrs",
"Jonkheer": "Other",
"Don": "Mr",
"Dona": "Mrs",
"Mme": "Mrs",
"Capt": "Other",
"Sir": "Other"
}
titanic_clean['Title'] = titanic_clean['Title'].map(title_mapping)

# 3. Convert categorical features
titanic_clean['Sex'] = titanic_clean['Sex'].map({'female': 1, 'male': 0})

# One-hot encode Embarked
embarked_dummies = pd.get_dummies(titanic_clean['Embarked'], prefix='Embarked')
titanic_clean = pd.concat([titanic_clean, embarked_dummies], axis=1)

# Check results
print("\nCleaned dataset info:")
print(titanic_clean.info())
print("\nCleaned dataset sample:")
print(titanic_clean.head())

Best Practices for Data Cleaning

  1. Always make a copy of your original data before cleaning
  2. Document all transformations applied to your data
  3. Validate your cleaning by checking data ranges and distributions (a short validation sketch follows this list)
  4. Automate repetitive cleaning tasks with functions
  5. Handle missing data carefully - dropping rows may introduce bias
  6. Check for outliers before and after cleaning
  7. Standardize text data, dates, and categorical variables
  8. Use consistent naming conventions for columns and values
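
As a quick illustration of practice 3, here is a minimal validation sketch. The function name, the specific checks, and the column names are assumptions based on the sample customer dataset above, so adapt them to your own data.

python
# A minimal, hypothetical validation helper (adjust the checks to your dataset)
def validate_customer_data(df):
    """Run simple sanity checks on a cleaned customer dataset and report the results."""
    checks = {
        "no_missing_values": df.isna().sum().sum() == 0,              # every cell populated
        "ages_in_plausible_range": df['age'].between(0, 120).all(),   # no negative or extreme ages
        "unique_customer_ids": df['customer_id'].is_unique,           # one row per customer
    }
    for check, passed in checks.items():
        print(f"{check}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())

# Example (after running the pipeline below): validate_customer_data(clean_df)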

Creating a Data Cleaning Pipeline

For repetitive cleaning tasks, create a pipeline:

python
def clean_customer_data(df):
    """
    A function to clean customer data
    """
    df = df.copy()  # Work with a copy

    # Handle missing values
    df['customer_id'] = df['customer_id'].fillna(-1).astype('Int64')
    df['name'] = df['name'].fillna("Unknown").str.title()
    df['email'] = df['email'].fillna("no_email@example.com")

    # Fix data types
    df['signup_date'] = df['signup_date'].apply(lambda x: pd.to_datetime(x, errors='coerce'))

    # Remove duplicates
    df = df.drop_duplicates(subset=['customer_id'])

    # Handle outliers in age
    median_age = df['age'].median()
    df.loc[df['age'] < 0, 'age'] = median_age
    df.loc[df['age'] > 120, 'age'] = median_age

    return df

# Use the pipeline
clean_df = clean_customer_data(df)
print("Cleaned dataframe:")
print(clean_df.head())

Summary

Data cleaning is a critical step in any data science project. In this guide, we've covered:

  • Identifying and handling missing values
  • Removing duplicate data
  • Fixing data types and formats
  • Handling outliers and invalid values
  • Standardizing text data
  • Creating data cleaning pipelines

Remember that data cleaning is often iterative - as you analyze your data, you may discover new issues that need addressing. The goal is to create a dataset that is accurate, consistent, and ready for analysis.

Exercises

  1. Download a messy dataset of your choice and apply the techniques from this guide.
  2. Create a function to detect and remove outliers using the Z-score method.
  3. Develop a data validation function that checks if your cleaned data meets specific criteria.
  4. Apply regular expressions to clean a dataset with messy text fields.
  5. Create a complete data cleaning pipeline for a specific type of dataset (e.g., financial data, customer data, or medical data).

Happy cleaning!


