Python Data Cleaning

Data cleaning is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies from datasets to improve their quality. In the data science workflow, data cleaning is a crucial step that often consumes a significant portion of a data scientist's time.

Why Data Cleaning Matters

Raw data is often messy:

  • It may contain missing values
  • It may have duplicate entries
  • Values might be in incorrect formats
  • There could be outliers that skew analysis
  • Text data might need normalization

Clean data leads to more accurate analyses, more reliable models, and ultimately better decisions. Let's explore how to clean data effectively using Python.

Setting Up Your Environment

First, let's import the libraries we'll need:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading a Sample Dataset

For our examples, we'll use a sample dataset with common data quality issues:

python
# Create a sample messy dataset
data = {
    'customer_id': [1001, 1002, 1002, None, 1004, 1005, 1006],
    'name': ['John Smith', 'Jane Doe', 'Jane Doe', 'Bob Johnson', None, 'Sarah Wilson', 'MIKE BROWN'],
    'email': ['john@example.com', 'jane@example.com', 'jane@example.com', None, 'alice@example', 'sarah@example.com', 'mike@example.com'],
    'age': [34, 28, 28, 45, 32, -999, 200],
    'signup_date': ['2021-01-15', '2021/02/20', '2021/02/20', 'January 30, 2021', '2021-04-05', '2021-05-12', '2021-06-01'],
    'last_purchase': ['$123.45', '$67.80', '$67.80', '$890.0', None, '$45.20', '$200.5']
}

df = pd.DataFrame(data)
print(df)

Output:

   customer_id          name              email   age       signup_date last_purchase
0       1001.0    John Smith   john@example.com    34        2021-01-15       $123.45
1       1002.0      Jane Doe   jane@example.com    28        2021/02/20        $67.80
2       1002.0      Jane Doe   jane@example.com    28        2021/02/20        $67.80
3          NaN   Bob Johnson               None    45  January 30, 2021        $890.0
4       1004.0          None      alice@example    32        2021-04-05          None
5       1005.0  Sarah Wilson  sarah@example.com  -999        2021-05-12        $45.20
6       1006.0    MIKE BROWN   mike@example.com   200        2021-06-01        $200.5

Exploring the Dataset

Before cleaning, it's important to understand what issues exist in your data:

python
# Get basic information about the dataset
print("Dataset information:")
print(df.info())
print("\nDescriptive statistics:")
print(df.describe())
print("\nMissing values:")
print(df.isna().sum())

Output:

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    6 non-null      float64
 1   name           6 non-null      object 
 2   email          5 non-null      object 
 3   age            7 non-null      int64  
 4   signup_date    7 non-null      object 
 5   last_purchase  6 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 464.0+ bytes
None

Descriptive statistics:
       customer_id         age
count     6.000000    7.000000
mean   1003.333333  -90.285714
std       1.966384  405.527111
min    1001.000000 -999.000000
25%    1002.000000   28.000000
50%    1003.000000   32.000000
75%    1004.750000   39.500000
max    1006.000000  200.000000

Missing values:
customer_id 1
name 1
email 2
age 0
signup_date 0
last_purchase 1
dtype: int64

Common Data Cleaning Tasks

Let's address the main data cleaning tasks one by one:

1. Handling Missing Values

First, let's identify and deal with missing values:

python
# Check for missing values
print("Missing values before cleaning:")
print(df.isna().sum())

# Fill missing customer_id with a placeholder
df['customer_id'] = df['customer_id'].fillna(-1)

# Fill missing names with "Unknown"
df['name'] = df['name'].fillna("Unknown")

# Drop rows with missing email (if email is critical)
df_email_cleaned = df.dropna(subset=['email'])
print("\nDataset after dropping rows with missing emails:")
print(df_email_cleaned.shape)

# Alternative: Fill missing emails with a placeholder
df['email'] = df['email'].fillna("no_email@example.com")

# Fill missing last_purchase with 0
df['last_purchase'] = df['last_purchase'].fillna("$0")

print("\nMissing values after cleaning:")
print(df.isna().sum())

Output:

Missing values before cleaning:
customer_id 1
name 1
email 2
age 0
signup_date 0
last_purchase 1
dtype: int64

Dataset after dropping rows with missing emails:
(5, 6)

Missing values after cleaning:
customer_id 0
name 0
email 0
age 0
signup_date 0
last_purchase 0
dtype: int64

2. Removing Duplicates

Now let's identify and remove duplicate records:

python
# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check for duplicates based on specific columns
print(f"Duplicate customer_id entries: {df.duplicated(subset=['customer_id']).sum()}")

# Remove duplicate rows (keep first occurrence)
df_no_dupes = df.drop_duplicates()
print(f"\nDataset shape before removing duplicates: {df.shape}")
print(f"Dataset shape after removing duplicates: {df_no_dupes.shape}")

# Remove duplicates based on specific columns (e.g., customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'])
print(f"Dataset after removing duplicate customers: {df_unique_customers.shape}")

Output:

Duplicate rows: 1
Duplicate customer_id entries: 1

Dataset shape before removing duplicates: (7, 6)
Dataset shape after removing duplicates: (6, 6)
Dataset after removing duplicate customers: (6, 6)

3. Fixing Data Types

Let's correct data types for better analysis:

python
# Convert customer_id to integer
df['customer_id'] = df['customer_id'].astype('Int64') # Int64 allows for NaN values

# Clean up and convert the signup_date to datetime
def parse_date(date_str):
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return pd.NaT

df['signup_date'] = df['signup_date'].apply(parse_date)

# Convert last_purchase from string to float (removing '$')
df['last_purchase'] = df['last_purchase'].str.replace('$', '', regex=False).astype(float)

print(df.dtypes)

Output:

customer_id             Int64
name                   object
email                  object
age                     int64
signup_date    datetime64[ns]
last_purchase         float64
dtype: object

4. Handling Outliers and Invalid Values

Let's detect and handle outliers and invalid values:

python
# Identify outliers in age using IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Age outliers bounds: lower = {lower_bound}, upper = {upper_bound}")
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print("Outliers in age:")
print(outliers[['customer_id', 'name', 'age']])

# Replace invalid ages with median
median_age = df['age'].median()
df.loc[df['age'] < 0, 'age'] = median_age # Replace negative ages
df.loc[df['age'] > 120, 'age'] = median_age # Replace unrealistically high ages

print("\nAge statistics after fixing outliers:")
print(df['age'].describe())

Output:

Age outliers bounds: lower = 10.75, upper = 56.75
Outliers in age:
  customer_id          name   age
5        1005  Sarah Wilson  -999
6        1006    MIKE BROWN   200

Age statistics after fixing outliers:
count     7.000000
mean     33.000000
std       5.744563
min      28.000000
25%      30.000000
50%      32.000000
75%      33.000000
max      45.000000
Name: age, dtype: float64

5. Standardizing Text Data

Text data often needs standardization:

python
# Standardize case in name (convert to title case)
df['name'] = df['name'].str.title()

# Check for invalid email formats
import re

def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    if pd.isna(email):
        return False
    return bool(re.match(pattern, email))

df['valid_email'] = df['email'].apply(is_valid_email)
print("Invalid emails:")
print(df[~df['valid_email']][['name', 'email']])

# Clean up email addresses (for this example, just add domain if missing)
def fix_email(email):
    if pd.isna(email) or '@' not in email:
        return "no_email@example.com"
    if not email.endswith('.com'):
        return email + '.com'
    return email

df['email'] = df['email'].apply(fix_email)

Output:

Invalid emails:
      name          email
4  Unknown  alice@example

Real-World Data Cleaning Example

Let's apply these techniques to a more realistic scenario using the Titanic dataset:

python
# Load the Titanic dataset
titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(titanic.head())

# Get an overview
print("\nDataset info:")
print(titanic.info())
print("\nMissing values:")
print(titanic.isna().sum())

# Data cleaning steps
titanic_clean = titanic.copy()

# 1. Handle missing values
# Fill missing Age with median
titanic_clean['Age'] = titanic_clean['Age'].fillna(titanic_clean['Age'].median())

# Fill missing Embarked with most frequent value
most_common_embarked = titanic_clean['Embarked'].mode()[0]
titanic_clean['Embarked'] = titanic_clean['Embarked'].fillna(most_common_embarked)

# Drop Cabin column (too many missing values)
titanic_clean = titanic_clean.drop('Cabin', axis=1)

# 2. Feature engineering
# Extract titles from names
titanic_clean['Title'] = titanic_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Consolidate titles
title_mapping = {
"Mr": "Mr",
"Miss": "Miss",
"Mrs": "Mrs",
"Master": "Master",
"Dr": "Other",
"Rev": "Other",
"Col": "Other",
"Major": "Other",
"Mlle": "Miss",
"Ms": "Miss",
"Countess": "Mrs",
"Lady": "Mrs",
"Jonkheer": "Other",
"Don": "Mr",
"Dona": "Mrs",
"Mme": "Mrs",
"Capt": "Other",
"Sir": "Other"
}
titanic_clean['Title'] = titanic_clean['Title'].map(title_mapping)

# 3. Convert categorical features
titanic_clean['Sex'] = titanic_clean['Sex'].map({'female': 1, 'male': 0})

# One-hot encode Embarked
embarked_dummies = pd.get_dummies(titanic_clean['Embarked'], prefix='Embarked')
titanic_clean = pd.concat([titanic_clean, embarked_dummies], axis=1)

# Check results
print("\nCleaned dataset info:")
print(titanic_clean.info())
print("\nCleaned dataset sample:")
print(titanic_clean.head())

Best Practices for Data Cleaning

  1. Always make a copy of your original data before cleaning
  2. Document all transformations applied to your data
  3. Validate your cleaning by checking data ranges and distributions (a short validation sketch follows this list)
  4. Automate repetitive cleaning tasks with functions
  5. Handle missing data carefully - dropping rows may introduce bias
  6. Check for outliers before and after cleaning
  7. Standardize text data, dates, and categorical variables
  8. Use consistent naming conventions for columns and values
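
As a quick illustration of practice 3, here is a minimal validation sketch. The function name, the specific checks, and the column names are assumptions based on the sample customer dataset above, so adapt them to your own data.

python
# A minimal, hypothetical validation helper (adjust the checks to your dataset)
def validate_customer_data(df):
    """Run simple sanity checks on a cleaned customer dataset and report the results."""
    checks = {
        "no_missing_values": df.isna().sum().sum() == 0,              # every cell populated
        "ages_in_plausible_range": df['age'].between(0, 120).all(),   # no negative or extreme ages
        "unique_customer_ids": df['customer_id'].is_unique,           # one row per customer
    }
    for check, passed in checks.items():
        print(f"{check}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())

# Example (after running the pipeline below): validate_customer_data(clean_df)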

Creating a Data Cleaning Pipeline

For repetitive cleaning tasks, create a pipeline:

python
def clean_customer_data(df):
    """
    A function to clean customer data
    """
    df = df.copy()  # Work with a copy

    # Handle missing values
    df['customer_id'] = df['customer_id'].fillna(-1).astype('Int64')
    df['name'] = df['name'].fillna("Unknown").str.title()
    df['email'] = df['email'].fillna("no_email@example.com")

    # Fix data types
    df['signup_date'] = df['signup_date'].apply(lambda x: pd.to_datetime(x, errors='coerce'))

    # Remove duplicates
    df = df.drop_duplicates(subset=['customer_id'])

    # Handle outliers in age
    median_age = df['age'].median()
    df.loc[df['age'] < 0, 'age'] = median_age
    df.loc[df['age'] > 120, 'age'] = median_age

    return df

# Use the pipeline
clean_df = clean_customer_data(df)
print("Cleaned dataframe:")
print(clean_df.head())

Summary

Data cleaning is a critical step in any data science project. In this guide, we've covered:

  • Identifying and handling missing values
  • Removing duplicate data
  • Fixing data types and formats
  • Handling outliers and invalid values
  • Standardizing text data
  • Creating data cleaning pipelines

Remember that data cleaning is often iterative - as you analyze your data, you may discover new issues that need addressing. The goal is to create a dataset that is accurate, consistent, and ready for analysis.

Exercises

  1. Download a messy dataset of your choice and apply the techniques from this guide.
  2. Create a function to detect and remove outliers using the Z-score method.
  3. Develop a data validation function that checks if your cleaned data meets specific criteria.
  4. Apply regular expressions to clean a dataset with messy text fields.
  5. Create a complete data cleaning pipeline for a specific type of dataset (e.g., financial data, customer data, or medical data).

Happy cleaning!


