Pandas String Cleaning

String data often requires special handling and cleaning before analysis. Pandas provides powerful string manipulation capabilities that can help you transform messy text data into clean, consistent values. In this guide, you'll learn how to use pandas' string methods to clean and manipulate text data in your DataFrames.

Introduction to String Data in Pandas

Text data in DataFrames can present various challenges:

  • Inconsistent capitalization
  • Extra whitespace
  • Special characters
  • Incorrect formats
  • Missing values

The pandas library provides the .str accessor that lets you apply string operations to Series objects containing text data. This accessor gives you access to vectorized string methods, similar to Python's built-in string methods, but optimized for pandas Series.
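
Here's a minimal, self-contained sketch of the accessor in action (the throwaway Series below is just for illustration):

python
import pandas as pd

# .str applies a string method to every element of the Series at once;
# missing values simply pass through as NaN
s = pd.Series(['  alice  ', 'BOB', None])
print(s.str.strip().str.title())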

Setting Up Your Environment

Let's start by importing the necessary libraries:

python
import pandas as pd
import numpy as np

Creating Sample Data

We'll create a simple DataFrame with messy string data to demonstrate cleaning techniques:

python
# Create sample data with string issues
data = {
    'name': [' John Smith', 'jane DOE', 'Robert Brown ', np.nan, 'Mary J. Watson'],
    'email': ['John.Smith@example.com', 'JANE.DOE@EXAMPLE.COM', 'robert.brown@example', None, 'mary.watson@example.com'],
    'address': ['123 Main St, New York, NY', '456 Elm St,Chicago,IL', '789 Oak St, San Francisco, CA', '101 Pine St, Seattle, WA', None],
    'id': ['ID-12345', 'id_67890', 'ID:11121', 'id/31415', 'ID-92653']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
             name                    email                        address        id
0      John Smith   John.Smith@example.com      123 Main St, New York, NY  ID-12345
1        jane DOE     JANE.DOE@EXAMPLE.COM          456 Elm St,Chicago,IL  id_67890
2   Robert Brown      robert.brown@example  789 Oak St, San Francisco, CA  ID:11121
3             NaN                     None       101 Pine St, Seattle, WA  id/31415
4  Mary J. Watson  mary.watson@example.com                           None  ID-92653

Basic String Operations

Trimming Whitespace

Let's start by removing extra whitespace from the 'name' column:

python
# Remove leading and trailing whitespace
df['name'] = df['name'].str.strip()

print("\nAfter removing whitespace:")
print(df['name'])

Output:

After removing whitespace:
0        John Smith
1          jane DOE
2      Robert Brown
3               NaN
4    Mary J. Watson
Name: name, dtype: object
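
If you only need to trim one side, the related .str.lstrip() and .str.rstrip() methods work the same way. A quick illustration (the 'name' column is already fully stripped at this point, so this is just to show the API):

python
# One-sided trimming: remove only leading or only trailing whitespace
leading_trimmed = df['name'].str.lstrip()    # strips spaces on the left only
trailing_trimmed = df['name'].str.rstrip()   # strips spaces on the right only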

Changing Case

Standardizing text case is a common cleaning operation:

python
# Convert names to title case
df['name'] = df['name'].str.title()

# Convert emails to lowercase
df['email'] = df['email'].str.lower()

print("\nAfter standardizing case:")
print(df[['name', 'email']])

Output:

After standardizing case:
             name                    email
0      John Smith   john.smith@example.com
1        Jane Doe     jane.doe@example.com
2    Robert Brown     robert.brown@example
3             NaN                     None
4  Mary J. Watson  mary.watson@example.com

Handling Missing Values

String operations in pandas will keep NaN values as NaN, but sometimes you may want to replace them with a placeholder:

python
# Fill NaN values with a placeholder
df['name'] = df['name'].fillna("Unknown")
df['email'] = df['email'].fillna("no-email")
df['address'] = df['address'].fillna("no-address")

print("\nAfter handling missing values:")
print(df)

Output:

After handling missing values:
             name                    email                        address        id
0      John Smith   john.smith@example.com      123 Main St, New York, NY  ID-12345
1        Jane Doe     jane.doe@example.com          456 Elm St,Chicago,IL  id_67890
2    Robert Brown     robert.brown@example  789 Oak St, San Francisco, CA  ID:11121
3         Unknown                 no-email       101 Pine St, Seattle, WA  id/31415
4  Mary J. Watson  mary.watson@example.com                     no-address  ID-92653
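
The same three fills can also be written as a single call by passing a dict of column-to-placeholder mappings to DataFrame.fillna() (an equivalent alternative to the column-by-column approach above):

python
# Equivalent one-liner: fill several columns at once with per-column placeholders
df = df.fillna({'name': 'Unknown', 'email': 'no-email', 'address': 'no-address'})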

String Replacement

The .str.replace() method allows you to substitute text patterns:

python
# Standardize ID format by replacing various separators with a dash
df['id'] = df['id'].str.replace('_', '-').str.replace(':', '-').str.replace('/', '-')

print("\nAfter standardizing ID format:")
print(df['id'])

Output:

After standardizing ID format:
0    ID-12345
1    id-67890
2    ID-11121
3    id-31415
4    ID-92653
Name: id, dtype: object
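
Chaining three separate .str.replace() calls works, but the same normalization can be written in a single pass with a regex character class (shown here as an alternative to the chain above):

python
# Alternative: replace any of '_', ':' or '/' with '-' in one regex pass
df['id'] = df['id'].str.replace(r'[_:/]', '-', regex=True)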

Working with Regular Expressions

For more complex string cleaning, pandas supports regular expressions:

python
# Standardize email format - ensure they all end with .com
df['email'] = df['email'].str.replace(r'@example$', '@example.com', regex=True)

print("\nAfter fixing email formats:")
print(df['email'])

Output:

After fixing email formats:
0      john.smith@example.com
1        jane.doe@example.com
2    robert.brown@example.com
3                    no-email
4     mary.watson@example.com
Name: email, dtype: object

Extracting Information

You can extract patterns from strings using regular expressions:

python
# Extract city from address
df['city'] = df['address'].str.extract(r',\s*([^,]+),')

print("\nExtracted cities:")
print(df['city'])

Output:

Extracted cities:
0         New York
1          Chicago
2    San Francisco
3          Seattle
4              NaN
Name: city, dtype: object
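
Because .str.extract() produces one column per capture group, you could also pull out the city and the two-letter state in a single call. A small sketch that assumes the "City, ST" pattern used in this sample data:

python
# Extract city and state together; each capture group becomes its own column
city_state = df['address'].str.extract(r',\s*([^,]+),\s*([A-Z]{2})$')
city_state.columns = ['city', 'state']
print(city_state)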

Formatting Strings

To create formatted strings that combine multiple columns:

python
# Create a formatted full ID
df['full_id'] = df['id'].str.upper() + ' (' + df['name'] + ')'

print("\nCreated formatted IDs:")
print(df['full_id'])

Output:

Created formatted IDs:
0        ID-12345 (John Smith)
1          ID-67890 (Jane Doe)
2      ID-11121 (Robert Brown)
3           ID-31415 (Unknown)
4    ID-92653 (Mary J. Watson)
Name: full_id, dtype: object
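
An alternative to + concatenation is .str.cat(), which also offers an na_rep argument for missing values. A small sketch with a different separator (not a replacement for the step above):

python
# Alternative concatenation with .str.cat(); na_rep would fill in for missing names
labels = df['id'].str.upper().str.cat(df['name'], sep=' - ', na_rep='Unknown')
print(labels)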

Advanced String Operations

String Methods Chaining

You can chain multiple string operations together:

python
# Standardize address format
df['address'] = (df['address']
    .str.replace(r',\s*', ', ', regex=True)  # Standardize commas
    .str.replace(r'\s+', ' ', regex=True)    # Remove extra spaces
    .str.strip())                            # Remove leading/trailing spaces

print("\nStandardized addresses:")
print(df['address'])

Output:

Standardized addresses:
0        123 Main St, New York, NY
1          456 Elm St, Chicago, IL
2    789 Oak St, San Francisco, CA
3         101 Pine St, Seattle, WA
4                       no-address
Name: address, dtype: object
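
Now that the separator is consistently ', ', you could also break each address into its parts with .str.split(). An illustrative sketch (the 'no-address' placeholder simply yields missing pieces):

python
# Split the standardized addresses on ', '; expand=True returns one column per piece
address_parts = df['address'].str.split(', ', expand=True)
print(address_parts)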

Testing String Content

You can test if strings contain specific patterns:

python
# Check if emails are valid (contain @ and .com)
df['valid_email'] = df['email'].str.contains(r'@.*\.com$', regex=True)

print("\nEmail validation:")
print(df[['email', 'valid_email']])

Output:

Email validation:
                        email  valid_email
0      john.smith@example.com         True
1        jane.doe@example.com         True
2    robert.brown@example.com         True
3                    no-email        False
4     mary.watson@example.com         True
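
The resulting boolean column works directly as a filter, for example to list the records that still need attention:

python
# Show only the rows whose email failed the validation check
print(df.loc[~df['valid_email'], ['name', 'email']])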

Real-World Example: Cleaning Customer Data

Let's demonstrate a comprehensive example of cleaning a customer dataset:

python
# Create a more realistic customer dataset with messy data
customer_data = {
    'customer_name': [' Maria Garcia ', 'JOHN SMITH', 'robert jones', np.nan, 'Lisa Wang'],
    'phone': ['(555) 123-4567', '555.987.6543', '555-111-2222', '5559993333', np.nan],
    'purchase_date': ['2023-01-15', '01/30/2023', '2023/02/14', 'March 15, 2023', '4/5/2023'],
    'purchase_amount': ['$1,234.56', '2,345.67', '$987', 'USD 345.00', '1500'],
    'notes': ['First time customer', 'Returns items frequently', np.nan, 'VIP client', 'Prefers email contact']
}

customers = pd.DataFrame(customer_data)

print("Original customer data:")
print(customers)

Output:

Original customer data:
    customer_name           phone   purchase_date  purchase_amount                     notes
0   Maria Garcia   (555) 123-4567      2023-01-15        $1,234.56       First time customer
1      JOHN SMITH    555.987.6543      01/30/2023         2,345.67  Returns items frequently
2    robert jones    555-111-2222      2023/02/14             $987                       NaN
3             NaN      5559993333  March 15, 2023       USD 345.00                VIP client
4       Lisa Wang             NaN        4/5/2023             1500     Prefers email contact

Now let's clean this data:

python
# Clean customer data
clean_customers = customers.copy()

# Clean names
clean_customers['customer_name'] = (clean_customers['customer_name']
    .str.strip()          # Remove whitespace
    .str.title()          # Standardize case
    .fillna('Unknown'))   # Fill missing values

# Clean phone numbers - standardize to XXX-XXX-XXXX format
clean_customers['phone'] = (clean_customers['phone']
    .str.replace(r'[^\d]', '', regex=True)  # Remove non-digits
    .apply(lambda x: f"{x[:3]}-{x[3:6]}-{x[6:]}"
           if pd.notnull(x) and len(x) >= 10 else x)  # Format 10-digit numbers, leave NaN alone
    .fillna('Unknown'))  # Missing phone numbers become 'Unknown'

# Standardize purchase dates to YYYY-MM-DD
def standardize_date(date_str):
    if pd.isnull(date_str):
        return np.nan

    # Dates that already start with the year, e.g. 2023-01-15 or 2023/02/14
    if date_str.startswith('20') and (date_str.count('-') == 2 or date_str.count('/') == 2):
        return date_str.replace('/', '-')
    # Month/day/year dates, e.g. 01/30/2023 or 4/5/2023
    elif date_str.count('/') == 2:
        month, day, year = date_str.split('/')
        return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    # Handle the text date in this sample
    elif 'March 15' in date_str:
        return '2023-03-15'
    else:
        return 'Invalid date'

clean_customers['purchase_date'] = clean_customers['purchase_date'].apply(standardize_date)

# Clean purchase amounts to numeric values
clean_customers['purchase_amount'] = (clean_customers['purchase_amount']
    .str.replace(r'[$,USD\s]', '', regex=True)  # Strip '$', commas, 'USD' letters and spaces
    .astype(float))                             # Convert to float

# Clean notes
clean_customers['notes'] = (clean_customers['notes']
    .str.replace(r'\s+', ' ', regex=True)  # Replace multiple spaces with single space
    .str.strip()                           # Remove leading/trailing spaces
    .fillna('No notes'))                   # Fill missing values

print("\nCleaned customer data:")
print(clean_customers)

Output:

Cleaned customer data:
   customer_name         phone  purchase_date  purchase_amount                     notes
0   Maria Garcia  555-123-4567     2023-01-15          1234.56       First time customer
1     John Smith  555-987-6543     2023-01-30          2345.67  Returns items frequently
2   Robert Jones  555-111-2222     2023-02-14           987.00                  No notes
3        Unknown  555-999-3333     2023-03-15           345.00                VIP client
4      Lisa Wang       Unknown     2023-04-05          1500.00     Prefers email contact
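
As a final check, you might confirm that the amount column is now numeric and can be aggregated (the values in the comments assume the sample data above):

python
# purchase_amount is now a float column, so numeric operations work
print(clean_customers['purchase_amount'].dtype)            # float64
print(round(clean_customers['purchase_amount'].sum(), 2))  # 6412.23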

Summary

In this guide, you've learned how to:

  1. Use pandas' string methods to clean and manipulate text data
  2. Remove whitespace with .str.strip()
  3. Standardize case with .str.upper(), .str.lower(), and .str.title()
  4. Replace text patterns using .str.replace()
  5. Work with regular expressions for complex pattern matching
  6. Extract information from strings with .str.extract()
  7. Combine string operations through method chaining
  8. Handle missing values in string data
  9. Apply these techniques in real-world data cleaning scenarios

String cleaning is a vital part of the data preparation process. With pandas' powerful string methods, you can transform messy text data into clean, consistent values ready for analysis.

Exercises

  1. Basic Exercise: Create a DataFrame with names that have inconsistent capitalization and extra spaces. Clean the names to have proper title case and no extra spaces.

  2. Intermediate Exercise: Extract usernames from email addresses in a DataFrame (the part before the @ symbol).

  3. Advanced Exercise: Clean a DataFrame containing addresses with inconsistent formats. Standardize them and extract street names, cities, and zip codes into separate columns.

Happy data cleaning!


