
Pandas String Methods

Introduction

When working with data in the real world, you'll frequently encounter text that needs to be cleaned, standardized, or mined for specific patterns. Pandas provides powerful string manipulation capabilities through the .str accessor, which exposes vectorized string operations on Series and Index objects containing string values.

In this tutorial, we'll explore how to apply string operations to pandas Series objects, manipulate text data efficiently, and solve common string processing challenges.

Understanding the String Accessor

The .str accessor in pandas lets you apply vectorized equivalents of Python's string methods to every element in a Series at once. This is much more efficient than looping over the values or applying a Python function element by element.

Let's start with a simple example:

python
import pandas as pd

# Create a Series with string data
names = pd.Series(['Alice', 'Bob', 'Charlie', 'David', 'EMMA'])

# Convert all strings to lowercase
lowercase_names = names.str.lower()

print("Original names:")
print(names)
print("\nLowercase names:")
print(lowercase_names)

Output:

Original names:
0 Alice
1 Bob
2 Charlie
3 David
4 EMMA
dtype: object

Lowercase names:
0 alice
1 bob
2 charlie
3 david
4 emma
dtype: object
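
The accessor is not limited to Series: Index objects expose the same .str methods, which is handy for cleaning up DataFrame column labels. Here is a minimal sketch using a small made-up DataFrame:

python
import pandas as pd

# A tiny example DataFrame with messy column labels
df = pd.DataFrame({' First Name ': ['Alice'], 'LAST NAME': ['Smith']})

# Index objects expose the same .str accessor as Series
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(df.columns.tolist())  # ['first_name', 'last_name']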

Common String Methods

Let's explore some of the most commonly used string methods in pandas:

Case Conversion

python
import pandas as pd

text_data = pd.Series(['PYTHON', 'pandas', 'Data Science', 'mAcHiNe LeArNiNg'])

print("Original data:")
print(text_data)
print("\nLowercase:")
print(text_data.str.lower())
print("\nUppercase:")
print(text_data.str.upper())
print("\nTitle case:")
print(text_data.str.title())
print("\nSwapcase:")
print(text_data.str.swapcase())

Output:

Original data:
0 PYTHON
1 pandas
2 Data Science
3 mAcHiNe LeArNiNg
dtype: object

Lowercase:
0 python
1 pandas
2 data science
3 machine learning
dtype: object

Uppercase:
0 PYTHON
1 PANDAS
2 DATA SCIENCE
3 MACHINE LEARNING
dtype: object

Title case:
0 Python
1 Pandas
2 Data Science
3 Machine Learning
dtype: object

Swapcase:
0 python
1 PANDAS
2 dATA sCIENCE
3 MaChInE lEaRnInG
dtype: object
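
Two related methods worth knowing are .str.capitalize(), which uppercases only the first character of each string and lowercases the rest, and .str.casefold(), a more aggressive form of lowercasing intended for caseless comparisons. A quick sketch:

python
import pandas as pd

text_data = pd.Series(['PYTHON', 'pandas', 'Data Science', 'mAcHiNe LeArNiNg'])

print(text_data.str.capitalize())  # 'Python', 'Pandas', 'Data science', 'Machine learning'
print(text_data.str.casefold())    # like lower(), but also folds special characters such as 'ß' -> 'ss'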

String Length and Padding

python
import pandas as pd

codes = pd.Series(['ABC123', 'DEF', 'GHI456789', 'JK'])

print("Original codes:")
print(codes)
print("\nString lengths:")
print(codes.str.len())
print("\nPadded with zeros (width 8):")
print(codes.str.pad(width=8, side='left', fillchar='0'))

Output:

Original codes:
0 ABC123
1 DEF
2 GHI456789
3 JK
dtype: object

String lengths:
0 6
1 3
2 9
3 2
dtype: int64

Padded with zeros (width 8):
0 00ABC123
1 00000DEF
2 GHI456789
3 000000JK
dtype: object
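
For the common case of left-padding with zeros, .str.zfill() is a convenient shortcut that behaves like Python's built-in str.zfill():

python
import pandas as pd

codes = pd.Series(['ABC123', 'DEF', 'GHI456789', 'JK'])

# For these values, equivalent to pad(width=8, side='left', fillchar='0')
print(codes.str.zfill(8))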

String Manipulation

python
import pandas as pd

phrases = pd.Series([
    " Hello, World! ",
    "Python_Programming",
    "data-science-is-fun",
    "machine learning"
])

print("Original phrases:")
print(phrases)
print("\nStripped whitespace:")
print(phrases.str.strip())
print("\nReplace underscores with spaces:")
print(phrases.str.replace('_', ' '))
print("\nSplit by delimiter:")
print(phrases.str.split('-').tolist())

Output:

Original phrases:
0 Hello, World!
1 Python_Programming
2 data-science-is-fun
3 machine learning
dtype: object

Stripped whitespace:
0 Hello, World!
1 Python_Programming
2 data-science-is-fun
3 machine learning
dtype: object

Replace underscores with spaces:
0 Hello, World!
1 Python Programming
2 data-science-is-fun
3 machine learning
dtype: object

Split by delimiter:
[[' Hello, World! '], ['Python_Programming'], ['data', 'science', 'is', 'fun'], ['machine learning']]
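
strip() also has one-sided variants, and all of them accept a custom set of characters to remove instead of whitespace. A small sketch:

python
import pandas as pd

phrases = pd.Series([" Hello, World! ", "Python_Programming", "data-science-is-fun"])

print(phrases.str.lstrip())      # remove leading whitespace only
print(phrases.str.rstrip())      # remove trailing whitespace only
print(phrases.str.strip(' !'))   # strip spaces and '!' from both ends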

Pattern Matching and Extraction

One of the most powerful features of pandas string methods is the ability to match and extract patterns using regular expressions.

Checking if strings contain a pattern

python
import pandas as pd

emails = pd.Series([
    '[email protected]',
    '[email protected]',
    'not-an-email',
    '[email protected]'
])

print("Contains '@':")
print(emails.str.contains('@'))
print("\nMatches email pattern:")
print(emails.str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'))

Output:

Contains '@':
0 True
1 True
2 False
3 True
dtype: bool

Matches email pattern:
0 True
1 True
2 False
3 True
dtype: bool
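
Besides contains() and match(), a few related checks are often useful: startswith() and endswith() test literal prefixes and suffixes (no regex), and fullmatch() (available in pandas 1.1 and later) requires the pattern to cover the entire string. A sketch with made-up order codes:

python
import pandas as pd

codes = pd.Series(['INV-001', 'INV-002', 'ORD-003', 'inv-004'])

print(codes.str.startswith('INV'))        # literal prefix test: True, True, False, False
print(codes.str.endswith('003'))          # literal suffix test: False, False, True, False
print(codes.str.fullmatch(r'INV-\d{3}'))  # whole string must match: True, True, False, False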

Extracting patterns

python
import pandas as pd

text = pd.Series([
    "Product: ABC123 - Price: $19.99",
    "Product: XYZ789 - Price: $549.50",
    "Product: DEF456 - Price: $99.99",
    "No product code or price"
])

# Extract product codes
product_codes = text.str.extract(r'Product: ([A-Z0-9]+)')
print("Extracted product codes:")
print(product_codes)

# Extract prices
prices = text.str.extract(r'Price: \$([0-9.]+)')
print("\nExtracted prices:")
print(prices)

# Convert to numeric
prices_numeric = prices[0].astype(float)
print("\nPrices as numeric values:")
print(prices_numeric)

Output:

Extracted product codes:
        0
0  ABC123
1  XYZ789
2  DEF456
3     NaN

Extracted prices:
        0
0   19.99
1  549.50
2   99.99
3     NaN

Prices as numeric values:
0 19.99
1 549.50
2 99.99
3 NaN
Name: 0, dtype: float64
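
extract() also understands named groups of the form (?P<name>...), which become the column names of the returned DataFrame. The example above could be rewritten as a single extraction:

python
import pandas as pd

text = pd.Series([
    "Product: ABC123 - Price: $19.99",
    "Product: XYZ789 - Price: $549.50",
    "No product code or price"
])

# Named groups become column names in the result
details = text.str.extract(r'Product: (?P<code>[A-Z0-9]+) - Price: \$(?P<price>[0-9.]+)')
details['price'] = details['price'].astype(float)
print(details)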

Real-world Applications

Let's explore some practical examples of using string methods to solve common data cleaning tasks:

Example 1: Cleaning Product Names and Standardizing Formats

python
import pandas as pd
import re

# Sample product data with inconsistent formatting
products = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'product_name': [
        ' LAPTOP - HP 15.6" (silver) ',
        'smartphone-samsung galaxy s21',
        'WIRELESS_HEADPHONES Sony',
        'tablet - APPLE ipad PRO 12.9"',
        ' keyboard logitech k380 '
    ]
})

print("Original product names:")
print(products['product_name'])

# Clean and standardize product names
products['clean_name'] = (products['product_name']
    .str.strip()                                  # Remove leading/trailing spaces
    .str.replace('_', ' ')                        # Replace underscores with spaces
    .str.replace(r'\s*-\s*', ' - ', regex=True)   # Standardize spacing around hyphens
    .str.title())                                 # Convert to title case

print("\nCleaned product names:")
print(products['clean_name'])

# Extract brand names by matching against a known list of brands
products['brand'] = (products['clean_name']
    .str.extract(r'(HP|Samsung|Sony|Apple|Logitech)', flags=re.IGNORECASE)[0]
    .str.title())

print("\nExtracted brands:")
print(products['brand'])

Output:

Original product names:
0 LAPTOP - HP 15.6" (silver)
1 smartphone-samsung galaxy s21
2 WIRELESS_HEADPHONES Sony
3 tablet - APPLE ipad PRO 12.9"
4 keyboard logitech k380
dtype: object

Cleaned product names:
0 Laptop - Hp 15.6" (Silver)
1 Smartphone - Samsung Galaxy S21
2 Wireless Headphones Sony
3 Tablet - Apple Ipad Pro 12.9"
4 Keyboard Logitech K380
dtype: object

Extracted brands:
0 Hp
1 Samsung
2 Sony
3 Apple
4 Logitech
dtype: object
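
Note that .str.title() turns "HP" into "Hp". If the brands need canonical casing, one option (a sketch continuing the code above, with a hypothetical lookup dictionary) is to normalize the extracted value and map it through a dict:

python
# Hypothetical lookup table for canonical brand casing
brand_map = {'HP': 'HP', 'SAMSUNG': 'Samsung', 'SONY': 'Sony',
             'APPLE': 'Apple', 'LOGITECH': 'Logitech'}

products['brand'] = (products['clean_name']
    .str.extract(r'(HP|Samsung|Sony|Apple|Logitech)', flags=re.IGNORECASE)[0]
    .str.upper()        # normalize to uppercase keys
    .map(brand_map))    # look up the canonical spelling

print(products['brand'])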

Example 2: Parsing and Cleaning Address Data

python
import pandas as pd

# Sample address data
addresses = pd.DataFrame({
    'raw_address': [
        '123 MAIN ST, APT 4B, NEW YORK, NY 10001',
        '456 oak avenue, apt. 2c, chicago il, 60611',
        '789 pine rd seattle wa 98101',
        '321 CEDAR BLVD., SAN FRANCISCO, CA, 94107'
    ]
})

print("Original addresses:")
print(addresses['raw_address'])

# Standardize address formatting
addresses['clean_address'] = addresses['raw_address'].str.title()
print("\nStandardized addresses:")
print(addresses['clean_address'])

# Extract zip codes
addresses['zip_code'] = addresses['raw_address'].str.extract(r'(\d{5})')
print("\nExtracted ZIP codes:")
print(addresses['zip_code'])

# Extract city and state
addresses[['city', 'state']] = addresses['raw_address'].str.extract(
    r',\s*([A-Za-z][A-Za-z\s]*?)[,\s]+([A-Za-z]{2}),?\s+\d{5}$'
)
print("\nExtracted city and state:")
print(addresses[['city', 'state']])

Output:

Original addresses:
0 123 MAIN ST, APT 4B, NEW YORK, NY 10001
1 456 oak avenue, apt. 2c, chicago il, 60611
2 789 pine rd seattle wa 98101
3 321 CEDAR BLVD., SAN FRANCISCO, CA, 94107
dtype: object

Standardized addresses:
0 123 Main St, Apt 4B, New York, Ny 10001
1 456 Oak Avenue, Apt. 2C, Chicago Il, 60611
2 789 Pine Rd Seattle Wa 98101
3 321 Cedar Blvd., San Francisco, Ca, 94107
dtype: object

Extracted ZIP codes:
0    10001
1    60611
2    98101
3    94107
Name: zip_code, dtype: object

Extracted city and state:
            city state
0       NEW YORK    NY
1        chicago    il
2            NaN   NaN
3  SAN FRANCISCO    CA
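
Because the raw data mixes upper and lower case, a natural follow-up (continuing the example above) is to normalize the extracted pieces, for example title-casing the city and upper-casing the state:

python
# Normalize the casing of the extracted columns; NaN rows stay NaN
addresses['city'] = addresses['city'].str.title()
addresses['state'] = addresses['state'].str.upper()

print(addresses[['city', 'state']])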

Advanced String Methods

Pandas string methods go beyond basic operations. Here are some advanced features:

Find and Replace with Regular Expressions

python
import pandas as pd

text = pd.Series([
    "Contact us at [email protected] for help.",
    "Email [email protected] with questions.",
    "Visit our website at https://www.example.com",
    "Call us at 123-456-7890 or 987.654.3210"
])

# Replace email addresses with [EMAIL REDACTED]
redacted = text.str.replace(r'[\w\.-]+@[\w\.-]+\.\w+', '[EMAIL REDACTED]', regex=True)
print("After redacting emails:")
print(redacted)

# Extract all phone numbers
phone_numbers = text.str.extractall(r'(\d{3}[-\.]\d{3}[-\.]\d{4})')
print("\nExtracted phone numbers:")
print(phone_numbers)

Output:

After redacting emails:
0 Contact us at [EMAIL REDACTED] for help.
1 Email [EMAIL REDACTED] with questions.
2 Visit our website at https://www.example.com
3 Call us at 123-456-7890 or 987.654.3210
dtype: object

Extracted phone numbers:
                  0
  match
3 0    123-456-7890
  1    987.654.3210
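
A lighter-weight alternative to extractall() is findall(), which returns a list of matches for each element instead of a MultiIndexed DataFrame. Continuing with the text Series above:

python
# findall() returns a (possibly empty) list of matches per row
print(text.str.findall(r'\d{3}[-.]\d{3}[-.]\d{4}'))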

Working with Categorical String Data

python
import pandas as pd

# Sample dataset with categorical strings
data = pd.DataFrame({
    'item_category': [
        'Electronics > Computers > Laptops',
        'Clothing > Men > Shirts',
        'Electronics > Audio > Headphones',
        'Home > Kitchen > Appliances',
        'Clothing > Women > Dresses'
    ]
})

# Extract the main category (before first '>')
data['main_category'] = data['item_category'].str.split(' > ').str[0]

# Extract the subcategory (second level)
data['subcategory'] = data['item_category'].str.split(' > ').str[1]

# Extract the product type (third level)
data['product_type'] = data['item_category'].str.split(' > ').str[2]

print(data)

# Count occurrences of each main category
category_counts = data['main_category'].value_counts()
print("\nCategory counts:")
print(category_counts)

Output:

                       item_category main_category subcategory product_type
0  Electronics > Computers > Laptops   Electronics   Computers      Laptops
1            Clothing > Men > Shirts      Clothing         Men       Shirts
2   Electronics > Audio > Headphones   Electronics       Audio   Headphones
3        Home > Kitchen > Appliances          Home     Kitchen   Appliances
4         Clothing > Women > Dresses      Clothing       Women      Dresses

Category counts:
Electronics 2
Clothing 2
Home 1
Name: main_category, dtype: int64
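
Calling split() three times re-splits the same strings. With expand=True, a single call returns a DataFrame whose columns can be renamed and assigned in one step. A sketch continuing the example above:

python
# One split with expand=True produces all three levels at once
levels = data['item_category'].str.split(' > ', expand=True)
levels.columns = ['main_category', 'subcategory', 'product_type']

print(levels)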

Handling Missing String Values

String methods in pandas propagate missing values rather than raising errors, which differs from calling Python string methods on None or NaN directly. It's important to understand how to deal with them properly:

python
import pandas as pd
import numpy as np

# Create a Series with some missing values
text_with_missing = pd.Series(['Python', np.nan, 'Data Science', None, 'Pandas'])

print("Series with missing values:")
print(text_with_missing)

# String methods propagate missing values: NaN in, NaN out
print("\nUppercase (NaNs stay as NaN):")
print(text_with_missing.str.upper())

# Fill NaN values before applying string methods
filled_text = text_with_missing.fillna('Unknown')
print("\nAfter filling NaN values and converting to uppercase:")
print(filled_text.str.upper())

# Identify which values are missing with isnull()
print("\nIs null check:")
print(text_with_missing.isnull())

Output:

Series with missing values:
0 Python
1 NaN
2 Data Science
3 None
4 Pandas
dtype: object

Uppercase (NaNs stay as NaN):
0 PYTHON
1 NaN
2 DATA SCIENCE
3 NaN
4 PANDAS
dtype: object

After filling NaN values and converting to uppercase:
0 PYTHON
1 UNKNOWN
2 DATA SCIENCE
3 UNKNOWN
4 PANDAS
dtype: object

Is null check:
0 False
1 True
2 False
3 True
4 False
dtype: bool
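
Pandas also offers a dedicated nullable string dtype ("string"), in which missing values are represented as pd.NA instead of np.nan and the column is guaranteed to contain only strings. A minimal sketch:

python
import pandas as pd

# The nullable 'string' dtype uses pd.NA for missing values
text_string = pd.Series(['Python', None, 'Data Science'], dtype='string')

print(text_string)
print(text_string.str.upper())  # missing entries stay <NA>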

Performance Considerations

String operations can be computationally expensive on large datasets. Here are some tips for improving performance:

  1. Pre-compile regular expressions when the same pattern is used repeatedly
  2. Use vectorized operations through the .str accessor instead of applying Python functions (benchmarked below)
  3. Consider the categorical dtype for string columns with many repeated values (see the sketch after the benchmark)
  4. Use appropriate data types - don't store numeric data as strings

python
import pandas as pd
import re
import time

# Create a large Series with string data
large_series = pd.Series(['text_' + str(i) for i in range(100000)])

# Method 1: Using a lambda function (slower)
start_time = time.time()
result1 = large_series.apply(lambda x: x.replace('text', 'item'))
method1_time = time.time() - start_time

# Method 2: Using vectorized .str accessor (faster)
start_time = time.time()
result2 = large_series.str.replace('text', 'item')
method2_time = time.time() - start_time

print(f"Lambda function time: {method1_time:.4f} seconds")
print(f"Vectorized .str time: {method2_time:.4f} seconds")
print(f"Vectorized operations are {method1_time/method2_time:.1f}x faster")

Output:

Lambda function time: 0.1872 seconds
Vectorized .str time: 0.0423 seconds
Vectorized operations are 4.4x faster
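
Tip 3 above can pay off when a string column has only a few distinct values: converting it to the categorical dtype stores each distinct string once and replaces the column with compact integer codes. A rough sketch of the memory comparison, using a made-up status column:

python
import pandas as pd

# A column with many rows but only three distinct string values
statuses = pd.Series(['active', 'inactive', 'pending'] * 100_000)

as_object = statuses.memory_usage(deep=True)
as_category = statuses.astype('category').memory_usage(deep=True)

print(f"object dtype:   {as_object:,} bytes")
print(f"category dtype: {as_category:,} bytes")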

Summary

In this tutorial, we've explored pandas string methods, which provide powerful capabilities for manipulating text data in DataFrames and Series. Here's what we covered:

  • Using the .str accessor to perform vectorized string operations
  • Common string transformations like case conversion, padding, and whitespace removal
  • Pattern matching and extraction using regular expressions
  • Real-world applications including cleaning product names and parsing addresses
  • Advanced string methods for working with categorical data and complex patterns
  • Handling missing values in string operations
  • Performance considerations for efficient string operations

Pandas string methods make working with textual data much more efficient and expressive, allowing you to clean, standardize, and extract information from text without resorting to slow loops or complicated functions.

Exercises

  1. Create a DataFrame with a column containing email addresses and use string methods to:

    • Extract the username (part before @)
    • Extract the domain (part after @)
    • Check if the domain is gmail.com
  2. You have a Series of full names (e.g., "John Smith", "Jane Doe"). Write code to:

    • Convert each name to "Last, First" format
    • Extract just the first initial and last name
  3. Given a Series of messy product descriptions, clean the data by:

    • Removing any special characters
    • Standardizing spacing
    • Converting to title case

