Python Regular Expressions

Introduction

Regular expressions (regex or regexp) are powerful sequences of characters that define search patterns. In Python, regular expressions allow you to search for patterns in text, validate input, extract information, and perform complex string manipulations that would be difficult with standard string methods.

The re module in Python provides operations for working with regular expressions. This module enables you to check if a particular string matches a given pattern, extract parts of a string that match your criteria, or replace text that matches a pattern.

Why Learn Regular Expressions?

Powerful text processing: Accomplish complex pattern matching in just a few lines of code
Data validation: Verify if inputs match expected formats (email addresses, phone numbers, etc.)
Data extraction: Pull specific information from larger text bodies
Text transformation: Replace or reformat text based on patterns

Getting Started with Regular Expressions

To use regular expressions in Python, you'll need to import the re module first:

import re

Basic Pattern Matching

Let's start with a simple example - checking if a string contains a specific pattern:

import re

text = "Python is awesome"
pattern = "Python"

result = re.search(pattern, text)
if result:
    print("Pattern found!")
else:
    print("Pattern not found.")

Output:

Pattern found!

The search() function looks for the first occurrence of the pattern in the text and returns a match object if found, or None if not found.
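The match object holds more than a yes/no answer. As a minimal sketch that reuses the sample text above (the variable names are illustrative), you can read back the matched substring and its position:

```python
import re

text = "Python is awesome"
match = re.search("awesome", text)

if match:
    matched_text = match.group()   # the substring that matched
    start, end = match.span()      # start index (inclusive), end index (exclusive)
    print(matched_text, start, end)
```

Slicing the original string with `text[start:end]` gives back the same substring as `match.group()`.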

Core Regular Expression Functions

Python's re module provides several key functions for working with patterns:

1. `re.search(pattern, string)`

Searches for the first occurrence of the pattern in the string.

import re

text = "Python was created in 1991 by Guido van Rossum"
result = re.search(r"created in (\d+)", text)

if result:
    print(f"Python was created in {result.group(1)}")

Output:

Python was created in 1991

2. `re.match(pattern, string)`

Checks if the pattern matches at the beginning of the string.

import re

# This will match
text1 = "Python is great"
result1 = re.match(r"Python", text1)
print(result1 is not None)  # True

# This won't match
text2 = "I love Python"
result2 = re.match(r"Python", text2)
print(result2 is not None)  # False

Output:

True
False
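It can help to see match() next to its relatives. The sketch below uses a made-up sample string and compares match() with search() and with re.fullmatch(), which requires the pattern to cover the entire string:

```python
import re

text = "Python 3"

anchored = re.match(r"Python", text)          # matches: the pattern fits at position 0
whole = re.fullmatch(r"Python", text)         # None: " 3" is left over
whole_ok = re.fullmatch(r"Python \d", text)   # matches: the pattern covers the whole string
anywhere = re.search(r"\d", text)             # matches: search() scans the entire string

print(anchored is not None, whole is not None, whole_ok is not None, anywhere.group())
```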

3. `re.findall(pattern, string)`

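Returns all non-overlapping matches of the pattern in the string as a list. If the pattern contains capturing groups, the list holds the group contents (a tuple per match when there are several groups) rather than the whole match.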

import re

text = "Contact us at [email protected] or [email protected]"
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)

Output:

['[email protected]', '[email protected]']
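The shape of the list returned by findall() depends on whether the pattern contains capturing groups. Here is a small sketch on an invented key=value string, with re.finditer() shown as the alternative when you also need positions:

```python
import re

text = "width=80 height=24"

# No groups: a list of whole matches
whole = re.findall(r"\w+=\d+", text)

# Two groups: a list of (key, value) tuples
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer() yields match objects, which also expose positions
positions = [(m.group(1), m.start()) for m in re.finditer(r"(\w+)=(\d+)", text)]

print(whole)
print(pairs)
print(positions)
```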

4. `re.sub(pattern, replacement, string)`

Replaces every non-overlapping occurrence of the pattern in the string with the replacement text and returns the new string.

import re

text = "Python was created in 1991"
result = re.sub(r"\d+", "YEAR", text)
print(result)

Output:

Python was created in YEAR
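The replacement does not have to be a fixed string. The sketch below, on an invented sample string, shows three options that re.sub() supports: backreferences to captured groups, the count argument, and a replacement function (next_year is a hypothetical helper written for this example):

```python
import re

text = "2023-03-15 and 2024-01-02"

# Backreferences (\1, \2, \3) reuse captured groups in the replacement
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", text)

# count limits how many replacements are made
first_only = re.sub(r"\d{4}", "YEAR", text, count=1)

# A function receives each match object and returns the replacement text
def next_year(match):
    return str(int(match.group()) + 1)

bumped = re.sub(r"\b\d{4}\b", next_year, text)

print(swapped)
print(first_only)
print(bumped)
```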

Special Characters in Regular Expressions

Regular expressions use special characters to represent patterns:

Character	Description	Example
`.`	Any character except newline	`a.b` matches "acb", "adb", etc.
`^`	Start of string	`^Python` matches "Python" at the beginning
`$`	End of string	`Python$` matches "Python" at the end
`*`	0 or more occurrences	`ab*c` matches "ac", "abc", "abbc", etc.
`+`	1 or more occurrences	`ab+c` matches "abc", "abbc", but not "ac"
`?`	0 or 1 occurrence	`ab?c` matches "ac" and "abc"
`\d`	Any digit	`\d+` matches one or more digits
`\w`	Any word character (letter, digit, or underscore)	`\w+` matches one or more letters, digits, or underscores
`\s`	Any whitespace	`\s+` matches one or more spaces, tabs, newlines
`[]`	Character class	`[abc]` matches "a", "b", or "c"
`()`	Groups patterns	`(ab)+` matches "ab", "abab", etc.
`\b`	Word boundary	`\bword\b` matches the whole word "word"

Let's see some examples:

import re

# Match any email pattern
text = "Contact me at [email protected]"
email = re.search(r"\b[\w.]+@[\w.]+\.\w+\b", text)
print(email.group())  # [email protected]

# Find all words starting with 'p'
text = "Python programming is powerful and practical"
p_words = re.findall(r"\bp\w+", text, re.IGNORECASE)
print(p_words)  # ['Python', 'programming', 'powerful', 'practical']

# Extract all dates in MM/DD/YYYY format
text = "Event dates: 12/25/2023, 01/15/2024, and 02/28/2024."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)
print(dates)  # ['12/25/2023', '01/15/2024', '02/28/2024']

Character Classes and Quantifiers

Character Classes

Character classes allow you to match any one character from a set:

import re

# Match any vowel
text = "Python"
vowels = re.findall(r"[aeiou]", text)
print(vowels)  # ['o']

# Match any digit
text = "Python was created in 1991"
digits = re.findall(r"[0-9]", text)
print(digits)  # ['1', '9', '9', '1']

# Match characters within a range
text = "The ZIP code is 90210"
result = re.search(r"[0-9]{5}", text)
print(result.group())  # 90210
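Character classes can also be negated or built from several ranges. A brief sketch on invented sample strings:

```python
import re

text = "Order #A-1042, shipped 2024!"

# A leading ^ inside the brackets negates the class:
# here, anything that is not a letter, digit, or whitespace
non_alnum = re.findall(r"[^A-Za-z0-9\s]", text)

# Several ranges can be combined in one class
hex_like = re.findall(r"[0-9A-Fa-f]+", "ff 7B zz 10")

print(non_alnum)
print(hex_like)
```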

Quantifiers

Quantifiers specify how many times a character or group should occur:

import re

# Match 0 or more 'o's
text = "Gooooooal!"
result = re.search(r"Go*al", text)
print(result.group())  # Gooooooal

# Match exactly 3 digits
text = "The area code is 415."
result = re.search(r"\d{3}", text)
print(result.group())  # 415

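# Match between 2 and 4 consecutive occurrences of 'ha'
# (?:...) is a non-capturing group, so findall() returns the whole match;
# with a capturing group (ha) it would return only the last repetition: ['ha']
text = "hahahahaha"
result = re.findall(r"(?:ha){2,4}", text)
print(result)  # ['hahahaha']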

Grouping and Capturing

Parentheses () are used to group parts of a regular expression and capture matched text:

import re

# Extract information from text
text = "John Smith was born on 1990-05-15"
result = re.search(r"(\w+) (\w+) was born on (\d{4})-(\d{2})-(\d{2})", text)

if result:
    print(f"Full name: {result.group(1)} {result.group(2)}")
    print(f"Birth year: {result.group(3)}")
    print(f"Birth month: {result.group(4)}")
    print(f"Birth day: {result.group(5)}")

Output:

Full name: John Smith
Birth year: 1990
Birth month: 05
Birth day: 15

Named Groups

For more readability, you can use named groups with (?P<name>pattern) syntax:

import re

text = "John Smith was born on 1990-05-15"
pattern = r"(?P<first>\w+) (?P<last>\w+) was born on (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
result = re.search(pattern, text)

if result:
    print(f"First name: {result.group('first')}")
    print(f"Last name: {result.group('last')}")
    print(f"Birth date: {result.group('month')}/{result.group('day')}/{result.group('year')}")

Output:

First name: John
Last name: Smith
Birth date: 05/15/1990
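Named groups can also be collected in one step. As a sketch using a shortened version of the pattern above, groupdict() returns all named groups as a dictionary, and the groups remain reachable by number or by subscript:

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "John Smith was born on 1990-05-15")

if match:
    parts = match.groupdict()   # all named groups as a dict
    by_index = match.group(1)   # named groups are still numbered
    by_key = match["month"]     # subscript access to a group
    print(parts, by_index, by_key)
```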

Real-World Applications

Example 1: Email Validation

import re

def validate_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, email))

# Test the function
emails = ["[email protected]", "invalid.email@", "[email protected]", "not-an-email"]

for email in emails:
    if validate_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is NOT a valid email address")

Output:

[email protected] is a valid email address
invalid.email@ is NOT a valid email address
[email protected] is a valid email address
not-an-email is NOT a valid email address
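One edge case is worth checking: without re.MULTILINE, `$` also matches just before a trailing newline, so the pattern above accepts an address followed by "\n". The sketch below shows a stricter variant based on re.fullmatch(); the function name validate_email_strict and the address user@example.com are hypothetical and used only for illustration:

```python
import re

PATTERN = r"[\w.-]+@[\w.-]+\.\w+"

def validate_email_strict(email):
    # fullmatch() requires the pattern to cover the entire string
    return re.fullmatch(PATTERN, email) is not None

with_newline = "user@example.com\n"

# $ matches before the final newline, so the anchored pattern still accepts this
loose = bool(re.match(r"^[\w\.-]+@[\w\.-]+\.\w+$", with_newline))
strict = validate_email_strict(with_newline)

print(loose, strict)
```

Like the original pattern, this is only a rough format check and does not prove that an address exists.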

Example 2: Extracting Information from a Log File

import re

log = """
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
2023-03-15 14:30:45 INFO User logged out: alice
"""

# Extract all timestamps and messages
pattern = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)"
matches = re.findall(pattern, log)

for timestamp, level, message in matches:
    print(f"[{timestamp}] {level}: {message}")

# Extract only ERROR entries
errors = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ERROR (.+)", log)
print("\nErrors found:")
for error in errors:
    print(f"- {error}")

Output:

[2023-03-15 14:23:45] INFO: User logged in: alice
[2023-03-15 14:25:16] ERROR: Failed to connect to database
[2023-03-15 14:26:02] WARNING: Disk usage above 80%
[2023-03-15 14:30:45] INFO: User logged out: alice

Errors found:
- Failed to connect to database
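For larger logs, tuples from findall() can get hard to read. A possible alternative, sketched here on the same log lines, combines named groups, re.MULTILINE anchors, and re.finditer(); the names line_re, entries, and level_counts are illustrative:

```python
import re
from collections import Counter

log = """\
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
2023-03-15 14:30:45 INFO User logged out: alice
"""

# With re.MULTILINE, ^ and $ anchor the pattern to each line of the log
line_re = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.+)$",
    re.MULTILINE,
)

# One dictionary per log line, keyed by group name
entries = [m.groupdict() for m in line_re.finditer(log)]

# Count how many entries exist per log level
level_counts = Counter(entry["level"] for entry in entries)
print(level_counts)
```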

Example 3: Parsing HTML (Basic)

import re

html = """
<div class="content">
    <h1>Welcome to Python</h1>
    <p>Learn Python <a href="https://python.org">here</a>.</p>
    <p>Or check out <a href="https://docs.python.org">the documentation</a>.</p>
</div>
"""

# Extract all links
links = re.findall(r'<a href="([^"]+)"', html)
print("Links found:", links)

# Extract headers
headers = re.findall(r'<h(\d)>([^<]+)</h\1>', html)
for level, text in headers:
    print(f"Header (level {level}): {text}")

Output:

Links found: ['https://python.org', 'https://docs.python.org']
Header (level 1): Welcome to Python

Compilation Flags

Python's re module provides flags that modify how patterns are interpreted:

import re

text = "Python is amazing\nPYTHON is powerful"

# Case-insensitive matching
results1 = re.findall(r"python", text, re.IGNORECASE)
print(results1)  # ['Python', 'PYTHON']

# Multi-line mode: ^ and $ match at line breaks
results2 = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)
print(results2)  # ['Python', 'PYTHON']

Common flags include:

re.IGNORECASE: Case-insensitive matching
re.MULTILINE: Make ^ and $ match at line breaks
re.DOTALL: Make . match newlines too
re.VERBOSE: Allow comments and whitespace in patterns
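The last two flags in the list are easier to understand with a quick demonstration. This sketch uses invented sample strings:

```python
import re

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
date_re = re.compile(
    r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
    """,
    re.VERBOSE,
)
date_match = date_re.search("Released on 2023-03-15")

# re.DOTALL: . also matches newline characters
text = "<p>first line\nsecond line</p>"
without_dotall = re.search(r"<p>(.+)</p>", text)
with_dotall = re.search(r"<p>(.+)</p>", text, re.DOTALL)

print(date_match.group("month"))
print(without_dotall)
print(repr(with_dotall.group(1)))
```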

Compiling Regular Expressions

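If you're using the same pattern multiple times, compiling it once with re.compile() gives you a reusable pattern object. The re module already caches recently used patterns, so the speed gain is usually modest; the main benefit is cleaner code: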

import re

# Compile the pattern once
email_pattern = re.compile(r'\b[\w\.-]+@[\w\.-]+\.\w+\b')

# Use it multiple times
text1 = "Contact us at [email protected]"
text2 = "Send your resume to [email protected]"

print(email_pattern.search(text1).group())  # [email protected]
print(email_pattern.search(text2).group())  # [email protected]
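A compiled pattern offers the same operations as the module-level functions, as methods. A short sketch with an invented sample sentence:

```python
import re

word_re = re.compile(r"\b\w+\b")
text = "compile once, reuse often"

words = word_re.findall(text)                # all words
first = word_re.match(text).group()          # word at the start of the string
replaced = word_re.sub("X", text, count=2)   # replace only the first two words
parts = re.compile(r",\s*").split(text)      # split on a comma plus optional spaces

print(words, first, replaced, parts)
```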

Common Pitfalls and Tips

Escape special characters when you want to match them literally:

# To match a literal period:
pattern = r"example\.com"   # NOT "example.com"

Be careful with greedy quantifiers (*, +, etc.) as they match as much as possible:

text = "<div>Content here</div>"
greedy = re.search(r"<.+>", text).group()      # Matches "<div>Content here</div>"
non_greedy = re.search(r"<.+?>", text).group()  # Matches "<div>" (use ? after quantifier)

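Use raw strings (r"pattern") to avoid issues with backslashes:

pattern = r"\d+"   # Correct: the backslash reaches the regex engine unchanged
# NOT: pattern = "\d+"  # Fragile: "\d" is an invalid string escape (newer Python versions warn about it),
#                       # and escapes such as "\b" silently turn into other characters (backspace)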

Test your patterns on edge cases to ensure they work as expected.
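Two of the tips above can be checked directly. The sketch below uses re.escape() to escape arbitrary text before matching it literally, and compares a normal string with a raw string for a word-boundary pattern; the sample strings are invented for illustration:

```python
import re

# Escaping: re.escape() backslash-escapes metacharacters in arbitrary text
user_input = "example.com (beta)"   # hypothetical input
text = "Visit example.com (beta) or exampleXcom (beta)"
literal_hits = re.findall(re.escape(user_input), text)
unescaped_hits = re.findall(r"example.com", text)   # the bare dot matches any character

# Raw strings: backslashes are kept intact
normal = "\bword\b"   # "\b" here is the backspace character (\x08)
raw = r"\bword\b"     # backslash + 'b', which re reads as a word boundary
normal_match = re.search(normal, "a word here")
raw_match = re.search(raw, "a word here")

print(literal_hits)
print(unescaped_hits)
print(normal_match, raw_match.group())
```

The unescaped pattern also matches "exampleXcom", and the non-raw pattern finds nothing because the text contains no backspace characters.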

Summary

Regular expressions are a powerful tool for text processing in Python. We've covered:

Basic pattern matching with search(), match(), findall(), and sub()
Special characters and metacharacters for constructing patterns
Character classes and quantifiers for more complex matching
Grouping and capturing specific parts of matched text
Real-world applications like email validation and log parsing
Tips to avoid common regex pitfalls

With regular expressions, you can perform sophisticated text operations in just a few lines of code, making them an essential tool in a Python programmer's toolkit.

Exercises

Write a regular expression to extract all phone numbers in the format (XXX) XXX-XXXX from text.
Create a function that validates if a password meets these criteria:
- At least 8 characters long
- Contains at least one uppercase letter
- Contains at least one lowercase letter
- Contains at least one digit
- Contains at least one special character
Write a regex to extract all hashtags (words starting with #) from a social media post.
Create a function that can extract all dates in the format MM/DD/YYYY, DD/MM/YYYY, or YYYY-MM-DD from text.

Additional Resources

Python's re module documentation
Regular Expression HOWTO
Regex101 - An online regex tester with explanations
RegexOne - Interactive regex tutorial

Remember that regular expressions are a skill that improves with practice. Start with simple patterns and gradually tackle more complex text processing challenges as you become comfortable with the syntax.
