
Python Regular Expressions

Introduction

Regular expressions (regex or regexp) are powerful sequences of characters that define search patterns. In Python, regular expressions allow you to search for patterns in text, validate input, extract information, and perform complex string manipulations that would be difficult with standard string methods.

The re module in Python provides operations for working with regular expressions. This module enables you to check if a particular string matches a given pattern, extract parts of a string that match your criteria, or replace text that matches a pattern.

Why Learn Regular Expressions?

  • Powerful text processing: Accomplish complex pattern matching in just a few lines of code
  • Data validation: Verify if inputs match expected formats (email addresses, phone numbers, etc.)
  • Data extraction: Pull specific information from larger text bodies
  • Text transformation: Replace or reformat text based on patterns

Getting Started with Regular Expressions

To use regular expressions in Python, you'll need to import the re module first:

python
import re

Basic Pattern Matching

Let's start with a simple example - checking if a string contains a specific pattern:

python
import re

text = "Python is awesome"
pattern = "Python"

result = re.search(pattern, text)
if result:
    print("Pattern found!")
else:
    print("Pattern not found.")

Output:

Pattern found!

The search() function looks for the first occurrence of the pattern in the text and returns a match object if found, or None if not found.
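As a small sketch of what the returned match object offers (same sample text as above; the variable names `matched_text` and `position` are only illustrative), `group()` reports what matched and `span()` reports where:

```python
import re

text = "Python is awesome"
result = re.search("awesome", text)

# A match object is truthy; None (no match) is falsy.
if result:
    matched_text = result.group()  # the text that matched
    position = result.span()       # (start, end) indexes into text
    print(matched_text, position)  # awesome (10, 17)
```

Checking the result with `if result:` before calling `group()` avoids an AttributeError when the pattern is not found.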

Core Regular Expression Functions

Python's re module provides several key functions for working with patterns:

1. re.search(pattern, string)

Searches for the first occurrence of the pattern in the string.

python
import re

text = "Python was created in 1991 by Guido van Rossum"
result = re.search(r"created in (\d+)", text)

if result:
    print(f"Python was created in {result.group(1)}")

Output:

Python was created in 1991

2. re.match(pattern, string)

Checks if the pattern matches at the beginning of the string.

python
import re

# This will match
text1 = "Python is great"
result1 = re.match(r"Python", text1)
print(result1 is not None) # True

# This won't match
text2 = "I love Python"
result2 = re.match(r"Python", text2)
print(result2 is not None) # False

Output:

True
False
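A related function, re.fullmatch(), requires the pattern to cover the entire string, not just its beginning. The sketch below uses a made-up sample string to contrast the two:

```python
import re

text = "Python 3"

starts_with = re.match(r"Python", text)          # matches: pattern found at position 0
whole_string = re.fullmatch(r"Python", text)     # None: " 3" is left over
whole_string_ok = re.fullmatch(r"Python \d", text)  # matches the entire string

print(starts_with is not None)      # True
print(whole_string is not None)     # False
print(whole_string_ok is not None)  # True
```

For validation tasks, fullmatch() is often the safer choice because it cannot silently accept trailing text.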

3. re.findall(pattern, string)

Returns all non-overlapping matches of the pattern in the string as a list.

python
import re

text = "Contact us at [email protected] or [email protected]"
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)

Output: a list containing the two email addresses found in the text.

4. re.sub(pattern, replacement, string)

Replaces occurrences of the pattern in the string with the replacement text.

python
import re

text = "Python was created in 1991"
result = re.sub(r"\d+", "YEAR", text)
print(result)

Output:

Python was created in YEAR
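re.sub() can do more than insert fixed text. The sketch below, using illustrative sample dates, shows three variations: backreferences in the replacement, a replacement function (the name `next_year` is hypothetical), and the `count` argument:

```python
import re

text = "Released 1991-02-20, updated 2008-12-03"

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reordered)  # Released 20/02/1991, updated 03/12/2008

# A function receives each match object and returns the replacement text.
def next_year(match):
    return str(int(match.group()) + 1)

bumped = re.sub(r"\b\d{4}\b", next_year, text)
print(bumped)  # Released 1992-02-20, updated 2009-12-03

# count limits how many replacements are made.
first_only = re.sub(r"\d{4}", "YEAR", text, count=1)
print(first_only)  # Released YEAR-02-20, updated 2008-12-03
```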

Special Characters in Regular Expressions

Regular expressions use special characters to represent patterns:

| Character | Description | Example |
|---|---|---|
| . | Any character except newline | a.b matches "acb", "adb", etc. |
| ^ | Start of string | ^Python matches "Python" at the beginning |
| $ | End of string | Python$ matches "Python" at the end |
| * | 0 or more occurrences | ab*c matches "ac", "abc", "abbc", etc. |
| + | 1 or more occurrences | ab+c matches "abc", "abbc", but not "ac" |
| ? | 0 or 1 occurrence | ab?c matches "ac" and "abc" |
| \d | Any digit | \d+ matches one or more digits |
| \w | Any word character (letter, digit, or underscore) | \w+ matches one or more letters/digits/underscores |
| \s | Any whitespace | \s+ matches one or more spaces, tabs, newlines |
| [] | Character class | [abc] matches "a", "b", or "c" |
| () | Groups patterns | (ab)+ matches "ab", "abab", etc. |
| \b | Word boundary | \bword\b matches the whole word "word" |

Let's see some examples:

python
import re

# Match any email pattern
text = "Contact me at [email protected]"
email = re.search(r"\b[\w.]+@[\w.]+\.\w+\b", text)
print(email.group()) # [email protected]

# Find all words starting with 'p'
text = "Python programming is powerful and practical"
p_words = re.findall(r"\bp\w+", text, re.IGNORECASE)
print(p_words) # ['Python', 'programming', 'powerful', 'practical']

# Extract all dates in MM/DD/YYYY format
text = "Event dates: 12/25/2023, 01/15/2024, and 02/28/2024."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)
print(dates) # ['12/25/2023', '01/15/2024', '02/28/2024']

Character Classes and Quantifiers

Character Classes

Character classes allow you to match any one character from a set:

python
import re

# Match any vowel
text = "Python"
vowels = re.findall(r"[aeiou]", text)
print(vowels) # ['o']

# Match any digit
text = "Python was created in 1991"
digits = re.findall(r"[0-9]", text)
print(digits) # ['1', '9', '9', '1']

# Match characters within a range
text = "The ZIP code is 90210"
result = re.search(r"[0-9]{5}", text)
print(result.group()) # 90210
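Character classes can also be negated with a leading `^` and can combine several ranges. A brief sketch with made-up sample strings:

```python
import re

text = "Order #A-1042 shipped!"

# [^...] negates the class: anything that is NOT a letter, digit, or space.
symbols = re.findall(r"[^A-Za-z0-9 ]", text)
print(symbols)  # ['#', '-', '!']

# Several ranges can sit in one class: here, hexadecimal digits.
hex_colors = re.findall(r"#[0-9A-Fa-f]{6}", "color: #FF8800")
print(hex_colors)  # ['#FF8800']
```

Note that `^` only means "not" when it is the first character inside the brackets; elsewhere in a pattern it anchors to the start of the string.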

Quantifiers

Quantifiers specify how many times a character or group should occur:

python
import re

# Match 0 or more 'o's
text = "Gooooooal!"
result = re.search(r"Go*al", text)
print(result.group()) # Gooooooal

# Match exactly 3 digits
text = "The area code is 415."
result = re.search(r"\d{3}", text)
print(result.group()) # 415

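# Match between 2 and 4 consecutive occurrences of 'ha'
# (the non-capturing group (?:...) makes findall return the whole match;
# with a capturing group, findall would return only the group's text)
text = "hahahahaha"
result = re.findall(r"(?:ha){2,4}", text)
print(result) # ['hahahaha']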
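One detail worth keeping in mind when quantifying a group: if the pattern contains a capturing group, re.findall() returns the group's text rather than the whole match. The following sketch (sample string chosen for illustration) compares a capturing group, a non-capturing group, and re.finditer():

```python
import re

text = "nananana"  # 'na' repeated four times

# Capturing group: findall returns only what the group captured last.
captured = re.findall(r"(na){2,3}", text)
print(captured)  # ['na']

# Non-capturing group (?:...): findall returns each whole match.
whole = re.findall(r"(?:na){2,3}", text)
print(whole)  # ['nanana']

# finditer yields match objects, so span() and group() describe the whole match.
spans = [m.span() for m in re.finditer(r"(na){2,3}", text)]
print(spans)  # [(0, 6)]
```

The quantifier is greedy, so the first match takes three repetitions; the single 'na' left over is too short to satisfy {2,3} and is not matched.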

Grouping and Capturing

Parentheses () are used to group parts of a regular expression and capture matched text:

python
import re

# Extract information from text
text = "John Smith was born on 1990-05-15"
result = re.search(r"(\w+) (\w+) was born on (\d{4})-(\d{2})-(\d{2})", text)

if result:
    print(f"Full name: {result.group(1)} {result.group(2)}")
    print(f"Birth year: {result.group(3)}")
    print(f"Birth month: {result.group(4)}")
    print(f"Birth day: {result.group(5)}")

Output:

Full name: John Smith
Birth year: 1990
Birth month: 05
Birth day: 15

Named Groups

For more readability, you can use named groups with (?P<name>pattern) syntax:

python
import re

text = "John Smith was born on 1990-05-15"
pattern = r"(?P<first>\w+) (?P<last>\w+) was born on (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
result = re.search(pattern, text)

if result:
    print(f"First name: {result.group('first')}")
    print(f"Last name: {result.group('last')}")
    print(f"Birth date: {result.group('month')}/{result.group('day')}/{result.group('year')}")

Output:

First name: John
Last name: Smith
Birth date: 05/15/1990
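Named groups can also be collected in one step with the match object's groupdict() method. A minimal sketch reusing the pattern above (the variable names `fields` and `birth_year` are illustrative):

```python
import re

text = "John Smith was born on 1990-05-15"
pattern = r"(?P<first>\w+) (?P<last>\w+) was born on (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
result = re.search(pattern, text)

if result:
    # groupdict() maps each group name to the text it captured.
    fields = result.groupdict()
    # Captured values are always strings; convert them as needed.
    birth_year = int(fields["year"])
    print(fields["first"], fields["last"], birth_year)  # John Smith 1990
```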

Real-World Applications

Example 1: Email Validation

python
import re

def validate_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, email))

# Test the function
emails = ["[email protected]", "invalid.email@", "[email protected]", "not-an-email"]

for email in emails:
    if validate_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is NOT a valid email address")

Output:

[email protected] is a valid email address
invalid.email@ is NOT a valid email address
[email protected] is a valid email address
not-an-email is NOT a valid email address

Example 2: Extracting Information from a Log File

python
import re

log = """
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
2023-03-15 14:30:45 INFO User logged out: alice
"""

# Extract the timestamp, level, and message from every entry
pattern = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)"
matches = re.findall(pattern, log)

for timestamp, level, message in matches:
    print(f"[{timestamp}] {level}: {message}")

# Extract only ERROR entries
errors = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ERROR (.+)", log)
print("\nErrors found:")
for error in errors:
    print(f"- {error}")

Output:

[2023-03-15 14:23:45] INFO: User logged in: alice
[2023-03-15 14:25:16] ERROR: Failed to connect to database
[2023-03-15 14:26:02] WARNING: Disk usage above 80%
[2023-03-15 14:30:45] INFO: User logged out: alice

Errors found:
- Failed to connect to database
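For larger logs it can be convenient to combine a compiled pattern, named groups, and re.finditer() so that each entry becomes a dictionary. This is one possible sketch, not the only way to do it; the names `line_pattern`, `entries`, and `counts` are assumptions made for the example, and the log is a shortened copy of the one above:

```python
import re

log = """\
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
"""

# MULTILINE lets ^ and $ anchor to each line of the log.
line_pattern = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.+)$",
    re.MULTILINE,
)

# One dictionary per log line, keyed by group name.
entries = [m.groupdict() for m in line_pattern.finditer(log)]

# Count how many entries there are per level.
counts = {}
for entry in entries:
    counts[entry["level"]] = counts.get(entry["level"], 0) + 1

print(counts)  # {'INFO': 1, 'ERROR': 1, 'WARNING': 1}
```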

Example 3: Parsing HTML (Basic)

python
import re

html = """
<div class="content">
<h1>Welcome to Python</h1>
<p>Learn Python <a href="https://python.org">here</a>.</p>
<p>Or check out <a href="https://docs.python.org">the documentation</a>.</p>
</div>
"""

# Extract all links
links = re.findall(r'<a href="([^"]+)"', html)
print("Links found:", links)

# Extract headers
headers = re.findall(r'<h(\d)>([^<]+)</h\1>', html)
for level, text in headers:
    print(f"Header (level {level}): {text}")

Output:

Links found: ['https://python.org', 'https://docs.python.org']
Header (level 1): Welcome to Python
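The patterns above depend on the exact attribute layout, so they only suit simple, predictable markup. For anything less regular, a real parser is usually more robust. The sketch below swaps in a different technique, the standard library's html.parser module, instead of regular expressions; the class name `LinkCollector` is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# An extra attribute before href would defeat the regex r'<a href="([^"]+)"'.
html = '<p>Learn Python <a class="ext" href="https://python.org">here</a>.</p>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://python.org']
```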

Compilation Flags

Python's re module provides flags that modify how patterns are interpreted:

python
import re

text = "Python is amazing\nPYTHON is powerful"

# Case-insensitive matching
results1 = re.findall(r"python", text, re.IGNORECASE)
print(results1) # ['Python', 'PYTHON']

# Multi-line mode: ^ and $ match at line breaks
results2 = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)
print(results2) # ['Python', 'PYTHON']

Common flags include:

  • re.IGNORECASE: Case-insensitive matching
  • re.MULTILINE: Make ^ and $ match at line breaks
  • re.DOTALL: Make . match newlines too
  • re.VERBOSE: Allow comments and whitespace in patterns
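The last two flags in the list can be sketched as follows (sample strings are made up for illustration):

```python
import re

text = "<p>first line\nsecond line</p>"

# Without DOTALL, . stops at the newline, so there is no match.
print(re.search(r"<p>(.+)</p>", text))  # None

# With DOTALL, . also matches "\n".
match = re.search(r"<p>(.+)</p>", text, re.DOTALL)
print(repr(match.group(1)))  # 'first line\nsecond line'

# VERBOSE ignores whitespace in the pattern and allows # comments.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

date_parts = date_pattern.search("Due 2024-01-15").groups()
print(date_parts)  # ('2024', '01', '15')
```

Because VERBOSE ignores whitespace, a literal space in such a pattern has to be written as `\ ` or `[ ]`.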

Compiling Regular Expressions

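If you use the same pattern many times, compiling it once with re.compile() gives you a reusable pattern object. The re module already caches recently used patterns, so the speed gain is usually modest; the main benefits are readability and avoiding the cache lookup in tight loops: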

python
import re

# Compile the pattern once
email_pattern = re.compile(r'\b[\w\.-]+@[\w\.-]+\.\w+\b')

# Use it multiple times
text1 = "Contact us at [email protected]"
text2 = "Send your resume to [email protected]"

print(email_pattern.search(text1).group()) # [email protected]
print(email_pattern.search(text2).group()) # [email protected]
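A compiled pattern object exposes the same operations as the module-level functions, including sub() and split(). A short sketch, assuming a hypothetical `whitespace` pattern and sample lines:

```python
import re

# Hypothetical pattern: one or more whitespace characters.
whitespace = re.compile(r"\s+")

lines = ["Python   is  awesome", "Regex\tsaves   time"]

# Collapse every run of whitespace into a single space.
normalized = [whitespace.sub(" ", line) for line in lines]
print(normalized)  # ['Python is awesome', 'Regex saves time']

# Split on the same pattern.
tokens = whitespace.split(lines[0])
print(tokens)  # ['Python', 'is', 'awesome']
```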

Common Pitfalls and Tips

  1. Escape special characters when you want to match them literally:

    python
    # To match a literal period:
    pattern = r"example\.com" # NOT "example.com"
  2. Be careful with greedy quantifiers (*, +, etc.) as they match as much as possible:

    python
    text = "<div>Content here</div>"
    greedy = re.search(r"<.+>", text).group() # Matches "<div>Content here</div>"
    non_greedy = re.search(r"<.+?>", text).group() # Matches "<div>" (use ? after quantifier)
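  3. Use raw strings (r"pattern") so that backslashes reach the regex engine unchanged:

    python
    pattern = r"\bword\b"   # Correct: \b is a word boundary
    # NOT: pattern = "\bword\b" # Incorrect: Python converts \b to a backspace character first
    # "\d+" happens to work because \d is not a Python string escape, but recent Python versions warn about it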
  4. Test your patterns on edge cases to ensure they work as expected.
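Related to the first tip: when the text to match comes from a variable, such as user input, re.escape() can escape the special characters for you. A minimal sketch with a made-up input string:

```python
import re

user_input = "1+1=2 (really?)"  # contains +, (, ), and ?, all metacharacters
text = "Everyone knows 1+1=2 (really?) is true."

# Used directly, the metacharacters are interpreted as regex syntax.
unescaped = re.search(user_input, text)
print(unescaped)  # None

# re.escape() returns a pattern that matches the input literally.
found = re.search(re.escape(user_input), text)
print(found.group())  # 1+1=2 (really?)
```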

Summary

Regular expressions are a powerful tool for text processing in Python. We've covered:

  • Basic pattern matching with search(), match(), findall(), and sub()
  • Special characters and metacharacters for constructing patterns
  • Character classes and quantifiers for more complex matching
  • Grouping and capturing specific parts of matched text
  • Real-world applications like email validation and log parsing
  • Tips to avoid common regex pitfalls

With regular expressions, you can perform sophisticated text operations in just a few lines of code, making them an essential tool in a Python programmer's toolkit.

Exercises

  1. Write a regular expression to extract all phone numbers in the format (XXX) XXX-XXXX from text.
  2. Create a function that validates if a password meets these criteria:
    • At least 8 characters long
    • Contains at least one uppercase letter
    • Contains at least one lowercase letter
    • Contains at least one digit
    • Contains at least one special character
  3. Write a regex to extract all hashtags (words starting with #) from a social media post.
  4. Create a function that can extract all dates in the format MM/DD/YYYY, DD/MM/YYYY, or YYYY-MM-DD from text.

Additional Resources

Remember that regular expressions are a skill that improves with practice. Start with simple patterns and gradually tackle more complex text processing challenges as you become comfortable with the syntax.


