Python Regular Expressions
Introduction
Regular expressions (regex or regexp) are powerful sequences of characters that define search patterns. In Python, regular expressions allow you to search for patterns in text, validate input, extract information, and perform complex string manipulations that would be difficult with standard string methods.
The re module in Python provides operations for working with regular expressions. This module enables you to check if a particular string matches a given pattern, extract parts of a string that match your criteria, or replace text that matches a pattern.
Why Learn Regular Expressions?
- Powerful text processing: Accomplish complex pattern matching in just a few lines of code
- Data validation: Verify if inputs match expected formats (email addresses, phone numbers, etc.)
- Data extraction: Pull specific information from larger text bodies
- Text transformation: Replace or reformat text based on patterns
Getting Started with Regular Expressions
To use regular expressions in Python, you'll need to import the re module first:
import re
Basic Pattern Matching
Let's start with a simple example - checking if a string contains a specific pattern:
import re
text = "Python is awesome"
pattern = "Python"
result = re.search(pattern, text)
if result:
    print("Pattern found!")
else:
    print("Pattern not found.")
Output:
Pattern found!
The search() function looks for the first occurrence of the pattern in the text and returns a match object if found, or None if not found.
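A match object carries more than a truthy value. As a minimal sketch (the sample text is reused from above; the variable name match is only illustrative), the matched text and its position can be read like this:

```python
import re

text = "Python is awesome"
match = re.search(r"awesome", text)

if match:
    print(match.group())  # the matched text: awesome
    print(match.start())  # index where the match begins: 10
    print(match.end())    # index just past the match: 17
    print(match.span())   # (start, end) as a tuple: (10, 17)
```

Checking the result before calling group() matters, because search() returns None when nothing matches.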
Core Regular Expression Functions
Python's re module provides several key functions for working with patterns:
1. re.search(pattern, string)
Searches for the first occurrence of the pattern in the string.
import re
text = "Python was created in 1991 by Guido van Rossum"
result = re.search(r"created in (\d+)", text)
if result:
    print(f"Python was created in {result.group(1)}")
Output:
Python was created in 1991
2. re.match(pattern, string)
Checks if the pattern matches at the beginning of the string.
import re
# This will match
text1 = "Python is great"
result1 = re.match(r"Python", text1)
print(result1 is not None) # True
# This won't match
text2 = "I love Python"
result2 = re.match(r"Python", text2)
print(result2 is not None) # False
Output:
True
False
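Note that match() anchors only the start of the string, so trailing text is still allowed. When the whole string has to fit the pattern, re.fullmatch() is the usual choice. A small sketch, with a made-up sample string:

```python
import re

candidate = "Python3"

# match() only anchors the start, so the trailing "3" is ignored
starts_with = re.match(r"Python", candidate) is not None
print(starts_with)  # True

# fullmatch() requires the entire string to match the pattern
whole_plain = re.fullmatch(r"Python", candidate) is not None
print(whole_plain)  # False

whole_digit = re.fullmatch(r"Python\d", candidate) is not None
print(whole_digit)  # True
```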
3. re.findall(pattern, string)
Returns all non-overlapping matches of the pattern in the string as a list.
import re
text = "Contact us at [email protected] or [email protected]"
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)
Output:
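As a runnable sketch of findall() on email-like text, the snippet below uses hypothetical example.com and example.org addresses (placeholders chosen for illustration, not the addresses from the example above):

```python
import re

# Hypothetical addresses, used only for illustration
sample = "Contact us at support@example.com or sales@example.org"
found = re.findall(r"\S+@\S+\.\S+", sample)
print(found)  # ['support@example.com', 'sales@example.org']
```

Because \S matches any non-whitespace character, this pattern would also swallow punctuation that directly follows an address, such as a trailing comma.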
4. re.sub(pattern, replacement, string)
Replaces occurrences of the pattern in the string with the replacement text.
import re
text = "Python was created in 1991"
result = re.sub(r"\d+", "YEAR", text)
print(result)
Output:
Python was created in YEAR
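The replacement does not have to be a fixed string. The following sketch, using a made-up sample sentence and a hypothetical helper named next_year, shows three common variations: backreferences to captured groups, the count argument, and a function as the replacement.

```python
import re

text = "Dates: 2023-03-15 and 2024-01-02"

# \1, \2, \3 in the replacement refer to the captured groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(swapped)  # Dates: 15/03/2023 and 02/01/2024

# count=1 limits the number of replacements
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)
print(first_only)  # Dates: YYYY-03-15 and 2024-01-02

# A function receives each match object and returns the replacement text
def next_year(match):
    return str(int(match.group()) + 1)

bumped = re.sub(r"\d{4}", next_year, text)
print(bumped)  # Dates: 2024-03-15 and 2025-01-02
```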
Special Characters in Regular Expressions
Regular expressions use special characters to represent patterns:
| Character | Description | Example |
|---|---|---|
| . | Any character except newline | a.b matches "acb", "adb", etc. |
| ^ | Start of string | ^Python matches "Python" at the beginning |
| $ | End of string | Python$ matches "Python" at the end |
| * | 0 or more occurrences | ab*c matches "ac", "abc", "abbc", etc. |
| + | 1 or more occurrences | ab+c matches "abc", "abbc", but not "ac" |
| ? | 0 or 1 occurrence | ab?c matches "ac" and "abc" |
| \d | Any digit | \d+ matches one or more digits |
| \w | Any word character (letter, digit, or underscore) | \w+ matches one or more letters/digits/underscores |
| \s | Any whitespace character | \s+ matches one or more spaces, tabs, newlines |
| [] | Character class | [abc] matches "a", "b", or "c" |
| () | Groups patterns | (ab)+ matches "ab", "abab", etc. |
| \b | Word boundary | \bword\b matches the whole word "word" |
Let's see some examples:
import re
# Match any email pattern
text = "Contact me at [email protected]"
email = re.search(r"\b[\w.]+@[\w.]+\.\w+\b", text)
print(email.group()) # [email protected]
# Find all words starting with 'p'
text = "Python programming is powerful and practical"
p_words = re.findall(r"\bp\w+", text, re.IGNORECASE)
print(p_words) # ['Python', 'programming', 'powerful', 'practical']
# Extract all dates in MM/DD/YYYY format
text = "Event dates: 12/25/2023, 01/15/2024, and 02/28/2024."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)
print(dates) # ['12/25/2023', '01/15/2024', '02/28/2024']
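A few of the table entries can be checked directly. This is a small sketch with made-up sample strings, using re.fullmatch() so that each pattern must cover the whole string:

```python
import re

# . matches any single character except a newline
dot_letter = bool(re.fullmatch(r"a.b", "acb"))
dot_newline = bool(re.fullmatch(r"a.b", "a\nb"))
print(dot_letter, dot_newline)  # True False

# * allows zero occurrences, + requires at least one
star_zero = bool(re.fullmatch(r"ab*c", "ac"))
plus_zero = bool(re.fullmatch(r"ab+c", "ac"))
print(star_zero, plus_zero)  # True False

# \b keeps "word" from matching inside a longer word
whole_words = re.findall(r"\bword\b", "word swordfish words word")
print(whole_words)  # ['word', 'word']
```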
Character Classes and Quantifiers
Character Classes
Character classes allow you to match any one character from a set:
import re
# Match any vowel
text = "Python"
vowels = re.findall(r"[aeiou]", text)
print(vowels) # ['o']
# Match any digit
text = "Python was created in 1991"
digits = re.findall(r"[0-9]", text)
print(digits) # ['1', '9', '9', '1']
# Match characters within a range
text = "The ZIP code is 90210"
result = re.search(r"[0-9]{5}", text)
print(result.group()) # 90210
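Character classes can also be negated with a leading ^, and several ranges can share one class. A brief sketch with a made-up sample string:

```python
import re

text = "Order #A-42 shipped to ZIP 90210!"

# [^...] negates the class: anything that is not a letter, digit, or space
symbols = re.findall(r"[^A-Za-z0-9 ]", text)
print(symbols)  # ['#', '-', '!']

# Several ranges can be combined in one class
upper_or_digit = re.findall(r"[A-Z0-9]+", text)
print(upper_or_digit)  # ['O', 'A', '42', 'ZIP', '90210']
```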
Quantifiers
Quantifiers specify how many times a character or group should occur:
import re
# Match 0 or more 'o's
text = "Gooooooal!"
result = re.search(r"Go*al", text)
print(result.group()) # Gooooooal
# Match exactly 3 digits
text = "The area code is 415."
result = re.search(r"\d{3}", text)
print(result.group()) # 415
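# Match between 2 and 4 occurrences of 'ha'
# (a non-capturing group makes findall() return the whole match;
# with a capturing group it would return only the last captured 'ha')
text = "hahahahaha"
result = re.findall(r"(?:ha){2,4}", text)
print(result) # ['hahahaha']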
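The brace quantifier has three forms: {n} for an exact count, {n,} for a minimum, and {n,m} for a range. The sketch below compares them on a made-up string, with \b used so that each run of letters is treated as a whole word:

```python
import re

text = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", text)
print(exactly_two)  # ['aa']

two_or_more = re.findall(r"\ba{2,}\b", text)
print(two_or_more)  # ['aa', 'aaa', 'aaaa']

two_to_three = re.findall(r"\ba{2,3}\b", text)
print(two_to_three)  # ['aa', 'aaa']

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']
```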
Grouping and Capturing
Parentheses () are used to group parts of a regular expression and capture matched text:
import re
# Extract information from text
text = "John Smith was born on 1990-05-15"
result = re.search(r"(\w+) (\w+) was born on (\d{4})-(\d{2})-(\d{2})", text)
if result:
    print(f"Full name: {result.group(1)} {result.group(2)}")
    print(f"Birth year: {result.group(3)}")
    print(f"Birth month: {result.group(4)}")
    print(f"Birth day: {result.group(5)}")
Output:
Full name: John Smith
Birth year: 1990
Birth month: 05
Birth day: 15
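When every group is needed, calling group() once per index gets repetitive. As a sketch reusing the same sample text, groups() returns all captures as a tuple that can be unpacked in one step, and group(0) returns the whole match:

```python
import re

text = "John Smith was born on 1990-05-15"
result = re.search(r"(\w+) (\w+) was born on (\d{4})-(\d{2})-(\d{2})", text)

if result:
    print(result.group(0))  # the whole match
    print(result.groups())  # ('John', 'Smith', '1990', '05', '15')
    first, last, year, month, day = result.groups()
    print(f"{last}, {first} ({year})")  # Smith, John (1990)
```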
Named Groups
For better readability, you can use named groups with the (?P<name>pattern) syntax:
import re
text = "John Smith was born on 1990-05-15"
pattern = r"(?P<first>\w+) (?P<last>\w+) was born on (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
result = re.search(pattern, text)
if result:
    print(f"First name: {result.group('first')}")
    print(f"Last name: {result.group('last')}")
    print(f"Birth date: {result.group('month')}/{result.group('day')}/{result.group('year')}")
Output:
First name: John
Last name: Smith
Birth date: 05/15/1990
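Named groups also work with groupdict(), which returns all of them as a dictionary, and with the \g<name> syntax in re.sub() replacements. A short sketch, using a date-only pattern for brevity:

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
result = re.search(pattern, "John Smith was born on 1990-05-15")

if result:
    parts = result.groupdict()
    print(parts)  # {'year': '1990', 'month': '05', 'day': '15'}

# \g<name> refers to a named group inside a re.sub() replacement
reordered = re.sub(pattern, r"\g<month>/\g<day>/\g<year>", "Born 1990-05-15")
print(reordered)  # Born 05/15/1990
```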
Real-World Applications
Example 1: Email Validation
import re
def validate_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, email))
# Test the function
emails = ["[email protected]", "invalid.email@", "[email protected]", "not-an-email"]
for email in emails:
    if validate_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is NOT a valid email address")
Output:
[email protected] is a valid email address
invalid.email@ is NOT a valid email address
[email protected] is a valid email address
not-an-email is NOT a valid email address
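One subtlety of validating with re.match() and $ is that $ also matches just before a trailing newline, so an input ending in "\n" can slip through. A possible variant, sketched here with hypothetical names (EMAIL_RE, is_valid_email) and hypothetical example.com addresses, uses fullmatch() instead. The pattern is still a deliberate simplification, not a complete email grammar.

```python
import re

# Simplified pattern; real email syntax is considerably more permissive
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

def is_valid_email(address):
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL_RE.fullmatch(address) is not None

# Hypothetical test inputs
for address in ["user@example.com", "invalid.email@", "user@example.com\n"]:
    print(repr(address), is_valid_email(address))
```

With fullmatch(), the third input is rejected because the trailing newline is not part of the pattern.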
Example 2: Extracting Information from a Log File
import re
log = """
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
2023-03-15 14:30:45 INFO User logged out: alice
"""
# Extract all timestamps and messages
pattern = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)"
matches = re.findall(pattern, log)
for timestamp, level, message in matches:
    print(f"[{timestamp}] {level}: {message}")
# Extract only ERROR entries
errors = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ERROR (.+)", log)
print("\nErrors found:")
for error in errors:
    print(f"- {error}")
Output:
[2023-03-15 14:23:45] INFO: User logged in: alice
[2023-03-15 14:25:16] ERROR: Failed to connect to database
[2023-03-15 14:26:02] WARNING: Disk usage above 80%
[2023-03-15 14:30:45] INFO: User logged out: alice
Errors found:
- Failed to connect to database
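For larger logs, a compiled pattern with named groups and finditer() can be easier to maintain than positional tuples. The following is one possible sketch on a shortened copy of the log; LINE_RE and the group names are illustrative choices, not part of the original example.

```python
import re

log = """\
2023-03-15 14:23:45 INFO User logged in: alice
2023-03-15 14:25:16 ERROR Failed to connect to database
2023-03-15 14:26:02 WARNING Disk usage above 80%
"""

# re.MULTILINE lets ^ and $ anchor each line of the log
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.+)$",
    re.MULTILINE,
)

# finditer() yields one match object per log line
entries = [m.groupdict() for m in LINE_RE.finditer(log)]

# Count entries per log level
counts = {}
for entry in entries:
    counts[entry["level"]] = counts.get(entry["level"], 0) + 1

print(counts)  # {'INFO': 1, 'ERROR': 1, 'WARNING': 1}
```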
Example 3: Parsing HTML (Basic)
import re
html = """
<div class="content">
<h1>Welcome to Python</h1>
<p>Learn Python <a href="https://python.org">here</a>.</p>
<p>Or check out <a href="https://docs.python.org">the documentation</a>.</p>
</div>
"""
# Extract all links
links = re.findall(r'<a href="([^"]+)"', html)
print("Links found:", links)
# Extract headers
headers = re.findall(r'<h(\d)>([^<]+)</h\1>', html)
for level, text in headers:
    print(f"Header (level {level}): {text}")
Output:
Links found: ['https://python.org', 'https://docs.python.org']
Header (level 1): Welcome to Python
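Regex-based HTML extraction is brittle: the link pattern above assumes href is the first attribute of the tag, for instance. For anything beyond quick scripts, a real parser is usually safer. As a sketch using the standard library's html.parser module (LinkCollector is a hypothetical class name, and the sample markup is made up):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# href is not the first attribute here, which the regex above would miss
html = '<p>Learn Python <a class="ext" href="https://python.org">here</a>.</p>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://python.org']
```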
Compilation Flags
Python's re module provides flags that modify how patterns are interpreted:
import re
text = "Python is amazing\nPYTHON is powerful"
# Case-insensitive matching
results1 = re.findall(r"python", text, re.IGNORECASE)
print(results1) # ['Python', 'PYTHON']
# Multi-line mode: ^ and $ match at line breaks
results2 = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)
print(results2) # ['Python', 'PYTHON']
Common flags include:
- re.IGNORECASE: Case-insensitive matching
- re.MULTILINE: Make ^ and $ match at line breaks
- re.DOTALL: Make . match newlines too
- re.VERBOSE: Allow comments and whitespace in patterns
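The last two flags are not demonstrated above, so here is a minimal sketch of each, with made-up sample strings and an illustrative pattern name (date_re):

```python
import re

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
date_re = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
    """,
    re.VERBOSE,
)
date_parts = date_re.search("Released on 2023-03-15").groups()
print(date_parts)  # ('2023', '03', '15')

# re.DOTALL: . also matches newline characters
text = "<p>first line\nsecond line</p>"
without_flag = re.search(r"<p>(.+)</p>", text)
with_flag = re.search(r"<p>(.+)</p>", text, re.DOTALL)
print(without_flag)              # None, because . stops at the newline
print(repr(with_flag.group(1)))  # 'first line\nsecond line'
```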
Compiling Regular Expressions
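If you're using the same pattern multiple times, compiling it once gives you a reusable pattern object and can save a little overhead (the re module already caches recently used patterns, so the gain is usually modest):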
import re
# Compile the pattern once
email_pattern = re.compile(r'\b[\w\.-]+@[\w\.-]+\.\w+\b')
# Use it multiple times
text1 = "Contact us at [email protected]"
text2 = "Send your resume to [email protected]"
print(email_pattern.search(text1).group()) # [email protected]
print(email_pattern.search(text2).group()) # [email protected]
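A compiled pattern exposes the same operations as the module-level functions (search(), findall(), sub(), and so on). The sketch below uses hypothetical example.com and example.org addresses and guards against lines with no match, since calling group() on None raises an AttributeError:

```python
import re

# Simplified pattern; the addresses below are hypothetical
email_re = re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")

texts = [
    "Contact us at info@example.com",
    "No address in this line",
    "Send your resume to jobs@example.org today",
]

results = []
for line in texts:
    match = email_re.search(line)
    # search() returns None when nothing matches, so check before group()
    results.append(match.group() if match else None)

print(results)  # ['info@example.com', None, 'jobs@example.org']
```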
Common Pitfalls and Tips
- Escape special characters when you want to match them literally:
  # To match a literal period:
  pattern = r"example\.com" # NOT "example.com"
- Be careful with greedy quantifiers (*, +, etc.), as they match as much as possible:
  text = "<div>Content here</div>"
  greedy = re.search(r"<.+>", text).group() # Matches "<div>Content here</div>"
  non_greedy = re.search(r"<.+?>", text).group() # Matches "<div>" (use ? after quantifier)
- Use raw strings (r"pattern") to avoid issues with backslashes:
  pattern = r"\d+" # Correct: Matches digits
  # NOT: pattern = "\d+" # "\d" is an invalid string escape and triggers a warning in recent Python versions
  # Other sequences fail silently: in "\bword\b", Python turns each \b into a backspace character
- Test your patterns on edge cases to ensure they work as expected.
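When the text to match comes from a variable rather than a hand-written pattern, escaping by hand is error-prone. The re.escape() function handles this; a small sketch with made-up input:

```python
import re

user_input = "example.com (v1.0)"

# re.escape() backslash-escapes the regex metacharacters in the string
safe = re.escape(user_input)

text = "Download from example.com (v1.0) or exampleXcom (v1.0)"
literal_hits = re.findall(safe, text)
print(literal_hits)  # ['example.com (v1.0)']
```

Without escaping, the dots would match any character and the parentheses would be treated as a group rather than as literal text.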
Summary
Regular expressions are a powerful tool for text processing in Python. We've covered:
- Basic pattern matching with search(), match(), findall(), and sub()
- Special characters and metacharacters for constructing patterns
- Character classes and quantifiers for more complex matching
- Grouping and capturing specific parts of matched text
- Real-world applications like email validation and log parsing
- Tips to avoid common regex pitfalls
With regular expressions, you can perform sophisticated text operations in just a few lines of code, making them an essential tool in a Python programmer's toolkit.
Exercises
- Write a regular expression to extract all phone numbers in the format (XXX) XXX-XXXX from text.
- Create a function that validates if a password meets these criteria:
- At least 8 characters long
- Contains at least one uppercase letter
- Contains at least one lowercase letter
- Contains at least one digit
- Contains at least one special character
- Write a regex to extract all hashtags (words starting with #) from a social media post.
- Create a function that can extract all dates in the format MM/DD/YYYY, DD/MM/YYYY, or YYYY-MM-DD from text.
Additional Resources
- Python's re module documentation
- Regular Expression HOWTO
- Regex101 - An online regex tester with explanations
- RegexOne - Interactive regex tutorial
Remember that regular expressions are a skill that improves with practice. Start with simple patterns and gradually tackle more complex text processing challenges as you become comfortable with the syntax.