Ubuntu Regular Expressions

Regular expressions (regex) are powerful pattern-matching tools that allow you to search, match, and manipulate text in sophisticated ways. In Ubuntu shell scripting, regular expressions are essential for processing log files, validating input, extracting data, and automating various text manipulation tasks.

Introduction to Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. These patterns can be used to:

Match text that follows specific patterns
Find and replace text
Extract information from files and strings
Validate input formats (like email addresses, phone numbers, etc.)

In Ubuntu, several command-line tools support regular expressions, including grep, sed, awk, and various programming languages. The syntax differs between tools and modes (basic, extended, and Perl-compatible), so a pattern may need small adjustments when moved from one tool to another, but the fundamental concepts remain consistent.

Basic Regular Expression Syntax

Let's start with the basic building blocks of regular expressions:

Literal Characters

Any regular character in a pattern matches itself. For example, the pattern ubuntu will match the string "ubuntu".

bash
$ echo "I love ubuntu linux" | grep "ubuntu"
I love ubuntu linux

Special Characters and Metacharacters

Regular expressions include special characters (metacharacters) that have specific meanings:

Metacharacter	Description	Example
`.`	Matches any single character except newline	`a.c` matches "abc", "adc", "a1c", etc.
`^`	Matches the start of a line	`^ubuntu` matches "ubuntu" only at the beginning of a line
`$`	Matches the end of a line	`ubuntu$` matches "ubuntu" only at the end of a line
`*`	Matches zero or more occurrences of the preceding element	`ab*c` matches "ac", "abc", "abbc", etc.
`+`	Matches one or more occurrences of the preceding element (extended syntax; written `\+` in GNU basic syntax)	`ab+c` matches "abc", "abbc", but not "ac"
`?`	Matches zero or one occurrence of the preceding element (extended syntax; written `\?` in GNU basic syntax)	`ab?c` matches "ac" and "abc", but not "abbc"
`\`	Escapes special characters	`\.` matches a literal period instead of any character
`[]`	Character class - matches any character within the brackets	`[aeiou]` matches any lowercase vowel
`[^]`	Negated character class - matches any character not within the brackets	`[^aeiou]` matches any character that is not a lowercase vowel, including digits and spaces
`()`	Groups patterns together (extended syntax; written `\(\)` in basic syntax)	`(ubuntu)` groups the pattern "ubuntu" so it can be quantified or backreferenced
`|`	Alternation - matches either pattern (extended syntax; written `\|` in GNU basic syntax)	`ubuntu|debian` matches either "ubuntu" or "debian"
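As a rough sketch of how the table plays out in practice, the snippet below runs a few of these metacharacters through grep. The sample text and the helper name `match` are made up for illustration, and the behavior assumes GNU grep as shipped with Ubuntu.

```bash
#!/usr/bin/env bash
# Hypothetical sample input, one candidate string per line.
sample=$'ac\nabc\nabbc\na.c\nubuntu\ndebian'

# match PATTERN [grep options...]: print the sample lines that match PATTERN.
match() {
    local pattern=$1
    shift
    grep "$@" -e "$pattern" <<< "$sample"
}

match 'ab*c'               # basic syntax: * needs no escaping
match 'ab+c' -E            # extended syntax: + means "one or more"
match 'ab\+c'              # the same pattern in GNU basic syntax
match 'a\.c'               # escaped dot matches only a literal period
match 'ubuntu|debian' -E   # alternation in extended syntax
```

Without `-E`, an unescaped `+` or `|` is treated as an ordinary character, which is a common reason a pattern silently matches nothing.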

Common Regular Expression Examples

Let's explore some practical examples of regular expressions in Ubuntu:

Example 1: Matching IP Addresses

A pattern to match IPv4 addresses:

bash
$ ifconfig | grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
        inet 192.168.1.10  netmask 255.255.255.0  broadcast 192.168.1.255
        inet 127.0.0.1  netmask 255.0.0.0
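The command above prints whole lines. If only the addresses themselves are wanted, grep's `-o` option can help. The following is a minimal sketch; the function name `extract_ipv4` and the sample input are illustrative assumptions. Note that the pattern only checks the shape of an address, so something like 999.999.999.999 would also be reported.

```bash
#!/usr/bin/env bash
# extract_ipv4: read text on stdin and print every IPv4-shaped token,
# one per line. -o prints only the matched part of each line.
extract_ipv4() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}'
}

# Hypothetical sample input resembling ifconfig output.
printf '%s\n' \
    'inet 192.168.1.10  netmask 255.255.255.0  broadcast 192.168.1.255' \
    'inet 127.0.0.1  netmask 255.0.0.0' | extract_ipv4
```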

Example 2: Finding Email Addresses in a File

bash
$ grep -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' contacts.txt
[email protected]
[email protected]

Example 3: Validating Phone Numbers

bash
$ echo "My phone number is 123-456-7890" | grep -E '\b[0-9]{3}-[0-9]{3}-[0-9]{4}\b'
My phone number is 123-456-7890
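The grep call above finds a phone number anywhere in a line. To validate that an entire string is a phone number and nothing else, the whole line has to match. One way to sketch this, assuming the same NNN-NNN-NNNN format and a hypothetical helper named `is_phone`:

```bash
#!/usr/bin/env bash
# is_phone STRING: succeed only if STRING is exactly NNN-NNN-NNNN.
# -x requires the whole line to match; -q suppresses output so that
# only the exit status is used.
is_phone() {
    grep -Eqx '[0-9]{3}-[0-9]{3}-[0-9]{4}' <<< "$1"
}

for candidate in '123-456-7890' 'call 123-456-7890' '123-456-78901'; do
    if is_phone "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

Only the first candidate is reported as valid; the other two contain extra text or an extra digit.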

Regular Expressions with Common Ubuntu Tools

grep

The grep command is one of the most common tools for using regular expressions in Ubuntu:

bash
# Basic grep
$ grep "ubuntu" file.txt

# Using extended regular expressions
$ grep -E "ubuntu|debian" file.txt

# Case-insensitive search
$ grep -i "UBUNTU" file.txt

# Show line numbers
$ grep -n "ubuntu" file.txt

# Recursive search through directories
$ grep -r "ubuntu" /path/to/directory/

sed

The Stream Editor sed is perfect for search and replace operations:

bash
# Replace first occurrence of 'ubuntu' with 'Ubuntu' on each line
$ sed 's/ubuntu/Ubuntu/' file.txt

# Replace all occurrences of 'ubuntu' with 'Ubuntu'
$ sed 's/ubuntu/Ubuntu/g' file.txt

# Replace 'ubuntu' with 'Ubuntu' only if the line contains 'linux'
$ sed '/linux/s/ubuntu/Ubuntu/g' file.txt

# Delete lines matching a pattern
$ sed '/^#/d' file.txt  # Deletes comment lines starting with #
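sed can also reuse parts of the matched text through capture groups. The sketch below assumes GNU sed with `-E` (extended syntax); the function name `swap_names` and the sample names are only illustrative.

```bash
#!/usr/bin/env bash
# swap_names: rewrite "Last, First" as "First Last" on each input line.
# Group 1 captures everything before the comma, group 2 everything after
# the comma and any following whitespace. Lines without a comma are
# passed through unchanged.
swap_names() {
    sed -E 's/^([^,]+),[[:space:]]*(.+)$/\2 \1/'
}

printf '%s\n' 'Torvalds, Linus' 'Shuttleworth, Mark' | swap_names
```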

awk

The awk programming language is excellent for processing text data:

bash
# Print lines where the first field matches 'ubuntu'
$ awk '$1 ~ /ubuntu/' file.txt

# Sum the values in the third column
$ awk '{sum += $3} END {print sum}' file.txt

# Print lines that match a regular expression
$ awk '/ubuntu/' file.txt

# Format output based on patterns
$ awk '/error/ {print "ERROR: " $0}; /warning/ {print "WARNING: " $0}' log.txt
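awk can also pull the matched text out of a line with its built-in `match()` function. The following is a small sketch that should work with any POSIX awk; the log format, the "ms" suffix, and the function name `sum_durations` are assumptions made up for this example.

```bash
#!/usr/bin/env bash
# sum_durations: add up the first "<number>ms" token found on each line.
sum_durations() {
    awk '
        match($0, /[0-9]+ms/) {
            # RSTART and RLENGTH describe the matched text;
            # subtracting 2 drops the trailing "ms".
            total += substr($0, RSTART, RLENGTH - 2)
        }
        END { print total + 0 }
    '
}

printf '%s\n' 'GET /index 12ms' 'GET /about 30ms' 'healthcheck ok' | sum_durations
```

Lines without a duration are skipped because `match()` returns 0 for them, so the pattern part of the rule is false.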

Character Classes and Shorthand Notations

Regular expressions offer shorthand notations for common character classes:

Notation	Description	Equivalent
`\d`	Digit	`[0-9]`
`\D`	Non-digit	`[^0-9]`
`\w`	Word character	`[A-Za-z0-9_]`
`\W`	Non-word character	`[^A-Za-z0-9_]`
`\s`	Whitespace	`[ \t\n\r\f]`
`\S`	Non-whitespace	`[^ \t\n\r\f]`

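Note: These shorthands come from Perl-compatible regular expressions. GNU grep accepts \w, \W, \s, and \S in basic and extended mode, but \d and \D only work with the -P (Perl-compatible) flag. A portable alternative is a POSIX bracket class such as [[:digit:]] or [[:space:]].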
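To illustrate the portable spelling, here is a brief sketch using POSIX bracket classes with grep -E. The function names and the sample sentence are hypothetical; `\w` is shown only for comparison and relies on GNU grep.

```bash
#!/usr/bin/env bash
# digits_posix: print each run of digits from stdin, one per line,
# using a POSIX class instead of \d.
digits_posix() {
    grep -oE '[[:digit:]]+'
}

# Two spellings of "a run of word characters".
words_gnu()   { grep -oE '\w+'; }             # GNU extension
words_posix() { grep -oE '[[:alnum:]_]+'; }   # POSIX bracket class

echo 'build_42 took 7 seconds' | digits_posix
echo 'build_42 took 7 seconds' | words_posix
```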

Quantifiers in Regular Expressions

Quantifiers specify how many instances of a character, group, or character class must be present for a match:

Quantifier	Description
`*`	Match 0 or more times
`+`	Match 1 or more times
`?`	Match 0 or 1 time
`{n}`	Match exactly n times
`{n,}`	Match at least n times
`{n,m}`	Match between n and m times

Example:

bash
# Match lines containing exactly 8-digit numbers
$ grep -E '\b[0-9]{8}\b' file.txt

# Match lines with words having 5 to 10 characters
$ grep -E '\b\w{5,10}\b' file.txt
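The effect of each interval quantifier is easier to see when grep prints only the matched text. This is a minimal sketch with made-up sample data and a hypothetical helper `digit_runs`; it assumes GNU grep, where `\b` marks a word boundary.

```bash
#!/usr/bin/env bash
# Hypothetical sample: numbers with 7, 8, and 9 digits.
sample='ids: 1234567 12345678 123456789'

# digit_runs QUANTIFIER: print the standalone digit runs whose length
# satisfies QUANTIFIER, e.g. '{8}' or '{7,8}'.
digit_runs() {
    grep -oE '\b[0-9]'"$1"'\b' <<< "$sample"
}

digit_runs '{8}'     # exactly 8 digits
digit_runs '{8,}'    # 8 or more digits
digit_runs '{7,8}'   # between 7 and 8 digits
```

The `\b` anchors matter here: without them, `[0-9]{8}` would also match the first eight digits of the nine-digit number.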

Regular Expressions in Shell Scripts

Let's create a practical shell script that validates user input using regular expressions:

bash
#!/bin/bash

# Function to validate email address
validate_email() {
    local email=$1
    local regex="^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
    
    if [[ $email =~ $regex ]]; then
        echo "Valid email address!"
        return 0
    else
        echo "Invalid email address!"
        return 1
    fi
}

# Function to validate IP address
validate_ip() {
    local ip=$1
    local regex="^([0-9]{1,3}\.){3}[0-9]{1,3}$"
    
    if [[ $ip =~ $regex ]]; then
        # Further check each octet
        IFS='.' read -r -a octets <<< "$ip"
        for octet in "${octets[@]}"; do
            # Force base 10 so an octet with a leading zero (e.g. 08) is not parsed as octal
            if (( 10#$octet > 255 )); then
                echo "Invalid IP address: octet $octet exceeds 255!"
                return 1
            fi
        done
        echo "Valid IP address!"
        return 0
    else
        echo "Invalid IP address format!"
        return 1
    fi
}

# Main script
echo "===== Input Validation with Regular Expressions ====="
echo

# Email validation
read -p "Enter an email address: " email_input
validate_email "$email_input"
echo

# IP validation
read -p "Enter an IP address: " ip_input
validate_ip "$ip_input"

Example Usage and Output:

===== Input Validation with Regular Expressions =====

Enter an email address: [email protected]
Valid email address!

Enter an IP address: 192.168.1.1
Valid IP address!
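Besides answering yes or no, bash's `=~` operator stores the text matched by each parenthesized group in the `BASH_REMATCH` array. The sketch below uses a hypothetical `parse_date` function and a YYYY-MM-DD format chosen only for illustration; it checks the shape of the date, not whether the month and day are in range.

```bash
#!/usr/bin/env bash
# parse_date STRING: split a YYYY-MM-DD string into its parts.
parse_date() {
    local regex='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
    if [[ $1 =~ $regex ]]; then
        # BASH_REMATCH[0] is the whole match; [1], [2], [3] are the groups.
        echo "year=${BASH_REMATCH[1]} month=${BASH_REMATCH[2]} day=${BASH_REMATCH[3]}"
    else
        echo "not a YYYY-MM-DD date: $1" >&2
        return 1
    fi
}

parse_date '2024-04-25'
```

Keeping the pattern in a variable and leaving `$regex` unquoted on the right-hand side of `=~` avoids bash treating it as a literal string.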

Advanced Regex Techniques

Lookahead and Lookbehind Assertions

These allow you to match patterns only if they're followed or preceded by another pattern:

bash
# Positive lookahead: Match 'ubuntu' only if followed by 'linux'
$ grep -P 'ubuntu(?=linux)' file.txt

# Negative lookahead: Match 'ubuntu' only if NOT followed by 'linux'
$ grep -P 'ubuntu(?!linux)' file.txt

# Positive lookbehind: Match 'linux' only if preceded by 'ubuntu'
$ grep -P '(?<=ubuntu)linux' file.txt

# Negative lookbehind: Match 'linux' only if NOT preceded by 'ubuntu'
$ grep -P '(?<!ubuntu)linux' file.txt

Note: These require the -P (Perl-compatible) flag in grep.
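Lookarounds are especially handy together with `-o`, because the assertion is checked but not included in the printed match. The following sketch assumes a grep build with PCRE support (the default on Ubuntu); the `user=` key, the "ms" suffix, and the function names are invented for the example.

```bash
#!/usr/bin/env bash
# user_of: print the value that follows "user=" on each input line.
# The lookbehind requires "user=" before the match without printing it.
user_of() {
    grep -oP '(?<=user=)\S+'
}

# millis_of: print numbers that are immediately followed by "ms".
# The lookahead requires "ms" after the match without printing it.
millis_of() {
    grep -oP '[0-9]+(?=ms)'
}

echo 'login ok user=alice ip=10.0.0.5' | user_of
echo 'took 15ms of 200 allowed' | millis_of
```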

Backreferences

Backreferences allow you to match the same text that was matched by a capturing group:

bash
# Match repeated words (-i lets "The" and "the" count as the same word)
$ grep -iE '\b(\w+)\s+\1\b' file.txt

# Example output for "The the cat sat on the mat":
The the cat sat on the mat
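Backreferences can be used in the replacement part of a sed command as well as in the pattern. As a sketch, assuming GNU sed (for `\b`, `\w`, and `\s`) and a hypothetical function name `dedupe_words`, a doubled word can be collapsed into one:

```bash
#!/usr/bin/env bash
# dedupe_words: replace an immediately repeated word with a single copy.
# \1 in the pattern requires the same word again; \1 in the replacement
# writes it back once.
dedupe_words() {
    sed -E 's/\b(\w+)\s+\1\b/\1/g'
}

echo 'the the cat sat on on the mat' | dedupe_words
```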

Real-world Application: Log File Analysis

Here's a practical example of using regular expressions to extract information from log files:

bash
#!/bin/bash

# Script to analyze Apache access log

LOG_FILE="/var/log/apache2/access.log"

# Count total number of requests
total_requests=$(wc -l < "$LOG_FILE")
echo "Total Requests: $total_requests"

# Count 404 errors
not_found=$(grep -c ' 404 ' "$LOG_FILE")
echo "404 Not Found Errors: $not_found"

# Count unique IP addresses
unique_ips=$(grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' "$LOG_FILE" | sort -u | wc -l)
echo "Unique IP Addresses: $unique_ips"

# Find the most requested URL
echo "Most Requested URL:"
grep -oE 'GET [^ ]+' "$LOG_FILE" | sort | uniq -c | sort -nr | head -1

# Extract all requests from a specific IP
read -p "Enter an IP to show its requests: " target_ip
echo "Requests from $target_ip:"
# -F treats the input as a fixed string, so the dots in the IP are not wildcards
grep -F -- "$target_ip" "$LOG_FILE"

# Find all requests made during a specific hour
echo "Requests between 10:00 and 11:00:"
# Assuming Apache's default timestamp layout, e.g. [10/Oct/2023:10:15:32 +0000],
# the leading ':' ties the match to the hour field rather than the year
grep -E ':10:[0-5][0-9]:[0-5][0-9] ' "$LOG_FILE"
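The same building blocks can be combined into a per-status summary. This is only a sketch: it assumes log lines in Apache's common or combined format, where the status code follows the quoted request, and the function name `status_counts` and the sample lines are made up.

```bash
#!/usr/bin/env bash
# status_counts: read access-log lines on stdin and print
# "<status> <count>" for each HTTP status code seen.
status_counts() {
    grep -oE '" [0-9]{3} ' |   # the closing quote of the request, then the status
        tr -d '" ' |           # keep only the three digits
        sort | uniq -c |       # count identical codes
        awk '{ print $2, $1 }' # put the status first, the count second
}

# Hypothetical sample log lines.
printf '%s\n' \
    '192.168.1.10 - - [10/Oct/2023:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1024' \
    '192.168.1.11 - - [10/Oct/2023:10:16:01 +0000] "GET /missing HTTP/1.1" 404 512' \
    '192.168.1.10 - - [10/Oct/2023:11:02:45 +0000] "GET /about.html HTTP/1.1" 200 2048' |
    status_counts
```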

Working with Regular Expressions Interactively

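For testing and learning regular expressions, several online tools are available. On Ubuntu you can also experiment directly in the terminal by piping sample text into grep: the --color option highlights the matched text, and -o prints only the part that matched.

bash
# Highlight the matched text
$ echo "This is an ubuntu system" | grep --color=always 'ubuntu'

# Print only the part that matched
$ echo "This is an ubuntu system" | grep -o 'ubuntu'
ubuntu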

Summary

Regular expressions are invaluable tools for text processing in Ubuntu shell scripting. In this guide, we've covered:

Basic regex syntax and metacharacters
Common patterns and examples
Using regex with popular Ubuntu tools (grep, sed, awk)
Character classes and quantifiers
Implementing regex in shell scripts
Advanced techniques like lookahead/lookbehind and backreferences
Real-world applications for log file analysis

With practice, you'll be able to craft precise patterns to match exactly what you need, making your text processing tasks much more efficient.

Exercises for Practice

Write a regular expression to match valid Ubuntu version numbers (e.g., 20.04, 22.10).
Create a shell script that uses regular expressions to validate passwords based on these rules:
- At least 8 characters
- Contains at least one uppercase letter
- Contains at least one lowercase letter
- Contains at least one number
- Contains at least one special character
Write a command using grep to extract all URLs from an HTML file.
Create a sed command to convert dates from MM/DD/YYYY format to YYYY-MM-DD format in a text file.
Write an awk script to extract and sum all numbers that appear after the word "Total:" in a log file.

Additional Resources

The POSIX regular expression manual page: man 7 regex
Ubuntu Community Help: Regular Expressions (help.ubuntu.com)
Book: "Mastering Regular Expressions" by Jeffrey Friedl
Online regex testing tools: RegExr and Regex101
man grep, man sed, and man awk for detailed documentation on these tools

Remember that the key to mastering regular expressions is practice. Start with simple patterns and gradually tackle more complex ones as your understanding grows.
