Rabin-Karp Algorithm

Introduction

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns within text. Created by Michael O. Rabin and Richard M. Karp in 1987, it's particularly useful for multiple pattern searches and plagiarism detection. Unlike brute force methods that compare each character individually, Rabin-Karp uses a clever mathematical approach to make string matching more efficient.

This algorithm is especially valuable when:

You need to search for a pattern in a large body of text
You want to search for multiple patterns simultaneously
You're working on applications like plagiarism detection or DNA sequence matching

How Rabin-Karp Works

Core Concepts

The Rabin-Karp algorithm is built around three key ideas:

Hash Functions: Converting strings into numerical values
Rolling Hash: Efficiently computing hash values for successive substrings
Hash Comparison: Comparing hash values before doing character-by-character matching

Hash Function

A hash function converts a string into a numerical value. For Rabin-Karp, we need a hash function that can be updated efficiently as we slide our window through the text.

The common approach uses polynomial hash functions:

For a string s of length m, the hash value is calculated as:

hash(s) = s[0] * b^(m-1) + s[1] * b^(m-2) + ... + s[m-1] * b^0

where:

s[i] is the ASCII (or other encoding) value of character at position i
b is a base value (typically a prime number)

Rolling Hash

The real power of Rabin-Karp comes from the rolling hash technique. When we shift our window by one character, we can calculate the new hash from the previous hash in O(1) time:

hash(s[1...m]) = (hash(s[0...m-1]) - s[0] * b^(m-1)) * b + s[m]

To avoid working with very large numbers, we typically use modular arithmetic with a large prime number.

Algorithm Steps

Calculate the hash value of the pattern
Calculate the hash value of the first window of text (same length as the pattern)
For each possible position in the text:
- Compare the hash values of the pattern and the current text window
- If the hashes match, verify that the characters actually match
- Slide the window by one character and update the hash value

Implementation

Here's a Python implementation of the Rabin-Karp algorithm:

python
def rabin_karp(text, pattern):
    # Base values for the rolling hash
    base = 256
    prime = 101
    
    n = len(text)
    m = len(pattern)
    results = []
    
    # Calculate hash value for pattern and first window of text
    pattern_hash = 0
    text_hash = 0
    highest_power = 1
    
    # Calculate the value of base^(m-1) % prime
    for i in range(m-1):
        highest_power = (highest_power * base) % prime
    
    # Calculate initial hash values
    for i in range(m):
        pattern_hash = (base * pattern_hash + ord(pattern[i])) % prime
        text_hash = (base * text_hash + ord(text[i])) % prime
    
    # Slide the pattern over text one by one
    for i in range(n - m + 1):
        # Check if the hash values match
        if pattern_hash == text_hash:
            # Verify character by character
            match = True
            for j in range(m):
                if text[i + j] != pattern[j]:
                    match = False
                    break
            if match:
                results.append(i)
        
        # Calculate hash value for next window
        if i < n - m:
            text_hash = (base * (text_hash - ord(text[i]) * highest_power) + ord(text[i + m])) % prime
            # We might get negative value, converting it to positive
            if text_hash < 0:
                text_hash += prime
    
    return results

Example Usage

python
text = "AABABACABABABA"
pattern = "ABABABA"
matches = rabin_karp(text, pattern)
print(f"Pattern found at positions: {matches}")

# Output: Pattern found at positions: [7]

Step-by-Step Example

Let's trace through a simple example:

Text: "AABABAC"
Pattern: "ABA"
Base (b): 256
Prime: 101

Step 1: Calculate hash values

Pattern "ABA" hash: (65 * 256^2 + 66 * 256^1 + 65 * 256^0) % 101 = 67
First window "AAB" hash: (65 * 256^2 + 65 * 256^1 + 66 * 256^0) % 101 = 31

Step 2: Start sliding window

Compare hash of "AAB" (31) with pattern (67) → Different, continue
Calculate rolling hash for "ABA":
- Remove 'A': (31 - 65 * 256^2) * 256
- Add 'A': + 65
- Apply modulo: % 101
- Result: 67
Compare hash of "ABA" (67) with pattern (67) → Same! Check characters:
- "ABA" matches "ABA" → Found at position 1
Continue this process...

Real-World Applications

1. Plagiarism Detection

Rabin-Karp is commonly used in plagiarism detection software. By breaking documents into smaller chunks and hashing them, the algorithm can quickly identify matching sections between documents.

python
def check_plagiarism(document1, document2, chunk_size=5):
    doc1_chunks = []
    # Create chunks of document1
    for i in range(len(document1) - chunk_size + 1):
        doc1_chunks.append(document1[i:i+chunk_size])
    
    matching_chunks = 0
    # Check each possible chunk of document2
    for i in range(len(document2) - chunk_size + 1):
        chunk = document2[i:i+chunk_size]
        if chunk in doc1_chunks:
            matching_chunks += 1
    
    # Calculate similarity percentage
    total_possible_chunks = len(document1) - chunk_size + 1 + len(document2) - chunk_size + 1
    similarity = (matching_chunks * 2 / total_possible_chunks) * 100
    return similarity

2. DNA Sequence Matching

In bioinformatics, Rabin-Karp is used to find specific sequences within DNA strings:

python
def find_dna_sequence(genome, target_sequence):
    positions = rabin_karp(genome, target_sequence)
    if positions:
        return f"Target DNA sequence found at positions: {positions}"
    else:
        return "Target DNA sequence not found in the genome"

3. Multiple Pattern Matching

One of Rabin-Karp's advantages is its ability to search for multiple patterns in one pass:

python
def multi_pattern_rabin_karp(text, patterns):
    results = {}
    pattern_hashes = {}
    
    # Pre-compute all pattern hashes
    for pattern in patterns:
        pattern_hash = 0
        for char in pattern:
            pattern_hash = (256 * pattern_hash + ord(char)) % 101
        pattern_hashes[pattern] = pattern_hash
        results[pattern] = []
    
    # Search for each pattern
    for pattern in patterns:
        matches = rabin_karp(text, pattern)
        results[pattern] = matches
    
    return results

Performance Analysis

Time Complexity:
- Average case: O(n + m) where n is the length of the text and m is the length of the pattern
- Worst case: O(n*m) when there are many hash collisions
Space Complexity: O(1) for the standard algorithm, O(k) when searching for k patterns

The key advantage of Rabin-Karp over naive string matching is its ability to skip unnecessary character comparisons using the hashing technique.

Limitations and Considerations

Hash Collisions: Different strings can have the same hash value, requiring character-by-character verification
Base and Prime Selection: The choice of base and prime affects performance and collision frequency
Rolling Hash Precision: With very large texts, there's a risk of hash value overflow

Summary

The Rabin-Karp algorithm provides an efficient approach to string matching through smart use of hash functions. It performs especially well when:

Searching for multiple patterns simultaneously
Working with applications like plagiarism detection
Pattern matching is a frequent operation

The algorithm's ability to compute hashes in constant time when sliding the window makes it substantially faster than naive approaches in many real-world scenarios.

Exercises

Implement the Rabin-Karp algorithm to find all occurrences of a pattern in a text.
Modify the algorithm to search for multiple patterns in a single text in one pass.
Experiment with different base and prime values. How do they affect the algorithm's performance?
Implement a simple plagiarism detector using Rabin-Karp to compare two documents.
Compare the performance of Rabin-Karp with other string matching algorithms like Naive, KMP, and Boyer-Moore for different input sizes.

Additional Resources

"Introduction to Algorithms" by Cormen, Leiserson, Rivest, and Stein
"Algorithms on Strings, Trees, and Sequences" by Dan Gusfield
Stanford's CS Education Library on String Algorithms

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

How Rabin-Karp Works​

Core Concepts​

Hash Function​

Rolling Hash​

Algorithm Steps​

Implementation​

Example Usage​

Step-by-Step Example​

Step 1: Calculate hash values​

Step 2: Start sliding window​

Real-World Applications​

1. Plagiarism Detection​

2. DNA Sequence Matching​

3. Multiple Pattern Matching​

Performance Analysis​

Limitations and Considerations​

Summary​

Exercises​

Additional Resources​