Rabin-Karp Algorithm
Introduction
The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns within text. Created by Michael O. Rabin and Richard M. Karp in 1987, it's particularly useful for multiple pattern searches and plagiarism detection. Unlike brute force methods that compare each character individually, Rabin-Karp uses a clever mathematical approach to make string matching more efficient.
This algorithm is especially valuable when:
- You need to search for a pattern in a large body of text
- You want to search for multiple patterns simultaneously
- You're working on applications like plagiarism detection or DNA sequence matching
How Rabin-Karp Works
Core Concepts
The Rabin-Karp algorithm is built around three key ideas:
- Hash Functions: Converting strings into numerical values
- Rolling Hash: Efficiently computing hash values for successive substrings
- Hash Comparison: Comparing hash values before doing character-by-character matching
Hash Function
A hash function converts a string into a numerical value. For Rabin-Karp, we need a hash function that can be updated efficiently as we slide our window through the text.
The common approach uses polynomial hash functions:
For a string s
of length m
, the hash value is calculated as:
hash(s) = s[0] * b^(m-1) + s[1] * b^(m-2) + ... + s[m-1] * b^0
where:
s[i]
is the ASCII (or other encoding) value of character at position ib
is a base value (typically a prime number)
Rolling Hash
The real power of Rabin-Karp comes from the rolling hash technique. When we shift our window by one character, we can calculate the new hash from the previous hash in O(1) time:
hash(s[1...m]) = (hash(s[0...m-1]) - s[0] * b^(m-1)) * b + s[m]
To avoid working with very large numbers, we typically use modular arithmetic with a large prime number.
Algorithm Steps
- Calculate the hash value of the pattern
- Calculate the hash value of the first window of text (same length as the pattern)
- For each possible position in the text:
- Compare the hash values of the pattern and the current text window
- If the hashes match, verify that the characters actually match
- Slide the window by one character and update the hash value
Implementation
Here's a Python implementation of the Rabin-Karp algorithm:
def rabin_karp(text, pattern):
# Base values for the rolling hash
base = 256
prime = 101
n = len(text)
m = len(pattern)
results = []
# Calculate hash value for pattern and first window of text
pattern_hash = 0
text_hash = 0
highest_power = 1
# Calculate the value of base^(m-1) % prime
for i in range(m-1):
highest_power = (highest_power * base) % prime
# Calculate initial hash values
for i in range(m):
pattern_hash = (base * pattern_hash + ord(pattern[i])) % prime
text_hash = (base * text_hash + ord(text[i])) % prime
# Slide the pattern over text one by one
for i in range(n - m + 1):
# Check if the hash values match
if pattern_hash == text_hash:
# Verify character by character
match = True
for j in range(m):
if text[i + j] != pattern[j]:
match = False
break
if match:
results.append(i)
# Calculate hash value for next window
if i < n - m:
text_hash = (base * (text_hash - ord(text[i]) * highest_power) + ord(text[i + m])) % prime
# We might get negative value, converting it to positive
if text_hash < 0:
text_hash += prime
return results
Example Usage
text = "AABABACABABABA"
pattern = "ABABABA"
matches = rabin_karp(text, pattern)
print(f"Pattern found at positions: {matches}")
# Output: Pattern found at positions: [7]
Step-by-Step Example
Let's trace through a simple example:
- Text: "AABABAC"
- Pattern: "ABA"
- Base (b): 256
- Prime: 101
Step 1: Calculate hash values
- Pattern "ABA" hash: (65 * 256^2 + 66 * 256^1 + 65 * 256^0) % 101 = 67
- First window "AAB" hash: (65 * 256^2 + 65 * 256^1 + 66 * 256^0) % 101 = 31
Step 2: Start sliding window
- Compare hash of "AAB" (31) with pattern (67) → Different, continue
- Calculate rolling hash for "ABA":
- Remove 'A': (31 - 65 * 256^2) * 256
- Add 'A': + 65
- Apply modulo: % 101
- Result: 67
- Compare hash of "ABA" (67) with pattern (67) → Same! Check characters:
- "ABA" matches "ABA" → Found at position 1
- Continue this process...
Real-World Applications
1. Plagiarism Detection
Rabin-Karp is commonly used in plagiarism detection software. By breaking documents into smaller chunks and hashing them, the algorithm can quickly identify matching sections between documents.
def check_plagiarism(document1, document2, chunk_size=5):
doc1_chunks = []
# Create chunks of document1
for i in range(len(document1) - chunk_size + 1):
doc1_chunks.append(document1[i:i+chunk_size])
matching_chunks = 0
# Check each possible chunk of document2
for i in range(len(document2) - chunk_size + 1):
chunk = document2[i:i+chunk_size]
if chunk in doc1_chunks:
matching_chunks += 1
# Calculate similarity percentage
total_possible_chunks = len(document1) - chunk_size + 1 + len(document2) - chunk_size + 1
similarity = (matching_chunks * 2 / total_possible_chunks) * 100
return similarity
2. DNA Sequence Matching
In bioinformatics, Rabin-Karp is used to find specific sequences within DNA strings:
def find_dna_sequence(genome, target_sequence):
positions = rabin_karp(genome, target_sequence)
if positions:
return f"Target DNA sequence found at positions: {positions}"
else:
return "Target DNA sequence not found in the genome"
3. Multiple Pattern Matching
One of Rabin-Karp's advantages is its ability to search for multiple patterns in one pass:
def multi_pattern_rabin_karp(text, patterns):
results = {}
pattern_hashes = {}
# Pre-compute all pattern hashes
for pattern in patterns:
pattern_hash = 0
for char in pattern:
pattern_hash = (256 * pattern_hash + ord(char)) % 101
pattern_hashes[pattern] = pattern_hash
results[pattern] = []
# Search for each pattern
for pattern in patterns:
matches = rabin_karp(text, pattern)
results[pattern] = matches
return results
Performance Analysis
-
Time Complexity:
- Average case: O(n + m) where n is the length of the text and m is the length of the pattern
- Worst case: O(n*m) when there are many hash collisions
-
Space Complexity: O(1) for the standard algorithm, O(k) when searching for k patterns
The key advantage of Rabin-Karp over naive string matching is its ability to skip unnecessary character comparisons using the hashing technique.
Limitations and Considerations
- Hash Collisions: Different strings can have the same hash value, requiring character-by-character verification
- Base and Prime Selection: The choice of base and prime affects performance and collision frequency
- Rolling Hash Precision: With very large texts, there's a risk of hash value overflow
Summary
The Rabin-Karp algorithm provides an efficient approach to string matching through smart use of hash functions. It performs especially well when:
- Searching for multiple patterns simultaneously
- Working with applications like plagiarism detection
- Pattern matching is a frequent operation
The algorithm's ability to compute hashes in constant time when sliding the window makes it substantially faster than naive approaches in many real-world scenarios.
Exercises
- Implement the Rabin-Karp algorithm to find all occurrences of a pattern in a text.
- Modify the algorithm to search for multiple patterns in a single text in one pass.
- Experiment with different base and prime values. How do they affect the algorithm's performance?
- Implement a simple plagiarism detector using Rabin-Karp to compare two documents.
- Compare the performance of Rabin-Karp with other string matching algorithms like Naive, KMP, and Boyer-Moore for different input sizes.
Additional Resources
- "Introduction to Algorithms" by Cormen, Leiserson, Rivest, and Stein
- "Algorithms on Strings, Trees, and Sequences" by Dan Gusfield
- Stanford's CS Education Library on String Algorithms
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)