
Rabin-Karp Algorithm

Introduction

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns within text. Created by Michael O. Rabin and Richard M. Karp in 1987, it's particularly useful for multiple pattern searches and plagiarism detection. Instead of comparing the pattern with every text window character by character, as brute-force search does, Rabin-Karp compares hash values first and checks individual characters only when the hashes agree.

This algorithm is especially valuable when:

  • You need to search for a pattern in a large body of text
  • You want to search for multiple patterns simultaneously
  • You're working on applications like plagiarism detection or DNA sequence matching

How Rabin-Karp Works

Core Concepts

The Rabin-Karp algorithm is built around three key ideas:

  1. Hash Functions: Converting strings into numerical values
  2. Rolling Hash: Efficiently computing hash values for successive substrings
  3. Hash Comparison: Comparing hash values before doing character-by-character matching

Hash Function

A hash function converts a string into a numerical value. For Rabin-Karp, we need a hash function that can be updated efficiently as we slide our window through the text.

The common approach uses polynomial hash functions:

For a string s of length m, the hash value is calculated as:

hash(s) = s[0] * b^(m-1) + s[1] * b^(m-2) + ... + s[m-1] * b^0

where:

  • s[i] is the ASCII (or other encoding) value of character at position i
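  • b is a base value (commonly the alphabet size, such as 256 for 8-bit characters; it is the modulus applied to the hash that is usually chosen to be prime)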
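The formula above can be evaluated without computing any powers explicitly by using Horner's rule. The following is a minimal sketch; the function name `polynomial_hash` is made up for illustration, and no modulus is applied yet.

```python
def polynomial_hash(s, base=256):
    """Evaluate s[0]*base^(m-1) + ... + s[m-1]*base^0 using Horner's rule."""
    value = 0
    for ch in s:
        # Multiplying by base shifts every earlier character up by one power
        value = value * base + ord(ch)
    return value


# Direct evaluation of the formula for the string "ABA", for comparison
direct = ord("A") * 256**2 + ord("B") * 256**1 + ord("A") * 256**0
print(polynomial_hash("ABA") == direct)  # True
```

Without a modulus the value grows with the length of the string, which is why practical implementations reduce it at every step.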

Rolling Hash

The real power of Rabin-Karp comes from the rolling hash technique. When we shift our window by one character, we can calculate the new hash from the previous hash in O(1) time:

hash(s[1...m]) = (hash(s[0...m-1]) - s[0] * b^(m-1)) * b + s[m]

To avoid working with very large numbers, we typically use modular arithmetic with a large prime number.
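As a rough sketch of the rolling update with a modulus, the snippet below rolls a window across a short text and checks each result against a hash computed from scratch. The names `window_hash` and `roll` and the modulus 1,000,000,007 are assumptions chosen for this illustration.

```python
BASE = 256
MOD = 1_000_000_007  # assumed modulus; any large prime would do for this sketch


def window_hash(s):
    """Hash a whole window from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def roll(prev_hash, outgoing, incoming, m):
    """Drop `outgoing` from the front of an m-character window and append `incoming`."""
    # In a real search this power is computed once, outside the loop
    high = pow(BASE, m - 1, MOD)
    return ((prev_hash - ord(outgoing) * high) * BASE + ord(incoming)) % MOD


text = "AABABAC"
m = 3
h = window_hash(text[:m])
for i in range(1, len(text) - m + 1):
    h = roll(h, text[i - 1], text[i + m - 1], m)
    # The rolled hash must agree with a hash recomputed from scratch
    assert h == window_hash(text[i:i + m])
```

In Python the `%` operator returns a non-negative result for a positive modulus, so the subtraction needs no extra sign correction here; other languages may need one.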

Algorithm Steps

  1. Calculate the hash value of the pattern
  2. Calculate the hash value of the first window of text (same length as the pattern)
  3. For each possible position in the text:
    • Compare the hash values of the pattern and the current text window
    • If the hashes match, verify that the characters actually match
    • Slide the window by one character and update the hash value

Implementation

Here's a Python implementation of the Rabin-Karp algorithm:

python
def rabin_karp(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    # Base and modulus for the rolling hash (a small prime, chosen for illustration)
    base = 256
    prime = 101

    n = len(text)
    m = len(pattern)
    results = []

    # An empty pattern, or one longer than the text, is treated as having no matches
    if m == 0 or m > n:
        return results

    # Calculate hash value for pattern and first window of text
    pattern_hash = 0
    text_hash = 0

    # Value of base^(m-1) % prime: the weight of the window's leading character
    highest_power = pow(base, m - 1, prime)

    # Calculate initial hash values
    for i in range(m):
        pattern_hash = (base * pattern_hash + ord(pattern[i])) % prime
        text_hash = (base * text_hash + ord(text[i])) % prime

    # Slide the pattern over text one by one
    for i in range(n - m + 1):
        # Check if the hash values match
        if pattern_hash == text_hash:
            # Verify the characters to rule out a hash collision
            if text[i:i + m] == pattern:
                results.append(i)

        # Calculate hash value for next window
        if i < n - m:
            # Python's % is non-negative for a positive modulus,
            # so no fix-up for negative values is needed
            text_hash = (base * (text_hash - ord(text[i]) * highest_power) + ord(text[i + m])) % prime

    return results

Example Usage

python
text = "AABABACABABABA"
pattern = "ABABABA"
matches = rabin_karp(text, pattern)
print(f"Pattern found at positions: {matches}")

# Output: Pattern found at positions: [7]
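A brute-force search makes a convenient reference when checking the output of any Rabin-Karp implementation. The helper below is a hypothetical one written for this purpose, not part of the algorithm itself.

```python
def naive_search(text, pattern):
    """Return every start index where pattern occurs, by direct comparison."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(naive_search("AABABACABABABA", "ABABABA"))  # [7]
print(naive_search("AABABAC", "ABA"))             # [1, 3] (overlapping matches are reported)
```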

Step-by-Step Example

Let's trace through a simple example:

  • Text: "AABABAC"
  • Pattern: "ABA"
  • Base (b): 256
  • Prime: 101

Step 1: Calculate hash values

  • Pattern "ABA" hash: (65 * 256^2 + 66 * 256^1 + 65 * 256^0) % 101 = 57
  • First window "AAB" hash: (65 * 256^2 + 65 * 256^1 + 66 * 256^0) % 101 = 4

Step 2: Start sliding window

  1. Compare hash of "AAB" (4) with pattern (57) → Different, continue
  2. Calculate rolling hash for "ABA":
    • Remove 'A': (4 - 65 * 256^2) * 256
    • Add 'A': + 65
    • Apply modulo: % 101
    • Result: 57
  3. Compare hash of "ABA" (57) with pattern (57) → Same! Check characters:
    • "ABA" matches "ABA" → Found at position 1
  4. Continue this process for the remaining windows; "ABA" is found again at position 3
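The arithmetic in a trace like this is easy to get wrong by hand, so it can help to let a few lines of code print the hash of every window. This is only a checking aid under the same base and prime as the example; it recomputes each hash from scratch rather than rolling it, and the name `window_hash` is made up here.

```python
BASE, PRIME = 256, 101
text, pattern = "AABABAC", "ABA"
m = len(pattern)


def window_hash(s):
    """Polynomial hash of s, reduced modulo PRIME at every step."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % PRIME
    return h


pattern_hash = window_hash(pattern)
trace = []
for i in range(len(text) - m + 1):
    window = text[i:i + m]
    h = window_hash(window)
    # A position is a match only if the hash agrees and the characters agree
    is_match = h == pattern_hash and window == pattern
    trace.append((i, window, h, is_match))

for row in trace:
    print(row)
```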

Real-World Applications

1. Plagiarism Detection

Rabin-Karp is commonly used in plagiarism detection software. By breaking documents into smaller chunks and hashing them, the algorithm can quickly identify matching sections between documents.

python
def check_plagiarism(document1, document2, chunk_size=5):
    # Number of chunk positions in each document (0 if shorter than chunk_size)
    count1 = max(len(document1) - chunk_size + 1, 0)
    count2 = max(len(document2) - chunk_size + 1, 0)
    if count1 == 0 or count2 == 0:
        return 0.0

    # Create chunks of document1; a set hashes each chunk,
    # so membership tests take O(1) on average instead of scanning a list
    doc1_chunks = {document1[i:i + chunk_size] for i in range(count1)}

    matching_chunks = 0
    # Check each possible chunk of document2
    for i in range(count2):
        if document2[i:i + chunk_size] in doc1_chunks:
            matching_chunks += 1

    # Calculate similarity percentage (a rough score; repeated chunks
    # could push it past 100, so it is capped)
    total_possible_chunks = count1 + count2
    similarity = (matching_chunks * 2 / total_possible_chunks) * 100
    return min(similarity, 100.0)
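To connect chunk comparison more directly with the rolling hash, the sketch below fingerprints every k-character window of a document in a single pass and intersects the two fingerprint sets. The function name `kgram_hashes`, the modulus, and the sample strings are assumptions for illustration; a real detector would also verify the matching text, because equal hashes do not guarantee equal chunks.

```python
def kgram_hashes(text, k, base=256, mod=1_000_000_007):
    """Return the set of rolling hashes of every k-character window of text."""
    if k <= 0 or k > len(text):
        return set()
    high = pow(base, k - 1, mod)  # weight of the window's leading character
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = {h}
    for i in range(len(text) - k):
        # Drop text[i] from the front, append text[i + k] at the back
        h = ((h - ord(text[i]) * high) * base + ord(text[i + k])) % mod
        hashes.add(h)
    return hashes


a = kgram_hashes("the quick brown fox", 5)
b = kgram_hashes("a quick brown dog", 5)
shared = a & b
print(len(shared))  # at least 9: the 5-grams of the common run " quick brown "
```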

2. DNA Sequence Matching

In bioinformatics, Rabin-Karp is used to find specific sequences within DNA strings:

python
def find_dna_sequence(genome, target_sequence):
    positions = rabin_karp(genome, target_sequence)
    if positions:
        return f"Target DNA sequence found at positions: {positions}"
    else:
        return "Target DNA sequence not found in the genome"
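Because DNA uses only four letters, one possible variation is to map each nucleotide to a digit from 0 to 3 and use base 4 instead of 256. The sketch below assumes the input contains only A, C, G and T; the names `CODE` and `find_dna` and the modulus are illustrative choices.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 4-letter alphabet


def find_dna(genome, target, mod=1_000_000_007):
    """Return all start positions of target in genome, using a base-4 rolling hash."""
    n, m = len(genome), len(target)
    if m == 0 or m > n:
        return []
    high = pow(4, m - 1, mod)
    t_hash = g_hash = 0
    for i in range(m):
        t_hash = (4 * t_hash + CODE[target[i]]) % mod
        g_hash = (4 * g_hash + CODE[genome[i]]) % mod
    positions = []
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree
        if g_hash == t_hash and genome[i:i + m] == target:
            positions.append(i)
        if i < n - m:
            g_hash = (4 * (g_hash - CODE[genome[i]] * high) + CODE[genome[i + m]]) % mod
    return positions


print(find_dna("ACGTACGTGACG", "ACG"))  # [0, 4, 9]
```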

3. Multiple Pattern Matching

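One of Rabin-Karp's advantages is that patterns of the same length can share a single pass over the text: the hash of each window is looked up in a table of pattern hashes. The version below groups the patterns by length and makes one pass per distinct length:

python
def multi_pattern_rabin_karp(text, patterns):
    base, prime = 256, 101
    results = {pattern: [] for pattern in patterns}

    # Group patterns by length; each length needs its own window size
    by_length = {}
    for pattern in results:
        if pattern and len(pattern) <= len(text):
            by_length.setdefault(len(pattern), []).append(pattern)

    for m, group in by_length.items():
        # Pre-compute all pattern hashes for this length
        pattern_hashes = {}
        for pattern in group:
            pattern_hash = 0
            for char in pattern:
                pattern_hash = (base * pattern_hash + ord(char)) % prime
            pattern_hashes.setdefault(pattern_hash, []).append(pattern)

        # One pass over the text, rolling a single window hash
        highest_power = pow(base, m - 1, prime)
        text_hash = 0
        for char in text[:m]:
            text_hash = (base * text_hash + ord(char)) % prime
        for i in range(len(text) - m + 1):
            # Verify only the patterns whose hash equals the window hash
            for pattern in pattern_hashes.get(text_hash, ()):
                if text[i:i + m] == pattern:
                    results[pattern].append(i)
            if i < len(text) - m:
                text_hash = (base * (text_hash - ord(text[i]) * highest_power) + ord(text[i + m])) % prime

    return results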

Performance Analysis

  • Time Complexity:

    • Average case: O(n + m) where n is the length of the text and m is the length of the pattern
    • Worst case: O(n*m) when there are many hash collisions
  • Space Complexity: O(1) for the standard algorithm, O(k) when searching for k patterns

The key advantage of Rabin-Karp over naive string matching is its ability to skip unnecessary character comparisons using the hashing technique.
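The gap between the average and worst case can be made visible by counting the character comparisons spent on verification. The following sketch uses an illustrative helper, `verification_comparisons`, with the same base and prime as the implementation in this article.

```python
def verification_comparisons(text, pattern, base=256, prime=101):
    """Count character comparisons made while verifying hash hits."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, prime)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (base * p_hash + ord(pattern[i])) % prime
        t_hash = (base * t_hash + ord(text[i])) % prime
    comparisons = 0
    for i in range(n - m + 1):
        if t_hash == p_hash:
            # A hash hit forces up to m character comparisons
            for j in range(m):
                comparisons += 1
                if text[i + j] != pattern[j]:
                    break
        if i < n - m:
            t_hash = (base * (t_hash - ord(text[i]) * high) + ord(text[i + m])) % prime
    return comparisons


text = "a" * 1000
print(verification_comparisons(text, "a" * 10))       # 9910: every window is a hit
print(verification_comparisons(text, "a" * 9 + "b"))  # 0: no window hash matches
```

When every window is a hit, whether a true match or a collision, the verification work approaches n*m; when hits are rare, the scan stays close to linear.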

Limitations and Considerations

  1. Hash Collisions: Different strings can have the same hash value, requiring character-by-character verification
  2. Base and Prime Selection: The choice of base and prime affects performance and collision frequency
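  3. Hash Value Overflow: In languages with fixed-width integers, intermediate products in the rolling hash can overflow unless the base and modulus are kept small enough; Python integers have arbitrary precision, but reducing modulo the prime at every step still keeps the numbers small and the arithmetic fast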
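A concrete collision shows why verification cannot be skipped and why the modulus matters. With base 256 and prime 101, the strings "AB" and "Bq" hash to the same value; the helper name `poly_hash` and the larger modulus are assumptions for this sketch.

```python
def poly_hash(s, base=256, prime=101):
    """Polynomial hash of s, reduced modulo prime at every step."""
    h = 0
    for ch in s:
        h = (base * h + ord(ch)) % prime
    return h


# Two different strings, same hash under the small modulus
print(poly_hash("AB"), poly_hash("Bq"))  # 41 41

# With a much larger prime these two strings no longer collide
big = 1_000_000_007
print(poly_hash("AB", prime=big) == poly_hash("Bq", prime=big))  # False
```

A larger modulus does not eliminate collisions, it only makes them rarer, so the character-by-character check remains necessary for exact matching.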

Summary

The Rabin-Karp algorithm provides an efficient approach to string matching through smart use of hash functions. It performs especially well when:

  • Searching for multiple patterns simultaneously
  • Working with applications like plagiarism detection
  • Pattern matching is a frequent operation

The algorithm's ability to compute hashes in constant time when sliding the window makes it substantially faster than naive approaches in many real-world scenarios.

Exercises

  1. Implement the Rabin-Karp algorithm to find all occurrences of a pattern in a text.
  2. Modify the algorithm to search for multiple patterns in a single text in one pass.
  3. Experiment with different base and prime values. How do they affect the algorithm's performance?
  4. Implement a simple plagiarism detector using Rabin-Karp to compare two documents.
  5. Compare the performance of Rabin-Karp with other string matching algorithms like Naive, KMP, and Boyer-Moore for different input sizes.
