Skip to main content

String Hashing

Introduction

String hashing is a powerful technique that converts strings into numerical values (hash codes), allowing us to perform string operations more efficiently. Instead of comparing strings character by character, which can be time-consuming for large texts, we can compare their hash values in constant time.

This technique forms the foundation for many string algorithms, including efficient substring searching, determining string equality, and finding duplicate strings in a large dataset.

Why String Hashing?

Comparing two strings of length n typically requires O(n) time. However, if we convert strings to hash values, we can compare them in O(1) time. This significant speed improvement makes string hashing particularly valuable for:

  • Substring search problems
  • Checking string equality
  • Finding duplicate strings
  • Implementing string-based data structures like hash maps

How String Hashing Works

At its core, string hashing treats a string as a number in some base (usually a large prime). Let's understand the concept with a simple polynomial hash function.

Polynomial Hash Function

A polynomial hash converts a string into a number using the formula:

H(s) = s[0] + s[1]*p + s[2]*p² + ... + s[n-1]*p^(n-1) mod m

Where:

  • s[i] is the ASCII (or other encoding) value of the ith character
  • p is a prime number (typically 31 for lowercase letters, 53 for mixed case)
  • m is a large modulo value to prevent integer overflow

Let's implement this hash function in code:

cpp
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;

for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}

return hash_value;
}

In Python:

python
def compute_hash(s):
p = 31
m = 10**9 + 9
hash_value = 0
p_pow = 1

for c in s:
hash_value = (hash_value + (ord(c) - ord('a') + 1) * p_pow) % m
p_pow = (p_pow * p) % m

return hash_value

Handling Collisions

A collision occurs when two different strings produce the same hash value. While this is theoretically unavoidable (due to the pigeonhole principle), we can minimize collisions by:

  1. Choosing appropriate values for p and m
  2. Using double hashing (computing two different hash functions)
  3. Using a modulo value that's a large prime number

For most competitive programming and practical applications, a single well-chosen hash function with large modulo is sufficient.

Prefix Hash Tables

An extremely useful application of string hashing is computing hash values for all prefixes of a string, which allows us to calculate the hash of any substring in O(1) time.

cpp
vector<long long> compute_prefix_hashes(string const& s) {
const int p = 31;
const int m = 1e9 + 9;

vector<long long> hash_values(s.length());
long long p_pow = 1;

for (int i = 0; i < s.length(); i++) {
hash_values[i] = ((i > 0 ? hash_values[i-1] : 0) + (s[i] - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}

return hash_values;
}

Using this prefix hash table, we can compute the hash of a substring s[l...r] in O(1) time:

cpp
long long get_substring_hash(vector<long long> const& prefix_hashes, vector<long long> const& powers, int l, int r, int m) {
if (l == 0)
return prefix_hashes[r];

long long result = prefix_hashes[r] - prefix_hashes[l-1];
if (result < 0)
result += m;

// We need to divide by p^l
result = (result * powers[l]) % m;
return result;
}

Note that we need to precompute powers of p with modular inverses to perform efficient division in modular arithmetic.

Practical Applications

1. The Rabin-Karp Algorithm

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns in strings. It can find all occurrences of a pattern in a text in O(n+m) expected time, where n is the length of the text and m is the length of the pattern.

Here's a simple implementation:

cpp
vector<int> rabin_karp(string const& text, string const& pattern) {
const int p = 31;
const int m = 1e9 + 9;
int S = pattern.length();
int T = text.length();

vector<long long> p_pow(max(S, T));
p_pow[0] = 1;
for (int i = 1; i < p_pow.size(); i++)
p_pow[i] = (p_pow[i-1] * p) % m;

// Compute pattern hash
long long pattern_hash = 0;
for (int i = 0; i < S; i++)
pattern_hash = (pattern_hash + (pattern[i] - 'a' + 1) * p_pow[i]) % m;

// Compute text prefix hashes
vector<long long> text_hashes(T + 1, 0);
for (int i = 0; i < T; i++)
text_hashes[i+1] = (text_hashes[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;

vector<int> occurrences;
// Check for matches
for (int i = 0; i + S - 1 < T; i++) {
long long curr_hash = (text_hashes[i+S] - text_hashes[i] + m) % m;
if (curr_hash == (pattern_hash * p_pow[i]) % m)
occurrences.push_back(i);
}

return occurrences;
}

Example usage:

cpp
string text = "abracadabra";
string pattern = "abra";
vector<int> matches = rabin_karp(text, pattern);

cout << "Pattern found at positions: ";
for (int pos : matches) {
cout << pos << " ";
}
// Output: Pattern found at positions: 0 7

2. Finding Duplicate Substrings

String hashing can be used to efficiently find duplicate substrings in a text. Here's an example of finding the longest duplicate substring:

cpp
string longest_duplicate_substring(string s) {
int n = s.length();
const int p = 31;
const int m = 1e9 + 9;

vector<long long> p_pow(n);
p_pow[0] = 1;
for (int i = 1; i < n; i++)
p_pow[i] = (p_pow[i-1] * p) % m;

// Binary search on the length of the substring
int left = 1, right = n;
string result = "";

while (left <= right) {
int mid = (left + right) / 2;
bool found = false;
string candidate = "";

// Use a hash map to find duplicates of length mid
unordered_map<long long, int> hash_pos;
long long curr_hash = 0;

for (int i = 0; i < n; i++) {
// Remove the leftmost character if we've reached mid characters
if (i >= mid)
curr_hash = (curr_hash - ((s[i-mid] - 'a' + 1) * p_pow[mid-1]) % m + m) % m;

// Add the current character
curr_hash = (curr_hash * p) % m;
curr_hash = (curr_hash + (s[i] - 'a' + 1)) % m;

// If we have exactly mid characters
if (i >= mid - 1) {
if (hash_pos.count(curr_hash)) {
// Confirm it's not a hash collision
string potential = s.substr(i - mid + 1, mid);
string existing = s.substr(hash_pos[curr_hash], mid);

if (potential == existing) {
found = true;
candidate = potential;
break;
}
} else {
hash_pos[curr_hash] = i - mid + 1;
}
}
}

if (found) {
if (candidate.length() > result.length())
result = candidate;
left = mid + 1;
} else {
right = mid - 1;
}
}

return result;
}

Example usage:

cpp
string s = "banana";
cout << "Longest duplicate substring: " << longest_duplicate_substring(s);
// Output: Longest duplicate substring: ana

3. String Similarity Checking

String hashing can be used to implement efficient algorithms for finding commonalities between strings, such as the longest common substring:

cpp
string longest_common_substring(string s1, string s2) {
// Try binary search on the length of the common substring
int left = 0, right = min(s1.length(), s2.length());
string result = "";

while (left <= right) {
int mid = (left + right) / 2;
string common = check_substring_of_length(s1, s2, mid);

if (!common.empty()) {
if (common.length() > result.length())
result = common;
left = mid + 1;
} else {
right = mid - 1;
}
}

return result;
}

// Helper function that checks if there's a common substring of length k
string check_substring_of_length(string s1, string s2, int k) {
if (k == 0) return "";

const int p = 31;
const int m = 1e9 + 9;

// Compute p^k-1
long long p_pow = 1;
for (int i = 0; i < k - 1; i++)
p_pow = (p_pow * p) % m;

// Compute hashes for all k-length substrings of s1
unordered_map<long long, int> hash_pos;
long long curr_hash = 0;

for (int i = 0; i < s1.length(); i++) {
if (i >= k)
curr_hash = (curr_hash - ((s1[i-k] - 'a' + 1) * p_pow) % m + m) % m;

curr_hash = (curr_hash * p) % m;
curr_hash = (curr_hash + (s1[i] - 'a' + 1)) % m;

if (i >= k - 1)
hash_pos[curr_hash] = i - k + 1;
}

// Check all k-length substrings of s2
curr_hash = 0;
for (int i = 0; i < s2.length(); i++) {
if (i >= k)
curr_hash = (curr_hash - ((s2[i-k] - 'a' + 1) * p_pow) % m + m) % m;

curr_hash = (curr_hash * p) % m;
curr_hash = (curr_hash + (s2[i] - 'a' + 1)) % m;

if (i >= k - 1 && hash_pos.count(curr_hash)) {
// Confirm it's not a hash collision
string s1_substr = s1.substr(hash_pos[curr_hash], k);
string s2_substr = s2.substr(i - k + 1, k);

if (s1_substr == s2_substr)
return s1_substr;
}
}

return "";
}

Example usage:

cpp
string s1 = "programming";
string s2 = "grammatical";
cout << "Longest common substring: " << longest_common_substring(s1, s2);
// Output: Longest common substring: gramm

Choosing Good Hash Parameters

The quality of your hash function depends greatly on the choice of base (p) and modulo (m). Here are some guidelines:

  1. Choose p to be greater than the size of your alphabet. For lowercase English letters, p = 31 works well.
  2. Choose m to be a large prime number. Common choices include 10^9 + 9, 10^9 + 7, or 2^64 (when working with unsigned 64-bit integers).
  3. To further reduce collision probability, consider using multiple hash functions with different values of p and m.

Double Hashing for Added Security

For critical applications where collision resistance is paramount, you can implement double hashing by calculating two different hash values using different parameters:

cpp
pair<long long, long long> double_hash(string const& s) {
const int p1 = 31, p2 = 53;
const int m1 = 1e9 + 9, m2 = 1e9 + 7;
long long hash1 = 0, hash2 = 0;
long long p_pow1 = 1, p_pow2 = 1;

for (char c : s) {
hash1 = (hash1 + (c - 'a' + 1) * p_pow1) % m1;
p_pow1 = (p_pow1 * p1) % m1;

hash2 = (hash2 + (c - 'a' + 1) * p_pow2) % m2;
p_pow2 = (p_pow2 * p2) % m2;
}

return {hash1, hash2};
}

This approach drastically reduces the chance of hash collisions, as two different strings would need to collide in both hash functions.

Summary

String hashing is a powerful technique that allows us to:

  1. Compare strings in constant time
  2. Efficiently search for substrings (Rabin-Karp algorithm)
  3. Find duplicate substrings
  4. Check string similarity
  5. Implement efficient string-based data structures

The key benefits are:

  • Converting O(n) string comparisons to O(1) hash comparisons
  • Enabling fast substring operations with prefix hash tables
  • Providing a foundation for many advanced string algorithms

Remember that while hash collisions are theoretically possible, with proper choice of parameters, the probability becomes negligible for most practical applications.

Practice Exercises

  1. Implement a function that finds all occurrences of a pattern in a text using the Rabin-Karp algorithm.
  2. Write a program that finds the longest repeated substring in a given string.
  3. Implement a function to check if one string is a rotation of another using string hashing.
  4. Create a data structure that can efficiently check if a string is a palindrome from any range [L, R].
  5. Implement a function that finds the lexicographically smallest rotation of a string using hashing.

Additional Resources

By understanding string hashing, you've added a powerful tool to your algorithmic toolkit that will help you solve many complex string problems efficiently!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)