Levenshtein Distance
Introduction
Levenshtein distance, also known as edit distance, is a string metric for measuring the difference between two sequences. It calculates the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one string into another.
Named after Vladimir Levenshtein, who developed it in 1965, this algorithm has become fundamental in various applications including:
- Spell checking and correction
- DNA sequence analysis
- Plagiarism detection
- Fuzzy string matching
- Speech recognition
In this tutorial, we'll explore the concept of Levenshtein distance, understand how it works, and implement it in code.
Understanding Levenshtein Distance
The Levenshtein distance between two strings a
and b
is the minimum number of operations needed to transform string a
into string b
, where the allowed operations are:
- Insertion: Add a character to string
a
- Deletion: Remove a character from string
a
- Substitution: Replace a character in string
a
with a different character
Let's look at a simple example:
Consider the strings "kitten"
and "sitting"
:
- Replace 'k' with 's': "sitten"
- Replace 'e' with 'i': "sittin"
- Insert 'g' at the end: "sitting"
The Levenshtein distance between "kitten" and "sitting" is 3, as we need at least 3 operations to transform one into the other.
The Algorithm
The Levenshtein algorithm uses dynamic programming to calculate the minimum edit distance. We'll build a matrix where each cell [i][j]
represents the Levenshtein distance between the first i
characters of string a
and the first j
characters of string b
.
The algorithm follows these steps:
- Create a matrix of size
(len(a)+1) x (len(b)+1)
- Initialize the first row and column with incremental values (0, 1, 2, ...)
- For each cell
[i][j]
in the matrix (excluding the first row and column):- If the characters
a[i-1]
andb[j-1]
match, the cost is 0; otherwise, the cost is 1 - Calculate the minimum of:
- The value above plus 1 (deletion)
- The value to the left plus 1 (insertion)
- The value diagonally above-left plus the cost (substitution)
- If the characters
- The bottom-right cell contains the Levenshtein distance between the two strings
Let's visualize this with a matrix for "kitten" and "sitting":
The final value in the bottom-right corner (3) is the Levenshtein distance.
Implementation in JavaScript
Let's implement the Levenshtein distance algorithm in JavaScript:
function levenshteinDistance(str1, str2) {
// Create a matrix of size (str1.length+1) x (str2.length+1)
const dp = Array(str1.length + 1)
.fill(null)
.map(() => Array(str2.length + 1).fill(null));
// Initialize first row and column
for (let i = 0; i <= str1.length; i++) {
dp[i][0] = i;
}
for (let j = 0; j <= str2.length; j++) {
dp[0][j] = j;
}
// Fill the matrix
for (let i = 1; i <= str1.length; i++) {
for (let j = 1; j <= str2.length; j++) {
// If characters match, cost is 0, otherwise 1
const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
dp[i][j] = Math.min(
dp[i - 1][j] + 1, // deletion
dp[i][j - 1] + 1, // insertion
dp[i - 1][j - 1] + cost // substitution
);
}
}
// Return the bottom-right cell
return dp[str1.length][str2.length];
}
// Example usage
console.log(levenshteinDistance("kitten", "sitting")); // Output: 3
console.log(levenshteinDistance("hello", "hallo")); // Output: 1
console.log(levenshteinDistance("algorithm", "logarithm")); // Output: 3
Let's also implement it in Python for comparison:
def levenshtein_distance(str1, str2):
# Create a matrix of size (len(str1)+1) x (len(str2)+1)
dp = [[0 for _ in range(len(str2) + 1)] for _ in range(len(str1) + 1)]
# Initialize first row and column
for i in range(len(str1) + 1):
dp[i][0] = i
for j in range(len(str2) + 1):
dp[0][j] = j
# Fill the matrix
for i in range(1, len(str1) + 1):
for j in range(1, len(str2) + 1):
# If characters match, cost is 0, otherwise 1
cost = 0 if str1[i-1] == str2[j-1] else 1
dp[i][j] = min(
dp[i-1][j] + 1, # deletion
dp[i][j-1] + 1, # insertion
dp[i-1][j-1] + cost # substitution
)
# Return the bottom-right cell
return dp[len(str1)][len(str2)]
# Example usage
print(levenshtein_distance("kitten", "sitting")) # Output: 3
print(levenshtein_distance("hello", "hallo")) # Output: 1
print(levenshtein_distance("algorithm", "logarithm")) # Output: 3
Time and Space Complexity
- Time Complexity: O(m × n), where m and n are the lengths of the two strings. This is because we need to fill a matrix of size (m+1) × (n+1).
- Space Complexity: O(m × n) for the same reason.
Optimizations
Space Optimization
Since we only need the current and previous rows to calculate the new values, we can optimize the space complexity to O(min(m, n)):
function levenshteinDistanceOptimized(str1, str2) {
// Make sure str1 is the shorter string for space optimization
if (str1.length > str2.length) {
[str1, str2] = [str2, str1];
}
let prevRow = Array(str1.length + 1).fill(0);
let currRow = Array(str1.length + 1).fill(0);
// Initialize the first row
for (let i = 0; i <= str1.length; i++) {
prevRow[i] = i;
}
// Fill the matrix one row at a time
for (let j = 1; j <= str2.length; j++) {
currRow[0] = j;
for (let i = 1; i <= str1.length; i++) {
const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
currRow[i] = Math.min(
prevRow[i] + 1, // deletion
currRow[i - 1] + 1, // insertion
prevRow[i - 1] + cost // substitution
);
}
// Swap rows for next iteration
[prevRow, currRow] = [currRow, prevRow];
}
return prevRow[str1.length];
}
Real-world Applications
1. Spell Checker
A spell checker can use Levenshtein distance to suggest corrections for misspelled words. Here's a simple example:
function spellChecker(word, dictionary, maxDistance = 2) {
const suggestions = [];
for (const dictWord of dictionary) {
const distance = levenshteinDistance(word, dictWord);
if (distance <= maxDistance) {
suggestions.push({ word: dictWord, distance });
}
}
return suggestions.sort((a, b) => a.distance - b.distance);
}
// Example
const dictionary = ["apple", "banana", "orange", "pear", "apricot", "application"];
console.log(spellChecker("appel", dictionary));
// Output: [{ word: "apple", distance: 1 }]
2. DNA Sequence Analysis
Biologists use Levenshtein distance to measure the similarity between DNA sequences:
function dnaSequenceSimilarity(seq1, seq2) {
const distance = levenshteinDistance(seq1, seq2);
// Calculate similarity as a percentage
const maxLength = Math.max(seq1.length, seq2.length);
const similarity = ((maxLength - distance) / maxLength) * 100;
return similarity.toFixed(2) + '%';
}
// Example
const dna1 = "ACGTACGT";
const dna2 = "ACGTTCGT";
console.log(dnaSequenceSimilarity(dna1, dna2));
// Output: "87.50%"
3. Fuzzy Search
Implement a simple fuzzy search system that finds entries in a database that are similar to a query:
function fuzzySearch(query, items, threshold = 3) {
return items.filter(item =>
levenshteinDistance(query.toLowerCase(), item.toLowerCase()) <= threshold
);
}
// Example
const fruits = ["apple", "banana", "blueberry", "cherry", "dragonfruit", "grape"];
console.log(fuzzySearch("banan", fruits));
// Output: ["banana"]
Summary
The Levenshtein distance algorithm is a powerful tool for measuring the similarity between strings. It works by calculating the minimum number of single-character edits required to transform one string into another.
Key points to remember:
- The algorithm uses dynamic programming to build a solution incrementally
- The three allowed operations are insertion, deletion, and substitution
- The time complexity is O(m × n) and space complexity is O(m × n), but we can optimize the space to O(min(m, n))
- It has numerous real-world applications including spell checking, DNA analysis, and fuzzy string matching
Exercises
- Implement the Levenshtein distance algorithm in your preferred programming language.
- Extend the algorithm to output not just the distance but the actual sequence of operations needed to transform one string into another.
- Create a more sophisticated spell checker that uses the Levenshtein distance.
- Compare the performance of the standard Levenshtein algorithm with the space-optimized version for very long strings.
- Modify the algorithm to assign different costs to different operations (e.g., make substitution cost 2, while insertion and deletion cost 1).
Additional Resources
Understanding the Levenshtein distance algorithm is a solid foundation for exploring more complex string matching and comparison techniques. It's a great example of how dynamic programming can efficiently solve problems that would otherwise require exponential time to compute.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)