Levenshtein Distance

Introduction

Levenshtein distance, also known as edit distance, is a string metric for measuring the difference between two sequences. It calculates the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one string into another.

Named after Vladimir Levenshtein, who developed it in 1965, this algorithm has become fundamental in various applications including:

Spell checking and correction
DNA sequence analysis
Plagiarism detection
Fuzzy string matching
Speech recognition

In this tutorial, we'll explore the concept of Levenshtein distance, understand how it works, and implement it in code.

Understanding Levenshtein Distance

The Levenshtein distance between two strings a and b is the minimum number of operations needed to transform string a into string b, where the allowed operations are:

Insertion: Add a character to string a
Deletion: Remove a character from string a
Substitution: Replace a character in string a with a different character

Let's look at a simple example:

Consider the strings "kitten" and "sitting":

Replace 'k' with 's': "sitten"
Replace 'e' with 'i': "sittin"
Insert 'g' at the end: "sitting"

The Levenshtein distance between "kitten" and "sitting" is 3, as we need at least 3 operations to transform one into the other.

The Algorithm

The Levenshtein algorithm uses dynamic programming to calculate the minimum edit distance. We'll build a matrix where each cell [i][j] represents the Levenshtein distance between the first i characters of string a and the first j characters of string b.

The algorithm follows these steps:

Create a matrix of size (len(a)+1) x (len(b)+1)
Initialize the first row and column with incremental values (0, 1, 2, ...)
For each cell [i][j] in the matrix (excluding the first row and column):
- If the characters a[i-1] and b[j-1] match, the cost is 0; otherwise, the cost is 1
- Calculate the minimum of:
  - The value above plus 1 (deletion)
  - The value to the left plus 1 (insertion)
  - The value diagonally above-left plus the cost (substitution)
The bottom-right cell contains the Levenshtein distance between the two strings

Let's visualize this with a matrix for "kitten" and "sitting":

The final value in the bottom-right corner (3) is the Levenshtein distance.

Implementation in JavaScript

Let's implement the Levenshtein distance algorithm in JavaScript:

function levenshteinDistance(str1, str2) {
  // Create a matrix of size (str1.length+1) x (str2.length+1)
  const dp = Array(str1.length + 1)
    .fill(null)
    .map(() => Array(str2.length + 1).fill(null));

  // Initialize first row and column
  for (let i = 0; i <= str1.length; i++) {
    dp[i][0] = i;
  }
  for (let j = 0; j <= str2.length; j++) {
    dp[0][j] = j;
  }

  // Fill the matrix
  for (let i = 1; i <= str1.length; i++) {
    for (let j = 1; j <= str2.length; j++) {
      // If characters match, cost is 0, otherwise 1
      const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
      
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,      // deletion
        dp[i][j - 1] + 1,      // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }

  // Return the bottom-right cell
  return dp[str1.length][str2.length];
}

// Example usage
console.log(levenshteinDistance("kitten", "sitting"));  // Output: 3
console.log(levenshteinDistance("hello", "hallo"));    // Output: 1
console.log(levenshteinDistance("algorithm", "logarithm")); // Output: 3

Let's also implement it in Python for comparison:

def levenshtein_distance(str1, str2):
    # Create a matrix of size (len(str1)+1) x (len(str2)+1)
    dp = [[0 for _ in range(len(str2) + 1)] for _ in range(len(str1) + 1)]
    
    # Initialize first row and column
    for i in range(len(str1) + 1):
        dp[i][0] = i
    for j in range(len(str2) + 1):
        dp[0][j] = j
        
    # Fill the matrix
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            # If characters match, cost is 0, otherwise 1
            cost = 0 if str1[i-1] == str2[j-1] else 1
            
            dp[i][j] = min(
                dp[i-1][j] + 1,      # deletion
                dp[i][j-1] + 1,      # insertion
                dp[i-1][j-1] + cost  # substitution
            )
    
    # Return the bottom-right cell
    return dp[len(str1)][len(str2)]

# Example usage
print(levenshtein_distance("kitten", "sitting"))  # Output: 3
print(levenshtein_distance("hello", "hallo"))    # Output: 1
print(levenshtein_distance("algorithm", "logarithm")) # Output: 3

Time and Space Complexity

Time Complexity: O(m × n), where m and n are the lengths of the two strings. This is because we need to fill a matrix of size (m+1) × (n+1).
Space Complexity: O(m × n) for the same reason.

Optimizations

Space Optimization

Since we only need the current and previous rows to calculate the new values, we can optimize the space complexity to O(min(m, n)):

function levenshteinDistanceOptimized(str1, str2) {
  // Make sure str1 is the shorter string for space optimization
  if (str1.length > str2.length) {
    [str1, str2] = [str2, str1];
  }

  let prevRow = Array(str1.length + 1).fill(0);
  let currRow = Array(str1.length + 1).fill(0);
  
  // Initialize the first row
  for (let i = 0; i <= str1.length; i++) {
    prevRow[i] = i;
  }
  
  // Fill the matrix one row at a time
  for (let j = 1; j <= str2.length; j++) {
    currRow[0] = j;
    
    for (let i = 1; i <= str1.length; i++) {
      const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
      currRow[i] = Math.min(
        prevRow[i] + 1,      // deletion
        currRow[i - 1] + 1,  // insertion
        prevRow[i - 1] + cost // substitution
      );
    }
    
    // Swap rows for next iteration
    [prevRow, currRow] = [currRow, prevRow];
  }
  
  return prevRow[str1.length];
}

Real-world Applications

1. Spell Checker

A spell checker can use Levenshtein distance to suggest corrections for misspelled words. Here's a simple example:

function spellChecker(word, dictionary, maxDistance = 2) {
  const suggestions = [];
  
  for (const dictWord of dictionary) {
    const distance = levenshteinDistance(word, dictWord);
    if (distance <= maxDistance) {
      suggestions.push({ word: dictWord, distance });
    }
  }
  
  return suggestions.sort((a, b) => a.distance - b.distance);
}

// Example
const dictionary = ["apple", "banana", "orange", "pear", "apricot", "application"];
console.log(spellChecker("appel", dictionary));
// Output: [{ word: "apple", distance: 1 }]

2. DNA Sequence Analysis

Biologists use Levenshtein distance to measure the similarity between DNA sequences:

function dnaSequenceSimilarity(seq1, seq2) {
  const distance = levenshteinDistance(seq1, seq2);
  // Calculate similarity as a percentage
  const maxLength = Math.max(seq1.length, seq2.length);
  const similarity = ((maxLength - distance) / maxLength) * 100;
  return similarity.toFixed(2) + '%';
}

// Example
const dna1 = "ACGTACGT";
const dna2 = "ACGTTCGT";
console.log(dnaSequenceSimilarity(dna1, dna2));
// Output: "87.50%"

3. Fuzzy Search

Implement a simple fuzzy search system that finds entries in a database that are similar to a query:

function fuzzySearch(query, items, threshold = 3) {
  return items.filter(item => 
    levenshteinDistance(query.toLowerCase(), item.toLowerCase()) <= threshold
  );
}

// Example
const fruits = ["apple", "banana", "blueberry", "cherry", "dragonfruit", "grape"];
console.log(fuzzySearch("banan", fruits));
// Output: ["banana"]

Summary

The Levenshtein distance algorithm is a powerful tool for measuring the similarity between strings. It works by calculating the minimum number of single-character edits required to transform one string into another.

Key points to remember:

The algorithm uses dynamic programming to build a solution incrementally
The three allowed operations are insertion, deletion, and substitution
The time complexity is O(m × n) and space complexity is O(m × n), but we can optimize the space to O(min(m, n))
It has numerous real-world applications including spell checking, DNA analysis, and fuzzy string matching

Exercises

Implement the Levenshtein distance algorithm in your preferred programming language.
Extend the algorithm to output not just the distance but the actual sequence of operations needed to transform one string into another.
Create a more sophisticated spell checker that uses the Levenshtein distance.
Compare the performance of the standard Levenshtein algorithm with the space-optimized version for very long strings.
Modify the algorithm to assign different costs to different operations (e.g., make substitution cost 2, while insertion and deletion cost 1).

Additional Resources

Understanding the Levenshtein distance algorithm is a solid foundation for exploring more complex string matching and comparison techniques. It's a great example of how dynamic programming can efficiently solve problems that would otherwise require exponential time to compute.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Levenshtein Distance​

The Algorithm​

Implementation in JavaScript​

Time and Space Complexity​

Optimizations​

Space Optimization​

Real-world Applications​

1. Spell Checker​

2. DNA Sequence Analysis​

3. Fuzzy Search​

Summary​

Exercises​

Additional Resources​