PostgreSQL pg_trgm Extension

Introduction

Have you ever needed to implement a search feature that finds results even when users make typos? Or perhaps you wanted to find records with names that sound similar but are spelled differently? The pg_trgm extension in PostgreSQL is designed precisely for these scenarios.

The pg_trgm (trigram) extension provides functions and operators for determining the similarity of text based on trigram matching. A trigram is a group of three consecutive characters taken from a string. By comparing the number of trigrams that two strings share, we can determine how similar they are.

In this tutorial, we'll explore how to:

Install and enable the pg_trgm extension
Use similarity functions and operators
Create efficient indexes for text search
Implement practical applications like fuzzy search

Enabling the pg_trgm Extension

Before using pg_trgm functions, you need to enable the extension in your database:

CREATE EXTENSION pg_trgm;

You can verify the installation with:

SELECT * FROM pg_extension WHERE extname = 'pg_trgm';

Expected output:

  extname  | extowner |  extnamespace  | extrelocatable | extversion | extconfig | extcondition
-----------+----------+----------------+----------------+------------+-----------+--------------
 pg_trgm   |       10 |            2200 | f              | 1.6        |           |

Understanding Trigrams

Let's understand what trigrams are using the show_trgm function:

SELECT show_trgm('hello');

Expected output:

              show_trgm
--------------------------------------
 {"  h"," he",ell,hel,llo,"lo "}

The output shows all the trigrams for the word "hello". Note that PostgreSQL pads the beginning and end of the string with two spaces when generating trigrams, which is why you see trigrams like " h" and "lo ".

Similarity Functions

The pg_trgm extension provides several key functions:

1. similarity()

This function returns a number between 0 and 1 that indicates how similar two strings are. The higher the number, the more similar the strings:

SELECT similarity('hello', 'helo');

Expected output:

 similarity
------------
      0.5

2. word_similarity()

Similar to similarity(), but focuses on matching whole words rather than the entire text:

SELECT word_similarity('hello world', 'hello');

Expected output:

 word_similarity
----------------
              1

3. strict_word_similarity()

A stricter version of word_similarity() that requires more precise matching:

SELECT strict_word_similarity('hello world', 'hello');

Expected output:

 strict_word_similarity
-----------------------
                   0.8

Similarity Operators

The pg_trgm extension also provides several operators:

Operator	Description
`%`	Similarity operator (returns true if similarity is greater than the current threshold)
`<->`	Distance operator (returns the "distance" between strings, which is 1 - similarity)
`<<%`	Word similarity prefix operator
`%>>`	Word similarity suffix operator
`<%`	Strict word similarity prefix operator
`%>`	Strict word similarity suffix operator

Example:

SELECT 'hello' % 'helo';

Expected output:

 ?column?
----------
 t

This indicates that "hello" and "helo" are considered similar according to the current similarity threshold.

Setting the Similarity Threshold

You can adjust the similarity threshold that determines when strings are considered a match:

-- View the current threshold
SELECT show_limit();

-- Set a new threshold (0 to 1)
SELECT set_limit(0.3);

The default threshold is typically 0.3, meaning strings with 30% or more similarity will be considered matches.

Creating a Table for Examples

Let's create a sample table to demonstrate pg_trgm in action:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    description TEXT
);

INSERT INTO products (name, description) VALUES
('Smartphone X1', 'High-end smartphone with 6.5" display'),
('Smartfone X2', 'Premium smartphone with 6.7" display'),
('Smart Phone X3', 'Budget friendly smart phone with 6.1" display'),
('Laptop Pro', 'Professional laptop for developers'),
('Tablet Ultra', 'Ultra-thin tablet for professionals'),
('Tablet Mini', 'Compact tablet for reading and browsing'),
('SmartWatch', 'Fitness tracking smart watch with heart rate monitor'),
('Smart TV', '4K Smart television with AI features'),
('Phone Case', 'Protective case for smartphones'),
('Phone Charger', 'Fast charging adapter for phones');

Creating Trigram Indexes

One of the most powerful features of pg_trgm is its ability to create indexes for text similarity queries. These indexes can dramatically improve the performance of similarity searches.

There are two main types of indexes you can create using pg_trgm:

GIN Index (Generalized Inverted Index)

Optimized for static data with less frequent updates:

CREATE INDEX idx_products_name_gin ON products USING gin (name gin_trgm_ops);
CREATE INDEX idx_products_description_gin ON products USING gin (description gin_trgm_ops);

GiST Index (Generalized Search Tree)

Better for data that changes frequently but slightly slower for searches:

CREATE INDEX idx_products_name_gist ON products USING gist (name gist_trgm_ops);

Practical Examples

Now let's explore some practical applications of pg_trgm:

Fuzzy Search

Find products even when users misspell the search term:

-- Search for "smartphone" with potential misspellings
SELECT name, similarity(name, 'smertphone') AS sim
FROM products
WHERE name % 'smertphone'
ORDER BY sim DESC;

Expected output:

      name       |  sim  
-----------------+-------
 Smartphone X1   | 0.421
 Smartfone X2    | 0.368
 Smart Phone X3  | 0.333

Finding Similar Product Descriptions

-- Find products with descriptions similar to a given phrase
SELECT name, description, similarity(description, 'professional tablet') AS sim
FROM products
WHERE description % 'professional tablet'
ORDER BY sim DESC;

Expected output:

     name     |              description              |  sim  
--------------+---------------------------------------+-------
 Tablet Ultra | Ultra-thin tablet for professionals   | 0.542
 Tablet Mini  | Compact tablet for reading and browsing | 0.346

Autocomplete Implementation

You can implement autocomplete functionality using the word similarity operators:

-- Autocomplete based on prefix matching
SELECT name
FROM products
WHERE name <<% 'Smar'
ORDER BY similarity(name, 'Smar') DESC;

Expected output:

      name      
----------------
 Smartphone X1
 Smartfone X2
 Smart Phone X3
 SmartWatch
 Smart TV

Combining with Regular Expressions

For even more powerful searches, you can combine pg_trgm with regular expressions:

-- Find products where the name contains the word "phone" (case insensitive)
-- and is similar to "smart"
SELECT name
FROM products
WHERE 
    name ~* 'phone' AND 
    name % 'smart'
ORDER BY similarity(name, 'smart') DESC;

Expected output:

      name      
----------------
 Smart Phone X3
 Smartphone X1

Performance Comparison

Let's compare the performance of a regular text search versus a trigram-indexed search:

-- Without index (will perform sequential scan)
EXPLAIN ANALYZE
SELECT name FROM products WHERE name ILIKE '%smartp%';

-- With trigram index
EXPLAIN ANALYZE
SELECT name FROM products WHERE name % 'smartp';

For a large table, you'd see that the trigram index significantly improves performance by avoiding sequential scans.

How pg_trgm Works Behind the Scenes

The pg_trgm extension works by:

Breaking strings into trigrams
Counting the number of shared trigrams
Calculating similarity based on the formula:
- similarity = (number of shared trigrams) / (total unique trigrams in both strings)

Common Use Cases

Here are some real-world applications for pg_trgm:

Search engines - Implementing "Did you mean" suggestions
CRM systems - Preventing duplicate customer entries
Product catalogs - Enabling fuzzy search for products
Address verification - Finding similar addresses
Content management - Identifying similar articles or posts

Limitations and Considerations

Performance: While trigram indexes are efficient, they consume more space than regular indexes.
Language sensitivity: Trigram matching works best with Latin-based languages.
Threshold tuning: You may need to experiment with different similarity thresholds for your specific use case.
String length: Very short strings (less than 3 characters) will have fewer trigrams, which may affect similarity calculations.

Summary

The pg_trgm extension provides powerful tools for text similarity matching in PostgreSQL. By leveraging trigram-based functions and operators, you can implement fuzzy search, autocomplete, and duplicate detection features in your applications.

Key takeaways:

Trigrams are groups of three consecutive characters
Similarity is measured by the ratio of shared trigrams
GIN and GiST indexes dramatically improve performance
The similarity threshold can be adjusted to control matching precision

Additional Resources

Exercises

Create a table of customer names and implement a search that finds customers even when their names are misspelled.
Build an autocomplete feature for a product search using word similarity operators.
Compare the performance of GIN versus GiST indexes for your specific dataset.
Experiment with different similarity thresholds to find the optimal balance between precision and recall for your use case.
Implement a "potential duplicates" detector that finds records that might be duplicates based on similarity.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Enabling the pg_trgm Extension​

Understanding Trigrams​

Similarity Functions​

1. similarity()​

2. word_similarity()​

3. strict_word_similarity()​

Similarity Operators​

Setting the Similarity Threshold​

Creating a Table for Examples​

Creating Trigram Indexes​

GIN Index (Generalized Inverted Index)​

GiST Index (Generalized Search Tree)​

Practical Examples​

Fuzzy Search​

Finding Similar Product Descriptions​

Autocomplete Implementation​

Combining with Regular Expressions​

Performance Comparison​

How pg_trgm Works Behind the Scenes​

Common Use Cases​

Limitations and Considerations​

Summary​

Additional Resources​

Exercises​

Introduction

Enabling the pg_trgm Extension

Understanding Trigrams

Similarity Functions

1. similarity()

2. word_similarity()

3. strict_word_similarity()

Similarity Operators

Setting the Similarity Threshold

Creating a Table for Examples

Creating Trigram Indexes

GIN Index (Generalized Inverted Index)

GiST Index (Generalized Search Tree)

Practical Examples

Fuzzy Search

Finding Similar Product Descriptions

Autocomplete Implementation

Combining with Regular Expressions

Performance Comparison

How pg_trgm Works Behind the Scenes

Common Use Cases

Limitations and Considerations

Summary

Additional Resources

Exercises