PostgreSQL Collations

Introduction

When working with text data in PostgreSQL, especially across multiple languages or regions, you'll quickly encounter the need for proper text sorting and comparison. This is where collations come in.

A collation is a set of rules that determine how text data is sorted, compared, and processed in a database. Different languages and regions have different sorting rules. For instance, in Spanish, the letter 'ñ' comes after 'n', while in English, it would be treated as a variant of 'n'.

In this guide, we'll explore how PostgreSQL implements collations, why they're important, and how to effectively use them in your applications.

Understanding Collations in PostgreSQL

What is a Collation?

A collation defines specific rules for:

Comparing strings (which comes first in alphabetical order)
Case sensitivity
Accent sensitivity
Language-specific sorting rules

PostgreSQL supports collations through the operating system's locale features and through the ICU (International Components for Unicode) library.

Why Collations Matter

Consider these examples of how sorting can differ based on collation:

Language	Sorting Order Example
English	a, b, c, ... z
Swedish	a, b, c, ... z, å, ä, ö
German	a, b, c, ... z, ä, ö, ü, ß

Using the right collation ensures that:

Your application correctly sorts text in language-appropriate ways
String comparisons work as expected for users from different regions
Your database performs optimally with indexes on text columns

Working with Collations in PostgreSQL

Checking Available Collations

To see what collations are available in your PostgreSQL installation:

SELECT * FROM pg_collation;

Or for a more readable list:

SELECT collname, collprovider 
FROM pg_collation 
ORDER BY collname;

Example output:

   collname    | collprovider 
---------------+--------------
 C             | c
 POSIX         | c
 de_DE         | c
 en_US         | c
 en_US.utf8    | c
 es_ES.utf8    | c
 fr_FR.utf8    | c
 ...

The provider 'c' indicates collations provided by the operating system, while 'i' would indicate ICU collations.

Creating a Database with a Specific Collation

When creating a database, you can specify its default collation:

CREATE DATABASE myapp
    WITH ENCODING 'UTF8'
    LC_COLLATE 'en_US.utf8'
    LC_CTYPE 'en_US.utf8';

This sets the default collation for the entire database to English (US).

Specifying Collation for Columns

You can specify a collation when creating a table:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT COLLATE "en_US",
    description TEXT
);

Or when altering an existing table:

ALTER TABLE products 
ALTER COLUMN name 
SET DATA TYPE TEXT COLLATE "es_ES.utf8";

Using Collations in Queries

You can explicitly use collations in your queries:

-- Sort a result set using a specific collation
SELECT name FROM products
ORDER BY name COLLATE "de_DE";

-- Compare strings using a specific collation
SELECT * FROM products
WHERE name = 'café' COLLATE "fr_FR";

Practical Examples

Example 1: Creating a Multi-language Product Catalog

Imagine you're building an e-commerce site that serves customers in multiple countries:

-- Create a table for products with language-specific collations
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name_en TEXT COLLATE "en_US.utf8",
    name_es TEXT COLLATE "es_ES.utf8",
    name_de TEXT COLLATE "de_DE.utf8",
    price NUMERIC(10,2)
);

-- Insert some products
INSERT INTO products (name_en, name_es, name_de, price)
VALUES 
    ('Coffee Maker', 'Cafetera', 'Kaffeemaschine', 49.99),
    ('Blender', 'Licuadora', 'Mixer', 39.99),
    ('Toaster', 'Tostadora', 'Toaster', 29.99);

-- Query with language-specific sorting
SELECT name_es, price FROM products
ORDER BY name_es COLLATE "es_ES.utf8";

Result:

  name_es   | price  
------------+--------
 Cafetera   |  49.99
 Licuadora  |  39.99
 Tostadora  |  29.99

Example 2: Case-Insensitive Search

For many applications, you might want case-insensitive searching:

-- Create a table with a case-insensitive collation
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username TEXT COLLATE "en_US.utf8" UNIQUE,
    email TEXT COLLATE "en_US.utf8" UNIQUE,
    password_hash TEXT
);

-- Insert a user
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');

-- Search case-insensitively
SELECT * FROM users 
WHERE username = 'johndoe' COLLATE "en_US.utf8";

This might not find the user because the default "en_US.utf8" collation is typically case-sensitive. Let's fix that:

-- Create a case-insensitive collation (PostgreSQL 12+)
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'en-US',
    deterministic = false
);

-- Create the table with our custom collation
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username TEXT COLLATE case_insensitive UNIQUE,
    email TEXT COLLATE case_insensitive UNIQUE,
    password_hash TEXT
);

-- Now our search works regardless of case
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');

SELECT * FROM users 
WHERE username = 'johndoe';

Example 3: Handling Accented Characters

Working with languages that use accents requires special attention:

-- Create a table for customer names
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name TEXT COLLATE "fr_FR.utf8"
);

-- Insert names with accents
INSERT INTO customers (name) VALUES
    ('François'),
    ('Hélène'),
    ('André'),
    ('Pierre');

-- Sort respecting French rules
SELECT name FROM customers
ORDER BY name COLLATE "fr_FR.utf8";

Result (sorted according to French rules, where accents matter in secondary ordering):

  name    
----------
 André
 François
 Hélène
 Pierre

Creating Custom Collations

In PostgreSQL 10 and above, you can create custom collations:

-- Create a custom ICU collation
CREATE COLLATION french_phone (
    provider = icu,
    locale = 'fr-u-ks-level1',
    deterministic = false
);

-- This creates a French collation suitable for phone books
-- where accents are ignored for primary sorting

You can then use this collation in your tables and queries:

CREATE TABLE contacts (
    id SERIAL PRIMARY KEY,
    name TEXT COLLATE french_phone
);

INSERT INTO contacts (name) VALUES ('Étienne'), ('Eric'), ('Émile');

SELECT name FROM contacts ORDER BY name;

The result would sort these names together as if they all started with 'E'.

Performance Considerations

Indexes and Collations

When you create an index on a text column, the index uses the column's collation:

-- Create an index that uses the column's collation
CREATE INDEX idx_products_name ON products(name);

If you frequently query with a different collation, consider creating a dedicated index:

-- Create an index with a specific collation
CREATE INDEX idx_products_name_german ON products(name COLLATE "de_DE");

Collation Performance Impact

Collations can impact performance in several ways:

Complex collations (like those for East Asian languages) may be slower
Non-deterministic collations (those that consider multiple characters as equivalent) can't use certain optimization techniques
Changing collations frequently in queries can prevent index usage

Common Issues and Solutions

"Collation mismatch" Errors

If you see an error like:

ERROR: collation mismatch between explicit collation "en_US" and explicit collation "fr_FR"

It means you're trying to compare strings with incompatible collations. Fix it by explicitly choosing one collation:

-- Instead of this (which will error):
SELECT * FROM table1 JOIN table2 
ON table1.col1 = table2.col2;

-- Do this (specify which collation to use):
SELECT * FROM table1 JOIN table2 
ON table1.col1 = table2.col2 COLLATE "en_US";

Upgrading PostgreSQL and Collation Versions

When upgrading PostgreSQL, collation definitions might change. Run:

-- Check for collation version mismatches
SELECT pg_collation_actual_version(collname), collname 
FROM pg_collation 
WHERE collprovider = 'i'; -- ICU collations

-- If needed, refresh collations
ALTER COLLATION "en_US" REFRESH VERSION;

Summary

Collations are a crucial feature in PostgreSQL that enable proper handling of text data across different languages and regions. By understanding and correctly implementing collations, you can ensure that your application:

Sorts text data in a culturally appropriate way
Performs text comparisons that match user expectations
Optimizes database performance for text operations

Key points to remember:

Collations control sorting, comparison, and case sensitivity
PostgreSQL supports both OS-provided and ICU collations
You can specify collations at the database, table, column, or query level
Custom collations can be created for specialized needs
The right collations improve both functionality and performance

Additional Resources

Exercises

Create a database with a default collation for your preferred language.
Build a table that stores product names in three different languages, using appropriate collations for each.
Create a custom case-insensitive, accent-insensitive collation and use it for a search function.
Compare the performance of queries using different collations with a large dataset.
Write a function that properly sorts a list of names according to the rules of a specific language.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Collations in PostgreSQL​

What is a Collation?​

Why Collations Matter​

Working with Collations in PostgreSQL​

Checking Available Collations​

Creating a Database with a Specific Collation​

Specifying Collation for Columns​

Using Collations in Queries​

Practical Examples​

Example 1: Creating a Multi-language Product Catalog​

Example 2: Case-Insensitive Search​

Example 3: Handling Accented Characters​

Creating Custom Collations​

Performance Considerations​

Indexes and Collations​

Collation Performance Impact​

Common Issues and Solutions​

"Collation mismatch" Errors​

Upgrading PostgreSQL and Collation Versions​

Summary​

Additional Resources​

Exercises​