Skip to main content

PostgreSQL Collations

Introduction

When working with text data in PostgreSQL, especially across multiple languages or regions, you'll quickly encounter the need for proper text sorting and comparison. This is where collations come in.

A collation is a set of rules that determine how text data is sorted, compared, and processed in a database. Different languages and regions have different sorting rules. For instance, in Spanish, the letter 'ñ' comes after 'n', while in English, it would be treated as a variant of 'n'.

In this guide, we'll explore how PostgreSQL implements collations, why they're important, and how to effectively use them in your applications.

Understanding Collations in PostgreSQL

What is a Collation?

A collation defines specific rules for:

  • Comparing strings (which comes first in alphabetical order)
  • Case sensitivity
  • Accent sensitivity
  • Language-specific sorting rules

PostgreSQL supports collations through the operating system's locale features and through the ICU (International Components for Unicode) library.

Why Collations Matter

Consider these examples of how sorting can differ based on collation:

LanguageSorting Order Example
Englisha, b, c, ... z
Swedisha, b, c, ... z, å, ä, ö
Germana, b, c, ... z, ä, ö, ü, ß

Using the right collation ensures that:

  1. Your application correctly sorts text in language-appropriate ways
  2. String comparisons work as expected for users from different regions
  3. Your database performs optimally with indexes on text columns

Working with Collations in PostgreSQL

Checking Available Collations

To see what collations are available in your PostgreSQL installation:

sql
SELECT * FROM pg_collation;

Or for a more readable list:

sql
SELECT collname, collprovider 
FROM pg_collation
ORDER BY collname;

Example output:

   collname    | collprovider 
---------------+--------------
C | c
POSIX | c
de_DE | c
en_US | c
en_US.utf8 | c
es_ES.utf8 | c
fr_FR.utf8 | c
...

The provider 'c' indicates collations provided by the operating system, while 'i' would indicate ICU collations.

Creating a Database with a Specific Collation

When creating a database, you can specify its default collation:

sql
CREATE DATABASE myapp
WITH ENCODING 'UTF8'
LC_COLLATE 'en_US.utf8'
LC_CTYPE 'en_US.utf8';

This sets the default collation for the entire database to English (US).

Specifying Collation for Columns

You can specify a collation when creating a table:

sql
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT COLLATE "en_US",
description TEXT
);

Or when altering an existing table:

sql
ALTER TABLE products 
ALTER COLUMN name
SET DATA TYPE TEXT COLLATE "es_ES.utf8";

Using Collations in Queries

You can explicitly use collations in your queries:

sql
-- Sort a result set using a specific collation
SELECT name FROM products
ORDER BY name COLLATE "de_DE";

-- Compare strings using a specific collation
SELECT * FROM products
WHERE name = 'café' COLLATE "fr_FR";

Practical Examples

Example 1: Creating a Multi-language Product Catalog

Imagine you're building an e-commerce site that serves customers in multiple countries:

sql
-- Create a table for products with language-specific collations
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name_en TEXT COLLATE "en_US.utf8",
name_es TEXT COLLATE "es_ES.utf8",
name_de TEXT COLLATE "de_DE.utf8",
price NUMERIC(10,2)
);

-- Insert some products
INSERT INTO products (name_en, name_es, name_de, price)
VALUES
('Coffee Maker', 'Cafetera', 'Kaffeemaschine', 49.99),
('Blender', 'Licuadora', 'Mixer', 39.99),
('Toaster', 'Tostadora', 'Toaster', 29.99);

-- Query with language-specific sorting
SELECT name_es, price FROM products
ORDER BY name_es COLLATE "es_ES.utf8";

Result:

  name_es   | price  
------------+--------
Cafetera | 49.99
Licuadora | 39.99
Tostadora | 29.99

For many applications, you might want case-insensitive searching:

sql
-- Create a table with a case-insensitive collation
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT COLLATE "en_US.utf8" UNIQUE,
email TEXT COLLATE "en_US.utf8" UNIQUE,
password_hash TEXT
);

-- Insert a user
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');

-- Search case-insensitively
SELECT * FROM users
WHERE username = 'johndoe' COLLATE "en_US.utf8";

This might not find the user because the default "en_US.utf8" collation is typically case-sensitive. Let's fix that:

sql
-- Create a case-insensitive collation (PostgreSQL 12+)
CREATE COLLATION case_insensitive (
provider = icu,
locale = 'en-US',
deterministic = false
);

-- Create the table with our custom collation
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT COLLATE case_insensitive UNIQUE,
email TEXT COLLATE case_insensitive UNIQUE,
password_hash TEXT
);

-- Now our search works regardless of case
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');

SELECT * FROM users
WHERE username = 'johndoe';

Example 3: Handling Accented Characters

Working with languages that use accents requires special attention:

sql
-- Create a table for customer names
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT COLLATE "fr_FR.utf8"
);

-- Insert names with accents
INSERT INTO customers (name) VALUES
('François'),
('Hélène'),
('André'),
('Pierre');

-- Sort respecting French rules
SELECT name FROM customers
ORDER BY name COLLATE "fr_FR.utf8";

Result (sorted according to French rules, where accents matter in secondary ordering):

  name    
----------
André
François
Hélène
Pierre

Creating Custom Collations

In PostgreSQL 10 and above, you can create custom collations:

sql
-- Create a custom ICU collation
CREATE COLLATION french_phone (
provider = icu,
locale = 'fr-u-ks-level1',
deterministic = false
);

-- This creates a French collation suitable for phone books
-- where accents are ignored for primary sorting

You can then use this collation in your tables and queries:

sql
CREATE TABLE contacts (
id SERIAL PRIMARY KEY,
name TEXT COLLATE french_phone
);

INSERT INTO contacts (name) VALUES ('Étienne'), ('Eric'), ('Émile');

SELECT name FROM contacts ORDER BY name;

The result would sort these names together as if they all started with 'E'.

Performance Considerations

Indexes and Collations

When you create an index on a text column, the index uses the column's collation:

sql
-- Create an index that uses the column's collation
CREATE INDEX idx_products_name ON products(name);

If you frequently query with a different collation, consider creating a dedicated index:

sql
-- Create an index with a specific collation
CREATE INDEX idx_products_name_german ON products(name COLLATE "de_DE");

Collation Performance Impact

Collations can impact performance in several ways:

  1. Complex collations (like those for East Asian languages) may be slower
  2. Non-deterministic collations (those that consider multiple characters as equivalent) can't use certain optimization techniques
  3. Changing collations frequently in queries can prevent index usage

Common Issues and Solutions

"Collation mismatch" Errors

If you see an error like:

ERROR: collation mismatch between explicit collation "en_US" and explicit collation "fr_FR"

It means you're trying to compare strings with incompatible collations. Fix it by explicitly choosing one collation:

sql
-- Instead of this (which will error):
SELECT * FROM table1 JOIN table2
ON table1.col1 = table2.col2;

-- Do this (specify which collation to use):
SELECT * FROM table1 JOIN table2
ON table1.col1 = table2.col2 COLLATE "en_US";

Upgrading PostgreSQL and Collation Versions

When upgrading PostgreSQL, collation definitions might change. Run:

sql
-- Check for collation version mismatches
SELECT pg_collation_actual_version(collname), collname
FROM pg_collation
WHERE collprovider = 'i'; -- ICU collations

-- If needed, refresh collations
ALTER COLLATION "en_US" REFRESH VERSION;

Summary

Collations are a crucial feature in PostgreSQL that enable proper handling of text data across different languages and regions. By understanding and correctly implementing collations, you can ensure that your application:

  • Sorts text data in a culturally appropriate way
  • Performs text comparisons that match user expectations
  • Optimizes database performance for text operations

Key points to remember:

  1. Collations control sorting, comparison, and case sensitivity
  2. PostgreSQL supports both OS-provided and ICU collations
  3. You can specify collations at the database, table, column, or query level
  4. Custom collations can be created for specialized needs
  5. The right collations improve both functionality and performance

Additional Resources

Exercises

  1. Create a database with a default collation for your preferred language.
  2. Build a table that stores product names in three different languages, using appropriate collations for each.
  3. Create a custom case-insensitive, accent-insensitive collation and use it for a search function.
  4. Compare the performance of queries using different collations with a large dataset.
  5. Write a function that properly sorts a list of names according to the rules of a specific language.


If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)