PostgreSQL Collations
Introduction
When working with text data in PostgreSQL, especially across multiple languages or regions, you'll quickly encounter the need for proper text sorting and comparison. This is where collations come in.
A collation is a set of rules that determine how text data is sorted, compared, and processed in a database. Different languages and regions have different sorting rules. For instance, in Spanish, the letter 'ñ' comes after 'n', while in English, it would be treated as a variant of 'n'.
In this guide, we'll explore how PostgreSQL implements collations, why they're important, and how to effectively use them in your applications.
Understanding Collations in PostgreSQL
What is a Collation?
A collation defines specific rules for:
- Comparing strings (which comes first in alphabetical order)
- Case sensitivity
- Accent sensitivity
- Language-specific sorting rules
PostgreSQL supports collations through the operating system's locale features and through the ICU (International Components for Unicode) library.
Why Collations Matter
Consider these examples of how sorting can differ based on collation:
Language | Sorting Order Example |
---|---|
English | a, b, c, ... z |
Swedish | a, b, c, ... z, å, ä, ö |
German | a, b, c, ... z, ä, ö, ü, ß |
Using the right collation ensures that:
- Your application correctly sorts text in language-appropriate ways
- String comparisons work as expected for users from different regions
- Your database performs optimally with indexes on text columns
Working with Collations in PostgreSQL
Checking Available Collations
To see what collations are available in your PostgreSQL installation:
SELECT * FROM pg_collation;
Or for a more readable list:
SELECT collname, collprovider
FROM pg_collation
ORDER BY collname;
Example output:
collname | collprovider
---------------+--------------
C | c
POSIX | c
de_DE | c
en_US | c
en_US.utf8 | c
es_ES.utf8 | c
fr_FR.utf8 | c
...
The provider 'c' indicates collations provided by the operating system, while 'i' would indicate ICU collations.
Creating a Database with a Specific Collation
When creating a database, you can specify its default collation:
CREATE DATABASE myapp
WITH ENCODING 'UTF8'
LC_COLLATE 'en_US.utf8'
LC_CTYPE 'en_US.utf8';
This sets the default collation for the entire database to English (US).
Specifying Collation for Columns
You can specify a collation when creating a table:
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT COLLATE "en_US",
description TEXT
);
Or when altering an existing table:
ALTER TABLE products
ALTER COLUMN name
SET DATA TYPE TEXT COLLATE "es_ES.utf8";
Using Collations in Queries
You can explicitly use collations in your queries:
-- Sort a result set using a specific collation
SELECT name FROM products
ORDER BY name COLLATE "de_DE";
-- Compare strings using a specific collation
SELECT * FROM products
WHERE name = 'café' COLLATE "fr_FR";
Practical Examples
Example 1: Creating a Multi-language Product Catalog
Imagine you're building an e-commerce site that serves customers in multiple countries:
-- Create a table for products with language-specific collations
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name_en TEXT COLLATE "en_US.utf8",
name_es TEXT COLLATE "es_ES.utf8",
name_de TEXT COLLATE "de_DE.utf8",
price NUMERIC(10,2)
);
-- Insert some products
INSERT INTO products (name_en, name_es, name_de, price)
VALUES
('Coffee Maker', 'Cafetera', 'Kaffeemaschine', 49.99),
('Blender', 'Licuadora', 'Mixer', 39.99),
('Toaster', 'Tostadora', 'Toaster', 29.99);
-- Query with language-specific sorting
SELECT name_es, price FROM products
ORDER BY name_es COLLATE "es_ES.utf8";
Result:
name_es | price
------------+--------
Cafetera | 49.99
Licuadora | 39.99
Tostadora | 29.99
Example 2: Case-Insensitive Search
For many applications, you might want case-insensitive searching:
-- Create a table with a case-insensitive collation
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT COLLATE "en_US.utf8" UNIQUE,
email TEXT COLLATE "en_US.utf8" UNIQUE,
password_hash TEXT
);
-- Insert a user
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');
-- Search case-insensitively
SELECT * FROM users
WHERE username = 'johndoe' COLLATE "en_US.utf8";
This might not find the user because the default "en_US.utf8" collation is typically case-sensitive. Let's fix that:
-- Create a case-insensitive collation (PostgreSQL 12+)
CREATE COLLATION case_insensitive (
provider = icu,
locale = 'en-US',
deterministic = false
);
-- Create the table with our custom collation
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT COLLATE case_insensitive UNIQUE,
email TEXT COLLATE case_insensitive UNIQUE,
password_hash TEXT
);
-- Now our search works regardless of case
INSERT INTO users (username, email, password_hash)
VALUES ('JohnDoe', '[email protected]', 'hash123');
SELECT * FROM users
WHERE username = 'johndoe';
Example 3: Handling Accented Characters
Working with languages that use accents requires special attention:
-- Create a table for customer names
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT COLLATE "fr_FR.utf8"
);
-- Insert names with accents
INSERT INTO customers (name) VALUES
('François'),
('Hélène'),
('André'),
('Pierre');
-- Sort respecting French rules
SELECT name FROM customers
ORDER BY name COLLATE "fr_FR.utf8";
Result (sorted according to French rules, where accents matter in secondary ordering):
name
----------
André
François
Hélène
Pierre
Creating Custom Collations
In PostgreSQL 10 and above, you can create custom collations:
-- Create a custom ICU collation
CREATE COLLATION french_phone (
provider = icu,
locale = 'fr-u-ks-level1',
deterministic = false
);
-- This creates a French collation suitable for phone books
-- where accents are ignored for primary sorting
You can then use this collation in your tables and queries:
CREATE TABLE contacts (
id SERIAL PRIMARY KEY,
name TEXT COLLATE french_phone
);
INSERT INTO contacts (name) VALUES ('Étienne'), ('Eric'), ('Émile');
SELECT name FROM contacts ORDER BY name;
The result would sort these names together as if they all started with 'E'.
Performance Considerations
Indexes and Collations
When you create an index on a text column, the index uses the column's collation:
-- Create an index that uses the column's collation
CREATE INDEX idx_products_name ON products(name);
If you frequently query with a different collation, consider creating a dedicated index:
-- Create an index with a specific collation
CREATE INDEX idx_products_name_german ON products(name COLLATE "de_DE");
Collation Performance Impact
Collations can impact performance in several ways:
- Complex collations (like those for East Asian languages) may be slower
- Non-deterministic collations (those that consider multiple characters as equivalent) can't use certain optimization techniques
- Changing collations frequently in queries can prevent index usage
Common Issues and Solutions
"Collation mismatch" Errors
If you see an error like:
ERROR: collation mismatch between explicit collation "en_US" and explicit collation "fr_FR"
It means you're trying to compare strings with incompatible collations. Fix it by explicitly choosing one collation:
-- Instead of this (which will error):
SELECT * FROM table1 JOIN table2
ON table1.col1 = table2.col2;
-- Do this (specify which collation to use):
SELECT * FROM table1 JOIN table2
ON table1.col1 = table2.col2 COLLATE "en_US";
Upgrading PostgreSQL and Collation Versions
When upgrading PostgreSQL, collation definitions might change. Run:
-- Check for collation version mismatches
SELECT pg_collation_actual_version(collname), collname
FROM pg_collation
WHERE collprovider = 'i'; -- ICU collations
-- If needed, refresh collations
ALTER COLLATION "en_US" REFRESH VERSION;
Summary
Collations are a crucial feature in PostgreSQL that enable proper handling of text data across different languages and regions. By understanding and correctly implementing collations, you can ensure that your application:
- Sorts text data in a culturally appropriate way
- Performs text comparisons that match user expectations
- Optimizes database performance for text operations
Key points to remember:
- Collations control sorting, comparison, and case sensitivity
- PostgreSQL supports both OS-provided and ICU collations
- You can specify collations at the database, table, column, or query level
- Custom collations can be created for specialized needs
- The right collations improve both functionality and performance
Additional Resources
Exercises
- Create a database with a default collation for your preferred language.
- Build a table that stores product names in three different languages, using appropriate collations for each.
- Create a custom case-insensitive, accent-insensitive collation and use it for a search function.
- Compare the performance of queries using different collations with a large dataset.
- Write a function that properly sorts a list of names according to the rules of a specific language.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)