Skip to main content

MySQL Collations

Introduction

When working with databases that store text in multiple languages, you might encounter unexpected behaviors when comparing or sorting strings. For example, in some languages, certain characters are treated as equivalent during sorting, while in others, they're considered distinct. MySQL's collation system helps manage these language-specific rules for string comparison and sorting.

In this tutorial, you'll learn:

  • What MySQL collations are and why they're important
  • How to view and set collations at different levels
  • Common collation types and their use cases
  • How to choose the right collation for your application
  • Troubleshooting common collation-related issues

What Are MySQL Collations?

A collation is a set of rules that defines how character strings are compared and sorted in a database. It determines whether 'a' equals 'A', where 'ö' appears in sort order, and other language-specific behaviors.

Collations in MySQL are tied to character sets. A character set defines which symbols are available, while a collation defines how these symbols compare to each other. Each character set in MySQL has at least one collation, and most have several.

Naming Convention

MySQL collation names follow a specific pattern:

  • They start with the character set name
  • Followed by language or purpose identifiers
  • May include additional attributes such as case sensitivity

For example:

  • utf8mb4_general_ci: UTF-8 encoding that uses general language rules and is case-insensitive
  • latin1_swedish_cs: Latin1 encoding that follows Swedish rules and is case-sensitive

Collation Levels in MySQL

MySQL supports collations at four different levels:

  1. Server level - The default for the entire server
  2. Database level - The default for a specific database
  3. Table level - The default for a specific table
  4. Column level - The collation for a specific column

Viewing Collation Information

Let's explore how to view collation settings at different levels:

sql
-- Check server character set and collation defaults
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- View all available character sets
SHOW CHARACTER SET;

-- View all available collations
SHOW COLLATION;

-- Filter to see collations for a specific character set
SHOW COLLATION WHERE Charset = 'utf8mb4';

Output for SHOW COLLATION WHERE Charset = 'utf8mb4' (partial):

+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE |
| utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE |
| utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD |
+----------------------------+---------+-----+---------+----------+---------+---------------+

To check collation settings of databases and tables:

sql
-- Check database collations
SELECT schema_name, default_character_set_name, default_collation_name
FROM information_schema.schemata;

-- Check table collations in the current database
SHOW TABLE STATUS;

-- Check column collations for a specific table
SHOW FULL COLUMNS FROM your_table_name;

Setting Collations

You can set collations when creating databases, tables, or columns, or alter existing objects:

sql
-- Create a database with a specific collation
CREATE DATABASE multilingual_app
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

-- Create a table with a specific collation
CREATE TABLE customer_feedback (
id INT AUTO_INCREMENT PRIMARY KEY,
customer_name VARCHAR(100),
feedback TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Create a table with column-specific collations
CREATE TABLE products (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
code VARCHAR(20) CHARACTER SET ascii COLLATE ascii_general_ci
);

Altering Existing Collations

sql
-- Change database collation
ALTER DATABASE your_database_name
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Change table collation
ALTER TABLE your_table_name
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Change column collation
ALTER TABLE your_table_name
MODIFY your_column_name VARCHAR(100)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Common MySQL Collation Types

MySQL offers several collation types that serve different purposes:

Case Sensitivity

Collations with these suffixes determine case sensitivity:

  • _ci - Case insensitive (e.g., 'a' = 'A')
  • _cs - Case sensitive (e.g., 'a' ≠ 'A')

Language-Specific Collations

MySQL provides collations optimized for specific languages:

  • _general - Generic rules suitable for Western European languages
  • _unicode - Based on Unicode Collation Algorithm, good for multilingual data
  • Language-specific, e.g., _german, _spanish, _swedish

Binary Collation

The _bin suffix creates a binary collation that compares strings byte by byte, treating each character according to its code point value. This is useful when you need exact binary matching.

Practical Examples

Let's see how collations affect search results:

sql
-- Create tables with different collations
CREATE TABLE users_ci (
id INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50) COLLATE utf8mb4_general_ci
);

CREATE TABLE users_cs (
id INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50) COLLATE utf8mb4_general_cs
);

-- Insert the same data in both tables
INSERT INTO users_ci (username) VALUES ('John'), ('john'), ('JOHN');
INSERT INTO users_cs (username) VALUES ('John'), ('john'), ('JOHN');

-- Case-insensitive search (returns 3 rows)
SELECT * FROM users_ci WHERE username = 'john';

-- Case-sensitive search (returns 1 row)
SELECT * FROM users_cs WHERE username = 'john';

Output for case-insensitive search:

+----+----------+
| id | username |
+----+----------+
| 1 | John |
| 2 | john |
| 3 | JOHN |
+----+----------+

Output for case-sensitive search:

+----+----------+
| id | username |
+----+----------+
| 2 | john |
+----+----------+

Example 2: Sorting Differences Between Collations

sql
-- Create tables with different collations
CREATE TABLE words_swedish (
word VARCHAR(50) COLLATE utf8mb4_swedish_ci
);

CREATE TABLE words_german (
word VARCHAR(50) COLLATE utf8mb4_german2_ci
);

-- Insert words with accented characters
INSERT INTO words_swedish (word) VALUES ('a'), ('b'), ('o'), ('p'), ('å'), ('ä'), ('ö');
INSERT INTO words_german (word) VALUES ('a'), ('b'), ('o'), ('p'), ('å'), ('ä'), ('ö');

-- Different sorting orders based on collation
SELECT word FROM words_swedish ORDER BY word;
SELECT word FROM words_german ORDER BY word;

Output for Swedish collation (where 'å', 'ä', 'ö' come after 'z'):

+------+
| word |
+------+
| a |
| b |
| o |
| p |
| å |
| ä |
| ö |
+------+

Output for German collation (where 'ä' is treated similar to 'ae'):

+------+
| word |
+------+
| a |
| ä |
| b |
| o |
| ö |
| p |
| å |
+------+

Example 3: Using Different Collations In A JOIN

sql
-- Create tables with different collations
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50) COLLATE utf8mb4_general_ci
);

CREATE TABLE departments (
id INT PRIMARY KEY,
manager VARCHAR(50) COLLATE utf8mb4_bin
);

-- Insert sample data
INSERT INTO employees VALUES (1, 'John'), (2, 'Maria');
INSERT INTO departments VALUES (1, 'john'), (2, 'maria');

-- This JOIN will fail due to collation mismatch
SELECT e.name, d.id FROM employees e
JOIN departments d ON e.name = d.manager;

The above query will return empty results because the collations don't match. To fix it, you can specify a collation in the JOIN condition:

sql
-- Explicitly set collation for the comparison
SELECT e.name, d.id FROM employees e
JOIN departments d ON e.name = d.manager COLLATE utf8mb4_general_ci;

Now the query works correctly and returns:

+-------+----+
| name | id |
+-------+----+
| John | 1 |
| Maria | 2 |
+-------+----+

Best Practices for Using Collations

1. Choose UTF-8 for New Applications

For most new applications, using utf8mb4 with an appropriate collation is recommended:

  • utf8mb4_unicode_ci - Good general-purpose collation for multiple languages
  • utf8mb4_0900_ai_ci - Modern, faster collation for MySQL 8.0 and above

2. Be Consistent

Use the same collation throughout your schema when possible, especially for columns that might be used in JOINs or comparisons.

3. Choose Based on Requirements

  • For exact matching (like passwords or hashes): Use _bin collations
  • For user-visible text in a single language: Use language-specific collations
  • For multilingual applications: Use Unicode collations

4. Consider Performance

Some collations are more computationally expensive than others. Binary (_bin) collations are generally faster, while complex Unicode collations might be slower.

5. Test Your Application

Test sorting and comparison with real data to ensure the chosen collation meets your requirements.

Common Issues and Troubleshooting

1. "Illegal mix of collations" Error

This error occurs when MySQL tries to compare strings with incompatible collations:

sql
-- Fix with COLLATE clause
SELECT * FROM table1 JOIN table2
ON table1.column = table2.column COLLATE utf8mb4_general_ci;

2. Unexpected Sorting Results

If you're getting unexpected sort orders, check your collation:

sql
-- Force a specific collation for sorting
SELECT name FROM customers
ORDER BY name COLLATE utf8mb4_swedish_ci;

3. Case-Sensitivity Issues

If search results don't match as expected, consider the case sensitivity of your collation:

sql
-- Make a case-insensitive comparison regardless of column collation
SELECT * FROM users
WHERE username COLLATE utf8mb4_general_ci = 'john';

Summary

MySQL collations play a crucial role in how string comparison and sorting work in your database. Understanding and correctly setting collations helps ensure your application behaves consistently across different languages and character sets.

Key points to remember:

  • Collations define the rules for comparing and sorting strings
  • They can be set at server, database, table, or column level
  • Choose collations based on your language requirements and performance needs
  • The most widely used collations for modern applications are utf8mb4_unicode_ci and utf8mb4_0900_ai_ci
  • Case sensitivity and language-specific sorting are determined by the collation

By selecting appropriate collations, you can ensure your database correctly handles text in any language and provides consistent search and sorting behavior.

Additional Resources and Exercises

Resources

Exercises

  1. Create a database for a multilingual blog with appropriate collations.
  2. Compare the sorting order of accented characters in different language collations.
  3. Write a query to search case-insensitively in a case-sensitive column without changing the column's collation.
  4. Investigate the performance differences between different collations using the EXPLAIN command.
  5. Create a table to store user passwords with an appropriate collation for exact matching.


If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)