Python Unicode Handling

Introduction

When you write programs that work with text in different languages, or with special symbols, understanding Unicode becomes essential. Unicode is a standard that provides a unique number for every character, regardless of the platform, program, or language. In Python 3, all strings are Unicode by default, which makes the language well suited to handling text in any language.

In this tutorial, you'll learn how Python handles Unicode, how to work with different character encodings, and how to solve common problems when dealing with international text.

What is Unicode?

Unicode is an international standard that assigns a unique code point (a number) to every character used in written languages around the world. Before Unicode, different encoding systems were used for different languages, making it difficult to display multiple languages in the same document.

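Python 3 stores every string (the str type) as a sequence of Unicode code points, which means you can work with characters from any language without special handling. Unicode itself is not an encoding; encodings such as UTF-8 come into play only when a string is converted to bytes.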

Python Strings and Unicode

In Python 3, all strings are Unicode strings by default. This means you can include characters from different languages directly in your string literals:

python
# Basic Latin characters
hello = "Hello"

# Non-English characters
chinese_hello = "你好"
russian_hello = "Привет"
emoji = "😊"

print(hello)
print(chinese_hello)
print(russian_hello)
print(emoji)

Output:

Hello
你好
Привет
😊

Unicode Code Points and the ord() Function

Every Unicode character has a unique code point. You can find the code point of a character using the ord() function:

python
print(ord('A'))  # Latin capital A
print(ord('а')) # Cyrillic small letter a
print(ord('€')) # Euro sign
print(ord('😊')) # Smiling face emoji

Output:

65
1072
8364
128522

Conversely, you can get a character from its code point using the chr() function:

python
print(chr(65))      # Latin capital A
print(chr(1072)) # Cyrillic small letter a
print(chr(8364)) # Euro sign
print(chr(128522)) # Smiling face emoji

Output:

A
а
€
😊
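Code points are conventionally written in hexadecimal with a U+ prefix, which is also the form used in the escape sequences covered next. A minimal sketch of that notation follows; the helper name code_point_label is hypothetical and exists only for this illustration:

```python
def code_point_label(char):
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(char):04X}"

for char in ["A", "а", "€", "😊"]:
    # chr() reverses ord(), so the round trip gives back the same character
    assert chr(ord(char)) == char
    print(char, code_point_label(char))
```

The 04X format pads to at least four hexadecimal digits, so the emoji prints with five digits while the others print with four.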

Unicode Escape Sequences

You can represent Unicode characters in Python strings using escape sequences:

python
# Unicode escape with hexadecimal value
print('\u0041') # Latin capital A
print('\u0430') # Cyrillic small letter a
print('\u20AC') # Euro sign

# For characters beyond the Basic Multilingual Plane (like emojis)
print('\U0001F60A') # Smiling face emoji

Output:

A
а
€
😊
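Besides numeric escapes, Python also accepts the \N{...} escape, which refers to a character by its official Unicode name. The sketch below pairs it with unicodedata.name() from the standard library, which looks up the name of a given character:

```python
import unicodedata

# \N{...} looks a character up by its Unicode name
euro = "\N{EURO SIGN}"
print(euro == "\u20ac")

# unicodedata.name() goes the other way: from character to name
for char in ["A", "а", "€"]:
    print(char, unicodedata.name(char))
```

Named escapes are longer to type, but they can make a string literal easier to review than a bare hexadecimal value.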

Encoding and Decoding

While Python strings are Unicode internally, when you need to store them in files or transmit them over networks, they need to be encoded into bytes. Encoding is the process of converting Unicode strings to bytes, and decoding is the process of converting bytes back to Unicode strings.

Common Encodings

  • UTF-8: A variable-width encoding that can represent any Unicode character. It's the most commonly used encoding on the web.
  • ASCII: A 7-bit encoding that can only represent basic Latin characters.
  • Latin-1 (ISO-8859-1): An 8-bit encoding that can represent Western European characters.
  • UTF-16: A variable-width encoding that uses either 2 or 4 bytes per character.
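
To make "variable-width" concrete, the following sketch counts how many bytes a few sample characters occupy in UTF-8 and UTF-16. The sample characters are chosen only for illustration:

```python
samples = ["A", "é", "€", "😊"]

for char in samples:
    utf8_len = len(char.encode("utf-8"))
    # "utf-16-le" fixes the byte order, so no byte order mark is prepended
    utf16_len = len(char.encode("utf-16-le"))
    print(f"U+{ord(char):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

UTF-8 uses between one and four bytes per character, while UTF-16 uses two bytes for most characters and four for those outside the Basic Multilingual Plane, such as emoji.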

Encoding Example

python
text = "Hello, 你好, Привет, 👋"

# Encoding to different formats
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')
try:
    ascii_bytes = text.encode('ascii')
except UnicodeEncodeError as e:
    print(f"ASCII encoding error: {e}")

print(f"Original text: {text}")
print(f"UTF-8 bytes: {utf8_bytes}")
print(f"UTF-16 bytes: {utf16_bytes}")

Output:

ASCII encoding error: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Original text: Hello, 你好, Привет, 👋
UTF-8 bytes: b'Hello, \xe4\xbd\xa0\xe5\xa5\xbd, \xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xf0\x9f\x91\x8b'
UTF-16 bytes: b'\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00`O}Y,\x00 \x00\x1f\x04@\x048\x042\x045\x04B\x04,\x00 \x00=\xd8K\xdc'

Notice that the ASCII encoding fails because ASCII covers only code points 0 through 127, and the Chinese, Cyrillic, and emoji characters all lie outside that range.

Decoding Example

python
# Decoding bytes back to strings
decoded_utf8 = utf8_bytes.decode('utf-8')
decoded_utf16 = utf16_bytes.decode('utf-16')

print(f"Decoded UTF-8: {decoded_utf8}")
print(f"Decoded UTF-16: {decoded_utf16}")

# What happens when we use the wrong encoding?
try:
    wrong_decoding = utf8_bytes.decode('latin-1')
    print(f"Incorrect decoding (UTF-8 as Latin-1): {wrong_decoding}")
except UnicodeDecodeError as e:
    print(f"Decoding error: {e}")

Output:

Decoded UTF-8: Hello, 你好, Привет, 👋
Decoded UTF-16: Hello, 你好, Привет, 👋
Incorrect decoding (UTF-8 as Latin-1): Hello, ä½ å¥½, ÐÑивеÑ, ð

Notice that when we decode UTF-8 bytes as Latin-1, we don't get an error but the text is garbled. This happens because Latin-1 can decode any byte value (0-255), but the resulting characters may not match the original text.
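
Because Latin-1 maps every byte to exactly one character, this particular kind of garbling can be reversed as long as the garbled string has not been altered further. A minimal sketch, assuming the text was originally UTF-8 and was decoded as Latin-1 by mistake:

```python
original = "Привет"

# Simulate the mistake: UTF-8 bytes decoded as Latin-1
garbled = original.encode("utf-8").decode("latin-1")

# Undo it: Latin-1 turns each character back into the byte it came from,
# and those bytes can then be decoded with the correct encoding
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == original)
```

This round trip works only for encodings like Latin-1 that accept every byte value; it is not a general repair for text decoded with an arbitrary wrong encoding.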

Handling File Encodings

When reading from or writing to files, you need to specify the encoding to ensure proper handling of Unicode characters:

python
# Writing Unicode text to a file
with open('unicode_example.txt', 'w', encoding='utf-8') as f:
    f.write("Hello, 你好, Привет, 👋")

# Reading Unicode text from a file
with open('unicode_example.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    print(f"File content: {content}")

Output:

File content: Hello, 你好, Привет, 👋
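
The open() function also accepts an errors argument, and binary mode ('rb') skips decoding entirely. The sketch below writes a small file into a temporary directory, then reads it back once as raw bytes and once with a deliberately wrong encoding. The file name example.txt is a placeholder chosen for this illustration:

```python
import os
import tempfile

text = "Hello, 你好"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # placeholder file name

    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode returns the bytes exactly as stored on disk
    with open(path, "rb") as f:
        raw = f.read()

    # Wrong encoding on purpose; errors="replace" substitutes U+FFFD
    # for every byte that ASCII cannot decode instead of raising
    with open(path, "r", encoding="ascii", errors="replace") as f:
        lossy = f.read()

print(raw)
print(lossy)
```

Reading in binary mode is a reasonable first step when the encoding of a file is unknown, since it never fails and lets you inspect the bytes before choosing how to decode them.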

Detecting and Handling Encoding Issues

Common encoding issues occur when:

  1. The wrong encoding is used for decoding
  2. The encoding cannot represent certain characters

Handling UnicodeDecodeError

This error occurs when the bytes you are decoding are not valid in the encoding you specify:

python
chinese_bytes = "你好".encode('utf-8')

try:
    chinese_text = chinese_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(f"Cannot decode with ASCII: {e}")
    # Use a suitable encoding instead
    chinese_text = chinese_bytes.decode('utf-8')
    print(f"Correctly decoded: {chinese_text}")

Output:

Cannot decode with ASCII: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
Correctly decoded: 你好

Handling UnicodeEncodeError

This error occurs when your encoding can't represent certain characters:

python
multilingual = "Hello, 你好, Привет, 👋"

try:
    ascii_bytes = multilingual.encode('ascii')
except UnicodeEncodeError as e:
    print(f"Cannot encode with ASCII: {e}")
    # Use error handling to replace or ignore problematic characters
    ascii_bytes_replace = multilingual.encode('ascii', errors='replace')
    ascii_bytes_ignore = multilingual.encode('ascii', errors='ignore')
    print(f"With replacement: {ascii_bytes_replace}")
    print(f"With ignored characters: {ascii_bytes_ignore}")

Output:

Cannot encode with ASCII: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
With replacement: b'Hello, ??, ??????, ?'
With ignored characters: b'Hello, , , '

Error Handling Options

Python provides several error handling strategies when encoding or decoding:

  • 'strict': Default behavior - raises an exception if there's an encoding/decoding error
  • 'replace': Replaces characters that can't be encoded/decoded with a replacement marker
  • 'ignore': Skips characters that can't be encoded/decoded
  • 'backslashreplace': Replaces with Python backslashed escape sequences
  • 'xmlcharrefreplace': Replaces with XML character references (only for encoding)

python
text = "Hello, 你好!"

for error_mode in ['strict', 'replace', 'ignore', 'backslashreplace', 'xmlcharrefreplace']:
    try:
        encoded = text.encode('ascii', errors=error_mode)
        print(f"{error_mode}: {encoded}")
    except UnicodeEncodeError as e:
        print(f"{error_mode}: {e}")

Output:

strict: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
replace: b'Hello, ??!'
ignore: b'Hello, !'
backslashreplace: b'Hello, \\u4f60\\u597d!'
xmlcharrefreplace: b'Hello, &#20320;&#22909;!'
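
The same errors argument applies when decoding. The sketch below decodes UTF-8 bytes as ASCII under three of the handlers; 'xmlcharrefreplace' is left out because it works only for encoding:

```python
data = "Hello, 你好!".encode("utf-8")

results = {}
for error_mode in ["replace", "ignore", "backslashreplace"]:
    # Each byte outside the ASCII range is handled according to error_mode
    results[error_mode] = data.decode("ascii", errors=error_mode)
    print(f"{error_mode}: {results[error_mode]!r}")
```

When decoding, 'replace' inserts the Unicode replacement character U+FFFD once per undecodable byte rather than a question mark, so the two Chinese characters become six replacement characters here.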

Real-World Applications

Handling User Input in Web Applications

When building web applications, you often need to handle user input that may include characters from different languages:

python
def process_user_comment(comment):
    # Make sure the comment is a Unicode string
    if isinstance(comment, bytes):
        comment = comment.decode('utf-8')

    # Process the comment (e.g., count characters)
    char_count = len(comment)

    # Store in a database (typically requires encoding)
    encoded_comment = comment.encode('utf-8')

    return {
        'original': comment,
        'char_count': char_count,
        'encoded_size': len(encoded_comment)
    }

# Example user comments
comments = [
    "I love Python!",
    "Python是最好的编程语言!",
    "Я люблю программировать на Python!",
    "Python is 💯"
]

for comment in comments:
    result = process_user_comment(comment)
    print(f"Comment: {result['original']}")
    print(f"Character count: {result['char_count']}")
    print(f"Encoded size (bytes): {result['encoded_size']}")
    print()

Output:

Comment: I love Python!
Character count: 14
Encoded size (bytes): 14

Comment: Python是最好的编程语言!
Character count: 15
Encoded size (bytes): 31

Comment: Я люблю программировать на Python!
Character count: 34
Encoded size (bytes): 57

Comment: Python is 💯
Character count: 11
Encoded size (bytes): 14

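Notice how the character count and byte size can differ significantly depending on the language used. In UTF-8, each ASCII character takes one byte, each Cyrillic letter two, each Chinese character three, and the emoji four.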
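
This difference matters whenever a size limit is expressed in bytes rather than characters. As a hedged sketch, suppose a storage field accepts at most a fixed number of UTF-8 bytes (an assumption made for this example); slicing the encoded bytes directly could cut a multi-byte character in half. The helper name truncate_utf8 is hypothetical:

```python
def truncate_utf8(text, max_bytes):
    """Return text shortened so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte sequence that was cut at the end
    return encoded.decode("utf-8", errors="ignore")

print(truncate_utf8("Python is 💯", 12))
print(truncate_utf8("你好", 4))
```

In the first call the emoji needs four bytes but only two fit, so it is dropped whole; in the second call only the first Chinese character fits within four bytes.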

Internationalization (i18n) Example

If you're building an application for users around the world, you might need to handle translations:

python
def translate_greeting(language_code):
    greetings = {
        'en': 'Hello, world!',
        'es': '¡Hola, mundo!',
        'fr': 'Bonjour, monde!',
        'zh': '你好,世界!',
        'ru': 'Привет, мир!',
        'ar': 'مرحبا بالعالم!',  # Right-to-left language
        'ja': 'こんにちは、世界!'
    }

    return greetings.get(language_code, greetings['en'])

# Test different languages
for lang in ['en', 'es', 'zh', 'ar', 'ja']:
    greeting = translate_greeting(lang)
    print(f"{lang}: {greeting}")

Output:

en: Hello, world!
es: ¡Hola, mundo!
zh: 你好,世界!
ar: مرحبا بالعالم!
ja: こんにちは、世界!

Summary

Understanding Unicode handling in Python is crucial for developing applications that can work with text in any language. Here's what we've covered:

  1. Unicode Basics: Python 3 strings are Unicode by default
  2. Code Points: Using ord() and chr() to work with Unicode code points
  3. Escape Sequences: Using \u and \U to represent Unicode characters in string literals
  4. Encoding/Decoding: Converting between Unicode strings and bytes
  5. File Handling: Properly reading and writing files with the right encoding
  6. Error Handling: Strategies for dealing with encoding and decoding errors
  7. Real-World Applications: Handling international user input and translations

Python's Unicode support makes it an excellent choice for developing international applications. Just remember to always specify encodings when working with external systems like files, databases, or network communications.

Exercises

  1. Write a function that counts characters in a string but treats combining characters (like accent marks) as part of the base character.
  2. Create a program that detects the encoding of a file. (Hint: You might want to look at the chardet library.)
  3. Write a function that normalizes Unicode text by converting it to a standard form (look up Unicode normalization).
  4. Create a simple transliteration function that converts non-ASCII characters to their ASCII approximations (e.g., "café" to "cafe").
  5. Build a small utility that can convert text files between different encodings while handling encoding errors gracefully.
