C# Unicode Strings

Introduction

When you work with text in C#, you're dealing with Unicode strings by default. This is a powerful feature that allows your applications to handle text in virtually any language and writing system from around the world. Whether you need to process English, Japanese, Arabic, emoji, or mathematical symbols, C# strings have you covered.

In this tutorial, we'll explore how C# implements Unicode strings, why this matters, and how you can leverage this capability in your applications.

What is Unicode?

Before diving into C# specifics, let's understand what Unicode is:

Unicode is an international standard for representing text in computer systems. It assigns a unique number (code point) to every character, regardless of platform, program, or language. Unicode aims to encompass all characters used in all writing systems worldwide.

Some key points about Unicode:

  • As of version 15.0, Unicode defines 149,186 characters, and new versions keep adding more
  • It covers historical scripts, mathematical symbols, emoji, and more
  • Common encodings include UTF-8, UTF-16, and UTF-32
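
To make the difference between these encodings concrete, here is a minimal sketch that counts how many bytes the same text needs in each one. The helper name `ByteCounts` is made up for this example; the `Encoding` members it calls are part of `System.Text`.

```csharp
using System;
using System.Text;

// Bytes needed to store the same text in each encoding (no byte order mark included).
static (int Utf8, int Utf16, int Utf32) ByteCounts(string text) =>
    (Encoding.UTF8.GetByteCount(text),
     Encoding.Unicode.GetByteCount(text),   // Encoding.Unicode is UTF-16 (little-endian)
     Encoding.UTF32.GetByteCount(text));

// "A", "é" (U+00E9), "世" (U+4E16), "😊" (U+1F60A)
foreach (string sample in new[] { "A", "\u00E9", "\u4E16", "\U0001F60A" })
{
    var (utf8, utf16, utf32) = ByteCounts(sample);
    Console.WriteLine($"{sample}: UTF-8 = {utf8}, UTF-16 = {utf16}, UTF-32 = {utf32} bytes");
}
// UTF-8 needs 1, 2, 3 and 4 bytes for these samples,
// UTF-16 needs 2, 2, 2 and 4 bytes, and UTF-32 always needs 4.
```

UTF-8 is the most compact for ASCII-heavy text, while UTF-32 trades space for a fixed width per code point.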

How C# Implements Unicode

In C#, the string type internally stores text as a sequence of UTF-16 code units (16-bit values). This is important to understand:

csharp
// A simple string in C#
string greeting = "Hello, world!";

// A string with non-ASCII characters
string multilingual = "Hello, 世界, Здравствуйте, مرحبا";

Console.WriteLine(greeting); // Output: Hello, world!
Console.WriteLine(multilingual); // Output: Hello, 世界, Здравствуйте, مرحبا

C# handles all this text seamlessly because strings are Unicode by default. This means:

  1. You don't need special string types to handle international characters
  2. You can mix characters from different writing systems in the same string
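  3. Most string operations work on any Unicode text, but they operate on UTF-16 code units rather than whole characters, which matters for text outside the Basic Multilingual Plane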

Character Encoding in C#

Character vs. Code Point

It's important to understand that a Unicode character doesn't always correspond to a single 16-bit value in C#:

csharp
// Some characters require more than one char unit
string emoji = "😊"; // Smiling face emoji
Console.WriteLine(emoji.Length); // Output: 2

The output is 2 because characters outside the Basic Multilingual Plane (code points above U+FFFF), which includes many emoji, need two 16-bit code units in UTF-16. Such a pair of code units is called a "surrogate pair".
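
The arithmetic behind a surrogate pair is short enough to write out. The sketch below splits a code point into its two code units by hand and then checks the result against the built-in `char.ConvertFromUtf32`. The helper name `ToSurrogatePair` is hypothetical; it exists only for illustration.

```csharp
using System;

// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
static (char High, char Low) ToSurrogatePair(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;                     // 20-bit value
    char highUnit = (char)(0xD800 + (offset >> 10));      // top 10 bits
    char lowUnit = (char)(0xDC00 + (offset & 0x3FF));     // bottom 10 bits
    return (highUnit, lowUnit);
}

var (hi, lo) = ToSurrogatePair(0x1F60A); // 😊
Console.WriteLine($"U+1F60A -> {(int)hi:X4} {(int)lo:X4}"); // U+1F60A -> D83D DE0A

// The built-in conversion produces the same two code units.
string builtIn = char.ConvertFromUtf32(0x1F60A);
Console.WriteLine(builtIn[0] == hi && builtIn[1] == lo); // True
```

In everyday code you would normally rely on `char.ConvertFromUtf32` and `char.ConvertToUtf32` rather than doing the bit manipulation yourself.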

The char Type

In C#, the char type represents a single UTF-16 code unit, not necessarily a complete Unicode character:

csharp
char letterA = 'A';
Console.WriteLine(letterA); // Output: A
Console.WriteLine((int)letterA); // Output: 65 (the Unicode code point)

// Working with Unicode escapes
char copyrightSymbol = '\u00A9';
Console.WriteLine(copyrightSymbol); // Output: ©
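
When you need whole code points rather than UTF-16 code units, one option is the `System.Text.Rune` type. The sketch below assumes .NET Core 3.0 or later, where `Rune` and `string.EnumerateRunes()` are available; the helper name `CountCodePoints` is made up for this example.

```csharp
using System;
using System.Text;

// Count Unicode code points (scalar values) instead of UTF-16 code units.
// Assumes .NET Core 3.0 or later, where System.Text.Rune exists.
static int CountCodePoints(string text)
{
    int count = 0;
    foreach (Rune r in text.EnumerateRunes())
        count++;
    return count;
}

string sample = "A\U0001F60A"; // "A" followed by 😊
Console.WriteLine(sample.Length);            // 3 (UTF-16 code units)
Console.WriteLine(CountCodePoints(sample));  // 2 (code points)

foreach (Rune rune in sample.EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4} uses {rune.Utf16SequenceLength} char(s)");
}
// U+0041 uses 1 char(s)
// U+1F60A uses 2 char(s)
```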

Working with Unicode in Strings

Unicode Escape Sequences

You can use Unicode escape sequences to include any Unicode character in a string:

csharp
string heart = "\u2764";  // Unicode escape for ❤
Console.WriteLine(heart); // Output: ❤

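// Characters beyond the Basic Multilingual Plane (BMP) don't fit in a
// 4-digit \u escape: write the surrogate pair, or use the 8-digit \U form
string musicalNote = "\uD834\uDD1E"; // 𝄞 (musical G clef symbol)
string sameNote = "\U0001D11E";      // The same character as a single escape
Console.WriteLine(musicalNote); // Output: 𝄞
Console.WriteLine(musicalNote == sameNote); // Output: True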

String Normalization

Unicode provides multiple ways to represent some characters. For example, "é" can be represented as a single code point or as "e" followed by a combining accent mark. This can cause comparison issues:

csharp
using System.Text; // For NormalizationForm; using directives must come first in the file

string precomposed = "\u00E9"; // é as a single code point
string decomposed = "e\u0301"; // e followed by a combining acute accent

Console.WriteLine(precomposed); // Output: é
Console.WriteLine(decomposed); // Output: é (looks the same visually)
Console.WriteLine(precomposed == decomposed); // Output: False (different representations)

// Solution: Normalize both strings to the same form before comparing
bool areEqual = string.Equals(
    precomposed.Normalize(NormalizationForm.FormC),
    decomposed.Normalize(NormalizationForm.FormC)
);
Console.WriteLine(areEqual); // Output: True
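
To see what normalization actually changes, the following sketch converts the same text between the composed form (NFC) and the decomposed form (NFD) and prints the resulting lengths. The helper name `EqualsNormalized` is hypothetical, and choosing NFC as the common form is an assumption; any form works as long as both sides use the same one.

```csharp
using System;
using System.Text;

// Compare two strings after converting both to the same normalization form (NFC here).
static bool EqualsNormalized(string a, string b) =>
    string.Equals(
        a.Normalize(NormalizationForm.FormC),
        b.Normalize(NormalizationForm.FormC),
        StringComparison.Ordinal);

string composed = "\u00E9";   // é as one code point
string combining = "e\u0301"; // e + combining acute accent

Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length);  // 2
Console.WriteLine(combining.Normalize(NormalizationForm.FormC).Length); // 1
Console.WriteLine(combining.IsNormalized(NormalizationForm.FormC));     // False
Console.WriteLine(EqualsNormalized(composed, combining));               // True
```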

Practical Applications

Reading and Writing Files with Different Encodings

When reading or writing text files, specify the encoding explicitly so that the bytes on disk match what the reader expects:

csharp
using System.Text;
using System.IO;

// Writing a file with UTF-8 encoding (most common for Unicode text)
string content = "Hello, 世界! Здравствуйте! مرحبا!";
File.WriteAllText("sample.txt", content, Encoding.UTF8);

// Reading the file back
string readContent = File.ReadAllText("sample.txt", Encoding.UTF8);
Console.WriteLine(readContent); // Output: Hello, 世界! Здравствуйте! مرحبا!

// Using a different encoding (not recommended for Unicode)
File.WriteAllText("sample-ascii.txt", content, Encoding.ASCII);
string asciiContent = File.ReadAllText("sample-ascii.txt", Encoding.ASCII);
Console.WriteLine(asciiContent); // Output: Hello, ??! ????????????! ?????!
// Notice that each non-ASCII character is replaced with ?
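
Silent replacement with `?` is easy to miss. One way to catch it early, sketched below, is to ask for an encoding that throws instead of substituting. The helper name `IsAsciiSafe` is made up for this example. The last two lines also show that `Encoding.UTF8` writes a byte order mark (BOM), while `new UTF8Encoding(false)` does not.

```csharp
using System;
using System.Text;

// Returns true if every character in the text can be encoded as ASCII without loss.
static bool IsAsciiSafe(string text)
{
    // A strict encoder: throws on unencodable characters instead of writing '?'.
    Encoding strictAscii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    try
    {
        strictAscii.GetBytes(text);
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}

Console.WriteLine(IsAsciiSafe("Hello"));               // True
Console.WriteLine(IsAsciiSafe("Hello, \u4E16\u754C")); // False (世界 is not ASCII)

// Encoding.UTF8 emits a 3-byte BOM; new UTF8Encoding(false) emits none.
Console.WriteLine(Encoding.UTF8.GetPreamble().Length);           // 3
Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
```

Whether you want the BOM depends on who reads the file: some tools expect it, others treat it as stray bytes.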

Creating a Multilingual User Interface

C# Unicode support makes it easy to build applications that work in any language:

csharp
using System.Collections.Generic;

// Simple multilingual greeting based on user locale
Dictionary<string, string> greetings = new Dictionary<string, string>
{
    { "en", "Welcome to our application!" },
    { "es", "¡Bienvenido a nuestra aplicación!" },
    { "ja", "アプリケーションへようこそ!" },
    { "ar", "مرحبا بكم في التطبيق لدينا!" },
    { "ru", "Добро пожаловать в наше приложение!" }
};

// Get user's culture (hard-coded to Spanish for this example)
string userCulture = "es";

// Display the matching greeting, falling back to English
if (!greetings.TryGetValue(userCulture, out string greeting))
{
    greeting = greetings["en"];
}
Console.WriteLine(greeting);
// Output: ¡Bienvenido a nuestra aplicación!
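
In a real application the language code would come from the runtime rather than a hard-coded string. Here is a minimal sketch that derives it from a `CultureInfo`. The helper name `GreetingFor` and the shortened greetings are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

var greetings = new Dictionary<string, string>
{
    { "en", "Welcome!" },
    { "es", "¡Bienvenido!" },
    { "ja", "ようこそ!" }
};

// Look up a greeting by the culture's two-letter language code, falling back to English.
string GreetingFor(CultureInfo culture) =>
    greetings.TryGetValue(culture.TwoLetterISOLanguageName, out var greeting)
        ? greeting
        : greetings["en"];

Console.WriteLine(GreetingFor(CultureInfo.GetCultureInfo("es-MX"))); // ¡Bienvenido!
Console.WriteLine(GreetingFor(CultureInfo.GetCultureInfo("fr-FR"))); // Welcome!
Console.WriteLine(GreetingFor(CultureInfo.CurrentUICulture));        // Depends on the machine
```

Matching on the two-letter language code means regional variants such as es-MX and es-ES share one entry, which is usually what a small lookup table wants.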

Processing Unicode Characters

Let's create a utility function to analyze the Unicode code points in a string:

csharp
static void AnalyzeString(string text)
{
    Console.WriteLine($"String: {text}");
    // Length counts UTF-16 code units, not visible characters
    Console.WriteLine($"Length: {text.Length} char units");

    Console.WriteLine("Unicode code points:");
    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];

        // Check if this is the start of a surrogate pair
        if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
        {
            int codePoint = char.ConvertToUtf32(c, text[i + 1]);
            Console.WriteLine($"  U+{codePoint:X5} (surrogate pair at positions {i} & {i + 1})");
            i++; // Skip the low surrogate in the next iteration
        }
        else
        {
            Console.WriteLine($"  U+{(int)c:X4} at position {i}");
        }
    }
    Console.WriteLine(); // Blank line between analyses
}

// Let's test it
AnalyzeString("Hello");
AnalyzeString("こんにちは");
AnalyzeString("😊");

/* Output:
String: Hello
Length: 5 char units
Unicode code points:
  U+0048 at position 0
  U+0065 at position 1
  U+006C at position 2
  U+006C at position 3
  U+006F at position 4

String: こんにちは
Length: 5 char units
Unicode code points:
  U+3053 at position 0
  U+3093 at position 1
  U+306B at position 2
  U+3061 at position 3
  U+306F at position 4

String: 😊
Length: 2 char units
Unicode code points:
  U+1F60A (surrogate pair at positions 0 & 1)
*/
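
Code points are still not the same thing as what a reader perceives as one character: an "e" followed by a combining accent is two code points but one visible character. One option for counting those is `System.Globalization.StringInfo`, sketched below. The helper name `CountTextElements` is made up for this example, and the samples are deliberately simple because the exact grouping rules for complex emoji sequences differ between .NET versions.

```csharp
using System;
using System.Globalization;

// Count user-perceived characters (text elements) rather than code units or code points.
static int CountTextElements(string text) =>
    new StringInfo(text).LengthInTextElements;

Console.WriteLine(CountTextElements("Hello"));      // 5
Console.WriteLine(CountTextElements("e\u0301"));    // 1 (e + combining acute accent)
Console.WriteLine(CountTextElements("\U0001F60A")); // 1 (one emoji, two UTF-16 code units)

// Walk the text elements one by one; ElementIndex is the starting char position.
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator("e\u0301x");
while (elements.MoveNext())
{
    Console.WriteLine($"{elements.ElementIndex}: {elements.GetTextElement()}");
}
// 0: é
// 2: x
```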

Best Practices for Working with Unicode

  1. Always assume text can contain any Unicode character: Design your code to handle international text from the start.

  2. Use normalization when comparing strings: Call string.Normalize() before comparing strings that might have combining characters.

  3. Be careful with string length: Remember that string.Length returns the number of UTF-16 code units, not visual characters.

  4. Use proper encodings: Use UTF-8 for saving files and network communication in most cases.

  5. Consider culture in string operations: Sorting and case conversion can be culture-dependent:

csharp
using System.Globalization;

// Case conversion with culture
string text = "istanbul";
string upperInvariant = text.ToUpperInvariant();
string upperTurkish = text.ToUpper(CultureInfo.GetCultureInfo("tr-TR"));

Console.WriteLine(upperInvariant); // Output: ISTANBUL
Console.WriteLine(upperTurkish); // Output: İSTANBUL (notice the dotted İ)
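
The flip side of point 5 is that text which is not meant for human readers (identifiers, keywords, file extensions) should usually be compared ordinally, so the result does not change with the user's culture. A minimal sketch follows. The helper name `IsSameIdentifier` is hypothetical, and the Turkish result assumes the runtime has Turkish culture data available.

```csharp
using System;
using System.Globalization;

// Compare non-linguistic text ordinally so the result is the same on every machine.
static bool IsSameIdentifier(string a, string b) =>
    string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

Console.WriteLine(IsSameIdentifier("title", "TITLE")); // True

// Culture-sensitive casing can break such checks: under Turkish rules,
// 'i' uppercases to 'İ' (U+0130), so the result no longer equals "TITLE".
CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");
string turkishUpper = "title".ToUpper(turkish);
Console.WriteLine(turkishUpper == "TITLE"); // False when Turkish culture data is available
```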

Summary

C# provides excellent Unicode support through its string implementation, allowing you to:

  • Work with text in any language or writing system
  • Process Unicode characters with standard string operations
  • Handle different text encodings for file and network operations
  • Support international users in your applications

Understanding how C# implements Unicode strings is essential for developing applications that can be used globally and handle text data correctly.

Exercises

  1. Write a program that counts the number of actual visible characters in a string, accounting for surrogate pairs.
  2. Create a function that detects which languages might be present in a given string.
  3. Implement a simple text encoder that converts a string to and from different encodings (UTF-8, UTF-16, ASCII) and displays what happens to non-ASCII characters.
  4. Write a program that takes a string and reverses it properly, being careful with surrogate pairs and combining characters.
