Unicode Character Encoding

This is a concise yet thorough description of Unicode as used for software internationalization and localization.

Character	Unicode	Dec.	Rev.
a	\u0250	ɐ	ɐ
b	-	-	q
c	\u0254	ɔ	ɔ
e	\u01DD	ǝ	ǝ
f	\u025F	ɟ	ɟ
g	\u0183	ƃ	ƃ
h	\u0265	ɥ	ɥ
i	\u0131	ı	ı
j	\u027E	ɾ	ɾ
k	\u029E	ʞ	ʞ
l	\u05DF	ן	ן
m	\u026F	ɯ	ɯ
n	-	-	u
r	\u0279	ɹ	ɹ
t	\u0287	ʇ	ʇ
v	\u028C	ʌ	ʌ
w	\u028D	ʍ	ʍ
y	\u028E	ʎ	ʎ

Example of Issues With Unicode

This may be "useful" for people in the Southern Hemisphere (Australians, etc.)? (ha ha)

Unicode Generator

Upper-case letters are converted to lower-case letters (and numbers are not flipped) because the current Unicode set doesn't have upside-down glyphs for all capital letters (nor for numbers).
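The generator described above can be sketched in a few lines of Python. This is only an illustration, not the page's actual generator: the `FLIP` table and the `flip` function are hypothetical names, the mapping uses the code points from the Unicode column of the table, and letters with no entry in the table (such as d, o or s) are simply passed through unchanged.

```python
# Mapping copied from the table's "Unicode" column (hypothetical name FLIP).
# Letters that the table does not list are left as they are.
FLIP = {
    "a": "\u0250", "b": "q", "c": "\u0254", "e": "\u01DD", "f": "\u025F",
    "g": "\u0183", "h": "\u0265", "i": "\u0131", "j": "\u027E", "k": "\u029E",
    "l": "\u05DF", "m": "\u026F", "n": "u", "r": "\u0279", "t": "\u0287",
    "v": "\u028C", "w": "\u028D", "y": "\u028E",
}


def flip(text):
    """Lower-case the text, reverse it, and swap each letter for its flipped form."""
    lowered = text.lower()  # capitals have no complete set of flipped glyphs
    return "".join(FLIP.get(ch, ch) for ch in reversed(lowered))


if __name__ == "__main__":
    print(flip("Cat"))
```

Reversing the string matters as much as swapping the glyphs: text turned upside down also reads from the other end.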

The Unicode Standard

The Unicode Standard, developed by the international Unicode Consortium, defines hexadecimal code values (code points, prefixed with U+) that consistently identify the world's characters and symbols. Code points are assigned in ranges called blocks.

As of Unicode 3.1, the range U+10000 to U+10FFFF is divided into 16 planes. Only three of them have so far been used, to encode supplementary characters needed primarily for historical and classical literary documents from the rich heritage of the Chinese, Korean, and Japanese (Asian) languages.
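Because each plane holds 65,536 code points, the plane number is just the code point with its low 16 bits dropped. A minimal sketch, with `plane_of` and `is_supplementary` as hypothetical helper names:

```python
def plane_of(code_point):
    """Return the plane (0 to 16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # each plane spans 0x10000 code points


def is_supplementary(code_point):
    """True for code points in U+10000..U+10FFFF (planes 1 to 16)."""
    return plane_of(code_point) >= 1


if __name__ == "__main__":
    for cp in (0x0394, 0x10000, 0x20000, 0x10FFFF):
        print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Plane 0 covers U+0000 to U+FFFF; the 16 planes mentioned above are planes 1 through 16.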

For storage and interchange outside a program's memory, Unicode text is serialized in one of the Unicode Transformation Formats (UTF). UTF-8, for example, is defined by IETF's RFC 3629. The ISO/IEC 10646 Annex D standard also uses the term "UCS Transformation Format" for UTF.

The sample below shows the Greek capital letter Δ (Delta, U+0394, which also appears in code page 1253) in each of the three UTF formats.

Sample	Format

Δ	SGML/HTML entity code. The ISO 8859-1 "Latin-1" character set is not Unicode. It is a fixed, single-byte set of 256 characters, the first 128 of which are ASCII. Δ is not one of them, so a Latin-1 page has to write it as an entity.

0xCE 0x94	UTF-8 data consists of a variable number of 8-bit bytes: UTF-8 uses as many bytes as needed to encode each character. It remains a simple, single-byte, ASCII-compatible encoding method for characters at or below 127 (which does not include the Euro currency € at 128 in Windows-1252, Pound currency £ at 163, nor the copyright © character at 169).
Two bytes are used for characters at or below 2047 (hex 0x07FF).
Three bytes are used for code points (nicknamed "magic numbers") up to 65535 (hex 0xFFFF), most of them Asian characters; four bytes are used for the supplementary characters above that. Windows 2000/XP/2003 work in UTF-16 internally, so use of a UTF-8 storage format in the database requires many extra conversions. Although SQL Server 2005 does not store data in UTF-8 format, it supports UTF-8 for handling XML data.

0x0394	UTF-16 is an encoding form that uses 16-bit code units. Unlike the older, fixed-width UCS-2 double-byte representation, it is variable-width: most characters take one code unit and the rest take two. UTF-16 is the historical default encoding form of the Unicode Standard and the native string type for Java, Visual Basic, COM, and Windows NT/2000/XP/2003. The Windows Component Object Model (COM) supports only UTF-16/UCS-2 in its APIs and interfaces. The last two values FFFE₁₆ and FFFF₁₆ and the 32 values from FDD0₁₆ to FDEF₁₆ represent noncharacters. Characters above FFFF₁₆ (the supplementary characters) are represented as surrogate pairs of two 16-bit code units, which extend the character set to over a million characters.

0x00000394	UTF-32 is a fixed-width 32-bit encoding form, like UCS-4 (for 4 bytes). The major advantage of the encoding form is that it expresses every character in a single code unit, so that characters are easy to handle in arrays.
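The byte values in the sample can be checked with Python's built-in codecs. The sketch below is an illustration only: `encoded_forms` and `utf8_length` are hypothetical helper names, and the big-endian codec variants are chosen so that no byte order mark is prepended.

```python
DELTA = "\u0394"  # Greek capital letter Delta


def encoded_forms(ch):
    """Return the big-endian byte sequence of one character in each UTF, without a BOM."""
    return {
        "UTF-8": ch.encode("utf-8"),
        "UTF-16": ch.encode("utf-16-be"),
        "UTF-32": ch.encode("utf-32-be"),
    }


def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point (surrogates excluded)."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for name, raw in encoded_forms(DELTA).items():
        print(name, " ".join(f"0x{b:02X}" for b in raw))
    # A supplementary character needs a surrogate pair in UTF-16.
    pair = "\U00010400".encode("utf-16-be")
    print("U+10400 in UTF-16:", " ".join(f"0x{b:02X}" for b in pair))
```

For Δ this yields CE 94 in UTF-8, 03 94 in UTF-16 and 00 00 03 94 in UTF-32, and `utf8_length` steps from 1 to 2 to 3 to 4 bytes at U+0080, U+0800 and U+10000.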

Byte Order Mark (BOM) and Text Editors

To alert software to the fact that a file contains Unicode characters, the first bytes should contain a byte-order-mark (BOM) to declare that file's encoding format.

Bytes (in Hex)	Encoding Form
EF BB BF	UTF-8
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
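A reader can use the table above to sniff a file's encoding from its first bytes. The following is a minimal sketch, not a complete detector: `sniff_bom` and `BOMS` are hypothetical names, while the `codecs.BOM_*` constants are part of Python's standard library. The UTF-32 marks are tested before the UTF-16 ones because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```python
import codecs

# Longer marks first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (codec name, BOM length) for the BOM at the start of data, or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "Δ".encode("utf-8")
    name, skip = sniff_bom(sample)
    print(name, sample[skip:].decode(name))
```

The check is a heuristic: the bytes FF FE 00 00 could in principle also be UTF-16 little-endian text that starts with a NUL character, and a file without a BOM says nothing about its encoding.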

Microsoft's Notepad and many other text editors recognize this. But some older text editors do not, and treat the BOM as an ordinary Zero Width No-Break Space (ZWNBSP) character.

To display Unicode in Excel, specify the "Arial Unicode MS" or another Unicode capable font installed on your machine.

Encoding specifications need to be at the beginning of files because they control how the rest of the file is handled.

The text editor is supposed to insert the BOM automatically when saving a file with Encoding set to "Unicode".

Internet Explorer has logic to guess at the encoding, but an explicit encoding declaration helps pages load faster and more accurately.

XML parsers expect the encoding specification in the XML declaration on the first line, such as:

<?xml version='1.0' encoding='UTF-8'?>

WGL4 European Code Pages

Microsoft's Windows Glyph List 4 (WGL4) character set

code pages (also called Blocks)

WGL4 is called "Pan-European" because it covers several code pages used in Europe.

C strings in UTF-16

To use UTF-16 in C++ on Windows, declare strings as data type wchar_t ("wide char") instead of char, and use the wcs functions instead of the str functions (for example, wcscat and wcslen instead of strcat and strlen). Note that wchar_t is 16 bits wide on Windows but 32 bits wide on most Unix-like systems.

To create a wide (UTF-16/UCS-2 on Windows) string literal in C code, put an L before it, like so: L"Hello".

Java Unicode Text Divisions and Boundaries

Java

BreakIterator

Instances of BreakIterator are not created with a constructor. Instead, a static factory method returns a BreakIterator object for each type of textual element:

getCharacterInstance()
getWordInstance()
getLineInstance()
getSentenceInstance()

Look up the JavaDoc for "DataInputStream" to read about Java's variation of UTF-8 (called "modified UTF-8").
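Java's variation differs from standard UTF-8 in two ways: the null character U+0000 is written as the two bytes C0 80, and a supplementary character is written as its two UTF-16 surrogates, each encoded separately in three bytes. The Python sketch below illustrates those two rules only; `modified_utf8` is a hypothetical name, and the sketch omits the two-byte length prefix that DataOutputStream.writeUTF places in front of the data.

```python
def modified_utf8(text):
    """Encode text the way Java's modified UTF-8 does (without a length prefix)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes an overlong two-byte form, so no 0x00 byte appears.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each half in 3 bytes.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    for sample in ("\u0394", "\x00", "\U00010400"):
        print(" ".join(f"{b:02X}" for b in modified_utf8(sample)))
```

Characters in the range U+0001 to U+FFFF come out exactly as in standard UTF-8, which is why the difference often goes unnoticed until a file contains a null or a supplementary character.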

Comparing Strings

SQL Unicode Coding

To create database tables in Unicode format, use the multilingual data types nvarchar, nchar and ntext.

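A sample SQL INSERT query, built as a string in ASP.NET (C#). The N prefix marks each literal as Unicode. Concatenating request values like this is open to SQL injection, so production code should pass them as query parameters instead:

string sql = "INSERT INTO SomeMultiLangTable (userfname, userlname, userlangid) " +
	"VALUES (N'" + Request.Form["txtFName"] + "', " +
	"N'" + Request.Form["txtLName"] + "', '" +
	Request.QueryString["lang"] + "')";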

A sample SQL query to retrieve data:

SELECT * FROM SomeMultiLangTable WHERE userfname=N'some Unicode Data'
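The same insert and lookup can be written with bound parameters, which keeps Unicode values intact and avoids building SQL from user input. The sketch below is an assumption-laden stand-in rather than SQL Server code: it uses Python's built-in sqlite3 module with an in-memory database purely so that it runs anywhere, declares the columns as TEXT where SQL Server would use nvarchar, and `insert_and_find` is a hypothetical helper. The table and column names are taken from the samples above.

```python
import sqlite3


def insert_and_find(fname, lname, langid):
    """Insert one row with bound parameters, then look it up by first name."""
    conn = sqlite3.connect(":memory:")  # throwaway database for the demonstration
    try:
        conn.execute(
            "CREATE TABLE SomeMultiLangTable "
            "(userfname TEXT, userlname TEXT, userlangid TEXT)"
        )
        # The ? placeholders carry the values separately from the SQL text.
        conn.execute(
            "INSERT INTO SomeMultiLangTable (userfname, userlname, userlangid) "
            "VALUES (?, ?, ?)",
            (fname, lname, langid),
        )
        rows = conn.execute(
            "SELECT userfname, userlname, userlangid "
            "FROM SomeMultiLangTable WHERE userfname = ?",
            (fname,),
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    print(insert_and_find("\u0394", "O'Brien", "el"))
```

The apostrophe in the family name needs no escaping here, which is the practical benefit of parameters over string concatenation.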

International Features in Microsoft SQL Server 2005

Unicode Control Characters

IE7 enables you to click your way to inserting Unicode control characters.

Resources

Joel on Software's article dated 2003 is my favorite historical introduction to character sets and Unicode. It's folksy and humorous, but still meaty with good examples.
Alan Wood's Unicode Resources: Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
i18nguy.com
A brief introduction to code pages and Unicode
"CJKV Information Processing" by Ken Lunde, O'Reilly & Associates, Inc. 1999, ISBN 1-56592-224-7
Unitype word processor
W3's Unicode in XML and other Markup Languages
W3's Character Model for the World Wide Web
HTMLHelp on the ISO 8859-1 character set
CJK Codes/Unicode Test Java applet at the UofA AI Lab

Books:

Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam
