
Unicode Character Encoding

This is a concise yet thorough description of Unicode, written in support of the page Software Internationalization and Localization on this site.

 



Unicode

    The Unicode Standard, developed by the international Unicode Consortium, defines hexadecimal code values (prefixed with U+), called code points, to consistently identify the world's characters and symbols. Code points are assigned in ranges called blocks, which are grouped into planes.

    The range U+0000 to U+FFFF, called the Basic Multilingual Plane (BMP), was specified by Unicode 2.0 and holds 65,536 code points covering the most commonly used characters.
    The range U+10000 to U+10FFFF is divided into 16 supplementary planes, only a few of which have so far been used to encode characters.
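
    As a rough illustration of the U+ notation and the plane layout described above, the following Python sketch formats a character's code point and works out which plane it falls in. The function name describe is made up for this example, and the only assumption is that each plane spans 0x10000 (65,536) code points.

```python
def describe(ch: str) -> str:
    """Return the U+ notation and plane of a single character (illustrative helper)."""
    cp = ord(ch)              # the character's Unicode code point as an integer
    plane = cp >> 16          # each plane spans 0x10000 (65,536) code points
    name = "BMP" if plane == 0 else f"supplementary plane {plane}"
    return f"U+{cp:04X} ({name})"


print(describe("Δ"))            # U+0394 (BMP)
print(describe("\U0001F600"))   # U+1F600 (supplementary plane 1)
```

    Shifting right by 16 bits is just integer division by 65,536, so plane 0 is the BMP and planes 1 to 16 cover U+10000 to U+10FFFF.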

    Outside the computer, Unicode text is exchanged in one of the Unicode Transformation Formats (UTF). UTF-8 is the format defined by IETF's RFC 3629. The ISO/IEC 10646 Annex D standard also uses the term "UCS transformation format" for UTF.

    Presented in the sample below, in each of the three UTF formats, is the Greek capital letter Δ (Delta), which is U+0394 in Unicode and also appears in the Greek code page 1253.

Sample          Format
&#916; (Δ)      SGML/HTML entity code. The ASCII and ISO 8859-1 "Latin-1" character sets are not Unicode. Latin-1 is a fixed, single-byte set of 256 characters that does not include Δ, so the letter has to be written as an entity.
0xCE 0x94       UTF-8 data consists of a variable number of 8-bit bytes. UTF-8 uses as many bytes as needed to encode a character:
                1 byte is used for characters at or below 127, so UTF-8 remains a simple, single-byte, ASCII-compatible encoding for them (this does not include the Euro sign €, at 128 in Windows-1252, the pound sign £ at 163, or the copyright sign © at 169).
                2 bytes are used for characters at or below 2047 (hex 0x07FF).
                3 bytes are used for code points up to 65535 (hex 0xFFFF), mostly Asian characters (nicknamed "magic numbers").
                4 bytes are used for code points above 65535, up to U+10FFFF.
0x0394          UTF-16 uses 16-bit code units, like the older UCS-2 double-byte character set (DBCS) representations. Unlike fixed-width UCS-2, UTF-16 is variable width: one code unit for characters in the BMP and two for all others. UTF-16 was described as the default encoding form in earlier versions of the Unicode Standard, and it is the native string type for Java, Visual Basic, COM, and Windows NT/2000/XP. 65,536 values can be addressed in one 16-bit unit. The last two values, U+FFFE and U+FFFF, and the 32 values from U+FDD0 to U+FDEF are noncharacters. Characters outside the BMP are represented as surrogate pairs, which extend the character set to over a million code points.
0x00000394      UTF-32 is a fixed-width 32-bit encoding form, like UCS-4 (4 bytes per character). The major advantage of this encoding form is that it expresses all characters uniformly, so that they are easy to handle in arrays.
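
    The byte values in the sample can be reproduced with a few lines of Python. What follows is a minimal sketch, not a reference implementation: utf8_encode and utf16_units are hypothetical helper names written for this page, following the byte-count ranges and the surrogate-pair scheme described above. The built-in str.encode calls at the end show the same results from Python's own codecs.

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for one code point (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 1 byte: plain ASCII
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list:
    """Return the 16-bit code unit(s) for one code point (illustrative only).

    Does not reject the surrogate range U+D800..U+DFFF; callers are
    assumed to pass valid code points.
    """
    if cp <= 0xFFFF:      # BMP: a single code unit
        return [cp]
    v = cp - 0x10000      # 20 bits, split into two 10-bit halves
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


delta = "Δ"
print(utf8_encode(ord(delta)).hex(" "))      # ce 94
print(delta.encode("utf-8").hex(" "))        # ce 94
print(delta.encode("utf-16-be").hex(" "))    # 03 94
print(delta.encode("utf-32-be").hex(" "))    # 00 00 03 94
print([hex(u) for u in utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

    The "-be" codec names pick big-endian byte order and omit the byte order mark, which keeps the output comparable to the 0x0394 and 0x00000394 values in the sample.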

Byte Order Mark (BOM)

C strings in UTF-16

Resources
