| Sample | Format |
|---|---|
| SGML/HTML Entity Code | The ISO 8859-1 ("Latin-1") character set, which extends ASCII, is not Unicode. It is a fixed-width, single-byte set of 256 characters. |
| 0xCE 0x94 | UTF-8 data consists of a variable number of 8-bit bytes. UTF-8 uses as many bytes as needed to encode a character. It remains a simple, single-byte, ASCII-compatible encoding method for characters at or below 127 (which does not include the Euro sign € at 8364, the Pound sign £ at 163, or the copyright sign © at 169). Two bytes are used for characters at or below 2047 (hex 0x07FF). Three bytes are used for code points up to 65535 (hex 0xFFFF), mostly Asian characters (nicknamed "magic numbers"), and four bytes for code points above that. |
| 0x0394 | UTF-16 is a variable-width encoding form that uses 16-bit code units; the older UCS-2 double-byte character set (DBCS) representation is its fixed-width predecessor. UTF-16 has historically been treated as the default encoding form of the Unicode Standard and is the native string type for Java, Visual Basic, COM, and Windows NT/2000/XP. 65,536 values can be addressed with a single 16-bit unit. The last two values, U+FFFE and U+FFFF, and the 32 values from U+FDD0 to U+FDEF represent noncharacters. Characters beyond U+FFFF are represented as surrogate pairs, which extend the character set to over a million characters. |
| 0x00000394 | UTF-32 is a fixed-width 32-bit encoding form, like UCS-4 (for 4 bytes). The major advantage of the encoding form is that it uniformly expresses all characters, so that they are easy to handle in arrays. |
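
The three encoding forms can be compared on U+0394 (Greek capital letter Delta), the character used in the Sample column. The following is a minimal sketch using only Python's built-in codecs; the helper names `utf8_length` and `surrogate_pair` are illustrative and do not come from the original text.

```python
# U+0394 (GREEK CAPITAL LETTER DELTA) is the sample character in the table.
delta = "\u0394"

utf8_bytes = delta.encode("utf-8")       # variable width, 8-bit units
utf16_bytes = delta.encode("utf-16-be")  # 16-bit units, big-endian, no BOM
utf32_bytes = delta.encode("utf-32-be")  # 32-bit units, big-endian, no BOM

print(utf8_bytes.hex(" "))   # ce 94
print(utf16_bytes.hex(" "))  # 03 94
print(utf32_bytes.hex(" "))  # 00 00 03 94


def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# The UTF-8 length changes at 0x80, 0x800, and 0x10000.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

high, low = surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # D83D DE00
```

The pair printed by `surrogate_pair` matches what `"\U0001F600".encode("utf-16-be")` produces, which is a quick way to check the arithmetic.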
Microsoft's Notepad and many other text editors recognize a file as being encoded in UTF-8 if the first bytes of the file contain a byte-order mark (BOM) declaring that file's encoding format.
| Bytes (in Hex) | Encoding Form |
|---|---|
| EF BB BF | UTF-8 |
| 00 00 FE FF | UTF-32, big-endian |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF | UTF-16, big-endian |
| FF FE | UTF-16, little-endian |
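
A reader can apply the table by comparing the first bytes of a file against each mark. The sketch below is one possible way to do that in Python; `detect_bom` and the `BOMS` list are hypothetical names, while the `codecs.BOM_*` constants are part of the standard library. The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks because FF FE 00 00 begins with FF FE.

```python
import codecs

# Longer UTF-32 marks come before the UTF-16 marks they share a prefix with.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                      # EF BB BF
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),     # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),  # FF FE 00 00
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),     # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),  # FF FE
]


def detect_bom(data):
    """Return the encoding form named by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfHello"))   # UTF-8
print(detect_bom(b"\xff\xfeH\x00i\x00"))  # UTF-16, little-endian
print(detect_bom(b"Hello"))               # None
```

This check is only a heuristic: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32, and a file with no BOM yields no answer at all.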
Some applications need the encoding specification at the beginning of a file because it controls how the rest of the file is handled.
Others treat the BOM as a Zero Width No-Break Space (ZWNBSP) and ignore it.
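The difference between keeping and discarding the BOM can be seen with Python's two UTF-8 codecs. This is a small illustration, assuming the data begins with the UTF-8 mark EF BB BF; the variable names are arbitrary.

```python
data = b"\xef\xbb\xbfHello"

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
kept = data.decode("utf-8")
print(repr(kept))      # '\ufeffHello'

# The "utf-8-sig" codec strips a leading BOM when decoding...
stripped = data.decode("utf-8-sig")
print(repr(stripped))  # 'Hello'

# ...and writes one when encoding.
print("Hello".encode("utf-8-sig") == data)  # True
```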
Internet Explorer has logic to guess at the encoding, but this code helps pages load faster and more accurately:
The BOM is automatically inserted when a file is saved using Save As with Encoding set to "Unicode".
XML parsers look for an encoding declaration on the first line (it is required unless the document is encoded in UTF-8 or UTF-16), such as:
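One way to see an encoding declaration at work is to hand the same bytes to a parser with and without it. The sketch below assumes Python's standard `xml.etree.ElementTree` parser and an ISO-8859-1 document chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# 0xE9 is "é" in ISO-8859-1 but is not a valid UTF-8 sequence on its own.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
undeclared = b"<name>Caf\xe9</name>"

root = ET.fromstring(declared)
print(root.text)  # Café

# Without a declaration the parser assumes UTF-8 and rejects the byte.
try:
    ET.fromstring(undeclared)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)  # False
```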
WGL4 is called "Pan-European" because it covers several codepages used in Europe.
To use UTF-16 in C++ on Windows, where wchar_t is 16 bits wide, declare strings as wchar_t ("wide char") instead of char, and use the wcs functions instead of the str functions (for example, wcscat and wcslen instead of strcat and strlen).
To create a literal UCS-2 string in C code, put an L before it, like so: L"Hello".
Instances of Java's BreakIterator are not created with a constructor, but with a static factory method that returns a BreakIterator object for each type of textual element: getCharacterInstance, getWordInstance, getSentenceInstance, and getLineInstance.
Look up the JavaDoc for "DataInputStream" to read about Java's variation of UTF-8, known as modified UTF-8.
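Java's variation differs from standard UTF-8 in two ways: U+0000 is written as the two bytes C0 80, and characters above U+FFFF are written as a surrogate pair with each surrogate encoded in three bytes. The sketch below imitates those two rules in Python for comparison; `modified_utf8` is a hypothetical helper, and it leaves out the two-byte length prefix that Java's writeUTF places before the data.

```python
def modified_utf8(text):
    """Encode text following the two rules of Java's modified UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is stored as an overlong two-byte form, never as 0x00.
            out += b"\xc0\x80"
        elif cp <= 0xFFFF:
            # Same bytes as standard UTF-8 for the rest of the BMP.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into UTF-16 surrogates, then encode each one separately.
            units = ch.encode("utf-16-be")
            for i in (0, 2):
                surrogate = chr(int.from_bytes(units[i:i + 2], "big"))
                out += surrogate.encode("utf-8", "surrogatepass")
    return bytes(out)


print(modified_utf8("\u0394").hex(" "))      # ce 94 (same as UTF-8)
print(modified_utf8("\x00").hex(" "))        # c0 80
print(modified_utf8("\U00010400").hex(" "))  # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" ")) # f0 90 90 80
```

The last two lines show why the formats are not interchangeable: the same supplementary character takes six bytes in the modified form and four in standard UTF-8.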
Unicode Generator (0 through 65000)
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam