
Introduction to i18n
Chapter 4 - Coded Character Sets And Encodings in the World

This chapter introduces the major coded character sets and encodings. Note that you don't have to know the details of these character codes if you use LOCALE and wchar_t technology.

However, this knowledge will help you understand why the numbers of bytes, characters, and columns should be counted separately, why strchr() and similar functions should not be used, why you should use LOCALE and wchar_t technology instead of hard-coding the processing of existing character codes, and so on.
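To make the distinction between bytes, characters, and columns concrete, here is a minimal sketch in Python. It is not part of the original text: the helper name measure is hypothetical, and it assumes that East Asian wide and fullwidth characters occupy two terminal columns and that everything else occupies one (combining characters are not handled).

```python
import unicodedata

def measure(text, encoding):
    """Return (bytes, characters, columns) for text in the given encoding."""
    n_bytes = len(text.encode(encoding))
    n_chars = len(text)
    # Assumption: 'W' (wide) and 'F' (fullwidth) characters take two columns,
    # all other characters take one. Combining characters are ignored here.
    n_cols = sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
                 for c in text)
    return n_bytes, n_chars, n_cols

sample = "a\u3042"  # LATIN SMALL LETTER A followed by HIRAGANA LETTER A
print(measure(sample, "euc_jp"))
print(measure(sample, "utf-8"))
```

The three numbers differ from each other and the byte count changes with the encoding, which is why a program cannot derive one from another.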

This variety of character sets and encodings shows how people around the world have struggled to handle their own languages on computers. In particular, CJK people had to work out various technologies to use a large number of characters within ASCII-based computer systems.

If you are planning to develop text-processing software beyond the fields that the LOCALE technology covers, you will have to understand the following descriptions very well. These fields include automatic detection of the encoding used for an input file (most Japanese-capable text viewers such as jless and lv have this mechanism).

4.1 ASCII and ISO 646

ASCII is a CCS and also an encoding at the same time. ASCII is 7bit and contains 94 printable characters which are encoded in the region of 0x21-0x7e.

ISO 646 is the international standard of ASCII. Following 12 characters of

0x23 (number),

0x24 (dollar),

0x40 (at),

0x5b (left square bracket),

0x5c (backslash),

0x5d (right square bracket),

0x5e (caret),

0x60 (backquote),

0x7b (left curly brace),

0x7c (vertical line),

0x7d (right curly brace), and

0x7e (tilde)

are called IRV (International Reference Version) and the other 82 (94 - 12 = 82) characters are called BCT (Basic Code Table). Characters at the IRV positions can differ between countries. Here are a few examples of versions of ISO 646.

US version (ASCII)

UK version (BS 4730): 0x23 is pound currency mark, and so on.

Japanese version (JISX 0201 Roman): 0x5c is yen currency mark, and so on.

Italian version (UNI 0204-70): 0x7b is 'a' with grave accent, and so on.

French version (NF Z 62-010): 0x7b is 'e' with acute accent, and so on.
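The effect of these national versions can be sketched as follows. This is only an illustration: the table ISO646_OVERRIDES and the function decode_iso646 are hypothetical names, and the table contains only the replacements mentioned in the examples here (each real national version replaces more positions than this).

```python
# National replacements taken from the examples in this section only.
# Every other position is assumed to be identical to ASCII, which is a
# simplification of the real national versions.
ISO646_OVERRIDES = {
    "US": {},
    "UK": {0x23: "\u00a3"},  # pound sign
    "JP": {0x5C: "\u00a5"},  # yen sign
    "IT": {0x7B: "\u00e0"},  # a with grave accent
    "FR": {0x7B: "\u00e9"},  # e with acute accent
}

def decode_iso646(data, variant):
    """Decode 7bit bytes according to one (partial) ISO 646 version."""
    table = ISO646_OVERRIDES[variant]
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("ISO 646 is a 7bit code")
        chars.append(table.get(byte, chr(byte)))
    return "".join(chars)

# The same byte string reads differently depending on the version.
for variant in ("US", "UK", "JP", "FR"):
    print(variant, decode_iso646(b"#\\{", variant))
```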

As far as I know, all encodings (besides EBCDIC) in the world are compatible with ISO 646.

Characters in 0x00 - 0x1f, 0x20, and 0x7f are control characters.

Nowadays the use of encodings incompatible with ASCII is discouraged, and thus ISO 646-* (other than the US version) should not be used. One reason is that when a string is converted into Unicode, the converter doesn't know whether the IRV characters should be converted into characters with the same shapes or characters with the same codes. Another reason is that source code is written in ASCII, and source code must be readable anywhere.

4.2 ISO 8859

ISO 8859 is both a series of CCS and a series of encodings. It is an expansion of ASCII using all 8 bits. An additional 96 printable characters encoded in 0xa0 - 0xff are available besides the 94 ASCII printable characters.

There were 10 variants of ISO 8859 as of 1997; the parts published later are also listed below.

ISO-8859-1 Latin alphabet No.1 (1987): characters for western European languages

ISO-8859-2 Latin alphabet No.2 (1987): characters for central European languages

ISO-8859-3 Latin alphabet No.3 (1988)

ISO-8859-4 Latin alphabet No.4 (1988): characters for northern European languages

ISO-8859-5 Latin/Cyrillic alphabet (1988)

ISO-8859-6 Latin/Arabic alphabet (1987)

ISO-8859-7 Latin/Greek alphabet (1987)

ISO-8859-8 Latin/Hebrew alphabet (1988)

ISO-8859-9 Latin alphabet No.5 (1989): same as ISO-8859-1 except for Turkish instead of Icelandic

ISO-8859-10 Latin alphabet No.6 (1993): adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4

ISO-8859-11 Latin/Thai alphabet (2001): same as TIS-620 Thai national standard

ISO-8859-13 Latin alphabet No.7 (1998)

ISO-8859-14 Latin alphabet No.8 (Celtic) (1998)

ISO-8859-15 Latin alphabet No.9 (1999)

ISO-8859-16 Latin alphabet No.10 (2001)

A detailed explanation is found at http://park.kiev.ua/mutliling/ml-docs/iso-8859.html.

4.3 ISO 2022

Using ASCII and ISO 646, we can use 94 characters at most. Using ISO 8859, the number increases to 190 (= 94 + 96). However, we may want to use many more characters. Or, we may want to use several of these character sets, not just one. One of the answers is ISO 2022.

ISO 2022 is an international standard of CES. ISO 2022 specifies a few requirements that a CCS must meet to be a member of ISO 2022-based encodings. It also defines very extensive (and complex) rules to combine these CCS into one encoding. Many encodings such as EUC-*, ISO 2022-*, compound text [7], and so on can be regarded as subsets of ISO 2022. ISO 2022 is so complex that you may not be able to understand all of it. That is OK; what is important here is the concept of ISO 2022: building an encoding by switching between various (ISO 2022-compliant) coded character sets.

The sixth edition of ECMA-35 is fully identical with ISO 2022:1994 and you can find the official document at http://www.ecma.ch/ecma1/stand/ECMA-035.HTM.

ISO 2022 has two versions, 7bit and 8bit. The 8bit version is explained first. The 7bit version is a subset of the 8bit version.

The 8bit code space is divided into four regions,

0x00 - 0x1f: C0 (Control Characters 0),

0x20 - 0x7f: GL (Graphic Characters Left),

0x80 - 0x9f: C1 (Control Characters 1), and

0xa0 - 0xff: GR (Graphic Characters Right).

GL and GR are the spaces where (printable) character sets are mapped.
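The division of the 8bit code space can be written down directly. The following is a small sketch (the function name region is hypothetical) that classifies a single byte into one of the four regions listed above.

```python
def region(byte):
    """Return the ISO 2022 8bit code-space region a byte value belongs to."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value: %r" % (byte,))
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"

for b in (0x1B, 0x41, 0x8E, 0xA4):
    print(hex(b), region(b))
```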

Next, all character sets, for example, ASCII, ISO 646-UK, and JIS X 0208, are classified into following four categories,

(1) character set with 1-byte 94-character,

(2) character set with 1-byte 96-character,

(3) character set with multibyte 94-character, and

(4) character set with multibyte 96-character.

Characters in character sets with 94-character are mapped into 0x21 - 0x7e. Characters in 96-character set are mapped into 0x20 - 0x7f.

For example, ASCII, ISO 646-UK, and JISX 0201 Katakana are classified into (1), JISX 0208 Japanese Kanji, KSX 1001 Korean, GB 2312-80 Chinese are classified into (3), and ISO 8859-* are classified to (2).

The mechanism to map these character sets into GL and GR is a bit complex. There are four buffers, G0, G1, G2, and G3. A character set is designated into one of these buffers and then a buffer is invoked into GL or GR.

Control sequences to 'designate' a character set into a buffer are determined as below.

A sequence to designate a character set with 1-byte 94-character
- into G0 set is: ESC 0x28 F,
- into G1 set is: ESC 0x29 F,
- into G2 set is: ESC 0x2a F, and
- into G3 set is: ESC 0x2b F.

A sequence to designate a character set with 1-byte 96-character
- into G1 set is: ESC 0x2d F,
- into G2 set is: ESC 0x2e F, and
- into G3 set is: ESC 0x2f F.

A sequence to designate a character set with multibyte 94-character
- into G0 set is: ESC 0x24 0x28 F (exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
- into G1 set is: ESC 0x24 0x29 F,
- into G2 set is: ESC 0x24 0x2a F, and
- into G3 set is: ESC 0x24 0x2b F.

A sequence to designate a character set with multibyte 96-character
- into G1 set is: ESC 0x24 0x2d F,
- into G2 set is: ESC 0x24 0x2e F, and
- into G3 set is: ESC 0x24 0x2f F.

where 'F' is determined for each character set:

character set with 1-byte 94-character
- F=0x40 for ISO 646 IRV: 1983
- F=0x41 for BS 4730 (UK)
- F=0x42 for ANSI X3.4-1968 (ASCII)
- F=0x43 for NATS Primary Set for Finland and Sweden
- F=0x49 for JIS X 0201 Katakana
- F=0x4a for JIS X 0201 Roman (Latin)
- and more

character set with 1-byte 96-character
- F=0x41 for ISO 8859-1 Latin-1
- F=0x42 for ISO 8859-2 Latin-2
- F=0x43 for ISO 8859-3 Latin-3
- F=0x44 for ISO 8859-4 Latin-4
- F=0x46 for ISO 8859-7 Latin/Greek
- F=0x47 for ISO 8859-6 Latin/Arabic
- F=0x48 for ISO 8859-8 Latin/Hebrew
- F=0x4c for ISO 8859-5 Latin/Cyrillic
- and more

character set with multibyte 94-character
- F=0x40 for JISX 0208-1978 Japanese
- F=0x41 for GB 2312-80 Chinese
- F=0x42 for JISX 0208-1983 Japanese
- F=0x43 for KSC 5601 Korean
- F=0x44 for JISX 0212-1990 Japanese
- F=0x45 for CCITT Extended GB (ISO-IR-165)
- F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
- F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
- F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
- F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
- F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
- F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
- F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
- and more

The complete list of these coded character sets is found in the International Register of Coded Character Sets.
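The designation sequences above follow a regular pattern, which the following sketch builds from the kind of character set, the target buffer, and the final byte F. The function name designate and its parameters are hypothetical; only the byte values come from the tables above.

```python
ESC = 0x1B

# Intermediate byte for each target buffer, as listed above.
INTERMEDIATE_94 = {0: 0x28, 1: 0x29, 2: 0x2A, 3: 0x2B}
INTERMEDIATE_96 = {1: 0x2D, 2: 0x2E, 3: 0x2F}  # no G0 form is listed for 96-sets

def designate(size, multibyte, buffer, final):
    """Build the sequence that designates a character set into G0..G3.

    size is 94 or 96, multibyte is True or False, buffer is 0..3 and
    final is the F byte of the character set.
    """
    if size == 94:
        table = INTERMEDIATE_94
    elif size == 96:
        table = INTERMEDIATE_96
    else:
        raise ValueError("size must be 94 or 96")
    if buffer not in table:
        raise ValueError("cannot designate a %d-set into G%d" % (size, buffer))
    seq = [ESC]
    if multibyte:
        seq.append(0x24)
        # Exception noted above: the short form 'ESC 0x24 F' is used for
        # multibyte 94-sets designated into G0 with F = 0x40, 0x41, 0x42.
        if size == 94 and buffer == 0 and final in (0x40, 0x41, 0x42):
            seq.append(final)
            return bytes(seq)
    seq.append(table[buffer])
    seq.append(final)
    return bytes(seq)

print(designate(94, False, 0, 0x42))  # ASCII into G0
print(designate(94, True, 0, 0x42))   # JISX 0208-1983 into G0
print(designate(94, True, 1, 0x43))   # KSC 5601 into G1
```

As a cross-check, the output of Python's own iso2022_jp codec for a Japanese character begins with the sequence that designates JISX 0208-1983 into G0.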

Control codes to 'invoke' one of G{0123} into GL or GR are defined as below.

A control code to invoke G0 into GL is: (L)SI ((Locking) Shift In)

A control code to invoke G1 into GL is: (L)SO ((Locking) Shift Out)

A control code to invoke G2 into GL is: LS2 (Locking Shift 2)

A control code to invoke G3 into GL is: LS3 (Locking Shift 3)

A control code to invoke one character in G2 into GL is: SS2 (Single Shift 2)

A control code to invoke one character in G3 into GL is: SS3 (Single Shift 3)

A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)

A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)

A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)

[8]

Note that a code in a character set invoked into GR is or-ed with 0x80.

ISO 2022 also defines announcer codes. For example, 'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already invoked into GL'. This simplifies the coding system. Even this announcer can be omitted if the people who exchange data agree.

7bit version of ISO 2022 is a subset of 8bit version. It does not use C1 and GR.

Explanation on C0 and C1 is omitted here.

4.3.1 EUC (Extended Unix Code)

EUC is a CES which is a subset of 8bit version of ISO 2022 except for the usage of SS2 and SS3 code. Though these codes are used to invoke G2 and G3 into GL in ISO 2022, they are invoked into GR in EUC. EUC-JP, EUC-KR, EUC-CN, and EUC-TW are widely used encodings which use EUC as CES.

EUC is stateless.

EUC can contain 4 CCS by using G0, G1, G2, and G3. Though there is no requirement that ASCII is designated to G0, I don't know any EUC codeset in which ASCII is not designated to G0.

For EUC with G0-ASCII, all codes other than ASCII are encoded in 0x80 - 0xff and this is upward compatible to ASCII.

Expressions for characters in G0, G1, G2, and G3 character sets are described below in binary:

G0: 0???????

G1: 1??????? [1??????? [...]]

G2: SS2 1??????? [1??????? [...]]

G3: SS3 1??????? [1??????? [...]]

where SS2 is 0x8e and SS3 is 0x8f.
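Because EUC is stateless, a byte stream can be split into characters by looking only at the first byte of each character. The sketch below is an illustration, not a complete decoder: the names split_euc and EUC_JP_WIDTHS are hypothetical, and the number of bytes per character for G1, G2, and G3 must be supplied by the caller. The widths given for EUC-JP assume the usual assignment of JISX 0208 to G1 (two bytes), JISX 0201 Kana to G2 (one byte), and JISX 0212 to G3 (two bytes).

```python
SS2, SS3 = 0x8E, 0x8F

def split_euc(data, widths):
    """Split EUC bytes into (set_name, code_bytes) pairs.

    widths maps 'G1', 'G2' and 'G3' to the number of bytes per character.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            name, start, n = "G0", i, 1
        elif b == SS2:
            name, start, n = "G2", i + 1, widths["G2"]
        elif b == SS3:
            name, start, n = "G3", i + 1, widths["G3"]
        else:
            name, start, n = "G1", i, widths["G1"]
        code = data[start:start + n]
        # Every byte of a G1, G2 or G3 character must lie in GR.
        if len(code) != n or (name != "G0" and any(c < 0xA0 for c in code)):
            raise ValueError("malformed EUC sequence at offset %d" % i)
        out.append((name, code))
        i = start + n
    return out

# Assumed widths for EUC-JP (see the lead-in).
EUC_JP_WIDTHS = {"G1": 2, "G2": 1, "G3": 2}

text = "a\u3042\uff71"  # ASCII 'a', HIRAGANA A, HALFWIDTH KATAKANA A
print(split_euc(text.encode("euc_jp"), EUC_JP_WIDTHS))
```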

4.3.2 ISO 2022-compliant Character Sets

There are many national and international standards of coded character sets (CCS). Some of them are ISO 2022-compliant and can be used in ISO 2022 encoding.

ISO 2022-compliant CCS are classified into one of them:

94 characters

96 characters

94x94x94x... characters

The most famous 94 character set is US-ASCII. Also, all ISO 646 variants are ISO 2022-compliant 94 character sets.

All ISO 8859-* character sets are ISO 2022-compliant 96 character sets.

There are many 94x94 character sets. All of them are related to CJK ideograms.

JISX 0208 (aka JIS C 6226): National standard of Japan. The 1978 version contains 6802 characters including Kanji (ideogram), Hiragana, Katakana, Latin, Greek, Cyrillic, numeric, and other symbols. The current (1997) version contains 6879 characters.

JISX 0212: National standard of Japan. 6067 characters (almost all of them are Kanji). This character set is intended to be used in addition to JISX 0208.

JISX 0213: National standard of Japan, released in 2000. It includes the JISX 0208 characters and thousands of additional characters. Thus, it is intended to be an extension and a replacement of JISX 0208. It has two 94x94 character sets; one of them includes JISX 0208 plus about 2000 characters and the other includes about 2400 characters. Strictly speaking, JISX 0213 is not a simple superset of JISX 0208, because a few tens of Kanji variants which are unified and share the same code points in JISX 0208 are dis-unified and have separate code points in JISX 0213. It shares many characters with JISX 0212.

KSX 1001 (aka KSC 5601): National standard of South Korea. 8224 characters including 2350 Hangul, Hanja (ideogram), Hiragana, Katakana, Latin, Greek, Cyrillic, and other symbols. Hanja are ordered by reading, and Hanja with multiple readings are coded multiple times.

KSX 1002: National standard of South Korea. 7659 characters including Hangul and Hanja. Intended to be used in addition to KSX 1001.

KPS 9566: National standard of North Korea. Similar to KSX 1001.

GB 2312: National standard of China. 7445 characters including 6763 Hanzi (ideogram), Latin, Greek, Cyrillic, Hiragana, Katakana, and other symbols.

GB 7589 (aka GB2): National standard of China. 7237 Hanzi. Intended to be used in addition to GB 2312.

GB 7590 (aka GB4): National standard of China. 7039 Hanzi. Intended to be used in addition to GB 2312 and GB 7589.

GB 12345 (aka GB/T 12345, GB1 or GBF): National standard of China. 7583 characters. Traditional-character version corresponding to the GB 2312 simplified characters.

GB 13131 (aka GB3): National standard of China. Traditional-character version corresponding to the GB 7589 simplified characters.

GB 13132 (aka GB5): National standard of China. Traditional-character version corresponding to the GB 7590 simplified characters.

CNS 11643: National standard of Taiwan. It has 7 planes. Planes 1 and 2 include all characters included in Big5. Plane 1 includes 6085 characters including Hanzi (ideogram), Latin, Greek, and other symbols. Plane 2 includes 7650. The number of characters for plane 3 is 6148, plane 4 is 7298, plane 5 is 8603, plane 6 is 6388, and plane 7 is 6539.

There is a 94x94x94 character set, CCCII. It is a national standard of Taiwan. It now includes 73400 characters. (The number is increasing.)

Non-ISO 2022-compliant character sets are introduced later in Other Character Sets and Encodings, Section 4.5.

4.3.3 ISO 2022-compliant Encodings

There are many ISO 2022-compliant encodings which are subsets of ISO 2022.

Compound Text: This is used for X clients to communicate with each other, for example, for copy and paste.

EUC-JP: An EUC encoding with ASCII, JISX 0208, JISX 0201 Kana, and JISX 0212 coded character sets. There are many systems which do not support JISX 0201 Kana and JISX 0212. Widely used in Japan for POSIX systems.

EUC-KR: An EUC encoding with ASCII and KSX 1001.

CN-GB (aka EUC-CN): An EUC encoding with ASCII and GB 2312. The most popular encoding in P. R. China. This encoding is sometimes referred to simply as 'GB'.

EUC-TW: An extended EUC encoding with ASCII, CNS 11643 plane 1, and the other planes (2-7) of CNS 11643.

ISO 2022-JP: Described in RFC 1468.
***** Not written yet *****

ISO 2022-JP-1 (upward compatible to ISO 2022-JP): Described in RFC 2237.
***** Not written yet *****

ISO 2022-JP-2 (upward compatible to ISO 2022-JP-1): Described in RFC 1554.
***** Not written yet *****

ISO 2022-KR: aka Wansung. Described in RFC 1557.
***** Not written yet *****

ISO 2022-CN: Described in RFC 1922.
***** Not written yet *****

Non-ISO 2022-compliant encodings are introduced later in Other Character Sets and Encodings, Section 4.5.

4.4 ISO 10646 and Unicode

ISO 10646 and Unicode are another standard, designed so that we can develop international software easily. The special features of this new standard are:

A united single CCS which intends to include all characters in the world. (ISO 2022 consists of multiple CCS.)

The character set intends to cover all conventional (or legacy) CCS in the world. [9]

Compatibility with ASCII and ISO 8859-1 is considered.

Chinese, Japanese, and Korean ideograms are united. This comes from a limitation of Unicode. This is not a merit.

ISO 10646 is an official international standard. Unicode is developed by the Unicode Consortium. These two are almost identical. Indeed, they are exactly identical at the code points which are available in both standards. Unicode is updated from time to time and the newest version is 3.0.1.

4.4.1 UCS as a Coded Character Set

ISO 10646 defines two CCS (coded character sets), UCS-2 and UCS-4. UCS-2 is a subset of UCS-4.

UCS-4 is a 31bit CCS. These 31 bits are divided into 7, 8, 8, and 8 bits and each part has its own name.

The top 7 bits are called Group.

Next 8 bits are called Plane.

Next 8 bits are Row.

The smallest 8 bits are Cell.

The first plane (Group = 0, Plane = 0) is called BMP (Basic Multilingual Plane) and UCS-2 is the same as BMP. Thus, UCS-2 is a 16bit CCS.
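The division of a UCS-4 value into Group, Plane, Row, and Cell is plain bit manipulation. The sketch below uses hypothetical helper names (split_ucs4, in_bmp) to show it.

```python
def split_ucs4(code_point):
    """Split a 31bit UCS-4 value into (group, plane, row, cell)."""
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("not a 31bit UCS-4 value: %r" % (code_point,))
    group = (code_point >> 24) & 0x7F  # top 7 bits
    plane = (code_point >> 16) & 0xFF  # next 8 bits
    row = (code_point >> 8) & 0xFF     # next 8 bits
    cell = code_point & 0xFF           # lowest 8 bits
    return group, plane, row, cell

def in_bmp(code_point):
    """True if the value lies in the BMP (Group = 0, Plane = 0)."""
    group, plane, _row, _cell = split_ucs4(code_point)
    return group == 0 and plane == 0

print(split_ucs4(0x3042), in_bmp(0x3042))
print(split_ucs4(0x10000), in_bmp(0x10000))
```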

Code points in UCS are often expressed as u+????, where ???? is hexadecimal expression of the code point.

Characters in the range u+0021 - u+007e are the same as ASCII and characters in the range u+00a0 - u+00ff are the same as ISO 8859-1. Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.

Unicode (version 3.0.1) uses a 20bit subset of UCS-4 as a CCS. [10]

The unique feature of these CCS compared with other CCS is their open repertoire. They continue to be developed even after they are released. Characters will be added in the future. However, already coded characters will not be changed. Unicode version 3.0.1 includes 49194 distinct coded characters.

4.4.2 UTF as Character Encoding Schemes

A few CES are used to construct encodings which use UCS as a CCS. They are UTF-7, UTF-8, UTF-16, UTF-16LE, and UTF-16BE. UTF means Unicode (or UCS) Transformation Format. Since these CES always take UCS as the only CCS, they are also names for encodings. [11]

4.4.2.1 UTF-8

UTF-8 is an encoding whose CCS is UCS-4. UTF-8 is designed to be upward-compatible with ASCII. UTF-8 is multibyte and the number of bytes needed to express one character is from 1 to 6.

Conversion from UCS-4 to UTF-8 is performed using a simple conversion rule.

     UCS-4 (binary)                       UTF-8 (binary)
     00000000 00000000 00000000 0???????  0???????
     00000000 00000000 00000??? ????????  110????? 10??????
     00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
     00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
     000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
     0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????

Note that the shortest form must be used, even though a longer form could also express a smaller UCS value.
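The conversion rule in the table above can be sketched as follows. The function name ucs4_to_utf8 is hypothetical. The sketch follows the 31bit table as given here, so it also produces 5- and 6-byte forms; note that later definitions of UTF-8 (RFC 3629) restrict the encoding to at most 4 bytes, so a modern decoder will reject those longer forms.

```python
# One entry per row of the table:
# (largest UCS-4 value in the row, number of bytes, lead-byte marker bits)
_ROWS = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]

def ucs4_to_utf8(cp):
    """Encode one UCS-4 value using the shortest form from the table."""
    for limit, length, marker in _ROWS:
        if 0 <= cp <= limit:
            break  # the first matching row is the shortest form
    else:
        raise ValueError("not a 31bit UCS-4 value: %r" % (cp,))
    if length == 1:
        return bytes([cp])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10?????? carries 6 bits
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])

for cp in (0x41, 0xFC, 0x3042, 0x10000):
    print(hex(cp), ucs4_to_utf8(cp).hex())
```

For values up to u+10ffff (excluding the surrogate range) the result can be compared with Python's built-in UTF-8 codec.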

UTF-8 seems to be one of the major candidates for standard codesets in the future. For example, the Linux console and xterm support UTF-8. The Debian package of locales (version 2.1.97-1) contains the ko_KR.UTF-8 locale. I think the number of UTF-8 locales will increase.

4.4.2.2 UTF-16

UTF-16 is an encoding whose CCS is 20bit Unicode.

Characters in BMP are expressed using the 16bit value of the code point in the Unicode CCS. There are two ways to express a 16bit value in an 8bit stream. Some of you may have heard the word 'endian'. Big endian means that the octets which make up a multi-octet datum are arranged from the most significant octet to the least significant one. Little endian is the opposite. For example, the 16bit value 0x1234 is expressed as 0x12 0x34 in big endian and 0x34 0x12 in little endian.

UTF-16 supports both endians. Thus, the Unicode character u+1234 can be expressed either as 0x12 0x34 or as 0x34 0x12. In exchange, UTF-16 texts have to have a BOM (Byte Order Mark) at their beginning. The Unicode character u+feff zero width no-break space is called BOM when it is used to indicate the byte order or endian of texts. The mechanism is easy: in big endian, u+feff will be 0xfe 0xff while it will be 0xff 0xfe in little endian. Thus you can determine the endian of the text by reading the first two bytes. [12]
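Reading the first two bytes to find the byte order can be sketched like this. The function names are hypothetical, and the sketch simply refuses input without a BOM rather than guessing.

```python
def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None if the data has no BOM."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None

def decode_utf16(data):
    """Decode UTF-16 text that starts with a BOM."""
    order = detect_utf16_byte_order(data)
    if order is None:
        raise ValueError("no BOM found")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[2:].decode(codec)  # skip the BOM itself

print(detect_utf16_byte_order(b"\xfe\xff\x12\x34"))
print(detect_utf16_byte_order(b"\xff\xfe\x34\x12"))
```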

Characters not included in BMP are expressed using a surrogate pair. Code points u+d800 - u+dfff are reserved for this purpose. First, 0x10000 is subtracted from the code point and the resulting 20 bits are divided into two sets of 10 bits. The upper 10 bits are mapped to the 10bit space of u+d800 - u+dbff. The lower 10 bits are mapped to the 10bit space of u+dc00 - u+dfff. Thus UTF-16 can express 20bit Unicode characters.
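The surrogate pair calculation can be sketched as below. The function names are hypothetical. One detail the sketch makes explicit is that 0x10000 is subtracted from the code point before the 20 bits are split, so that the pairs cover u+10000 - u+10ffff.

```python
import struct

def to_surrogate_pair(cp):
    """Return the (high, low) surrogate pair for a code point outside BMP."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-pair range")
    offset = cp - 0x10000               # a 20bit value
    high = 0xD800 | (offset >> 10)      # upper 10 bits -> u+d800 - u+dbff
    low = 0xDC00 | (offset & 0x3FF)     # lower 10 bits -> u+dc00 - u+dfff
    return high, low

def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogate_pair(0x1F600)
print([hex(unit) for unit in pair])
# Serialize the two 16bit units in big endian, as UTF-16BE would.
print(struct.pack(">HH", *pair).hex())
```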

4.4.2.3 UTF-16BE and UTF-16LE

UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to big and little endians, respectively.

4.4.2.4 UTF-7

UTF-7 is designed so that Unicode can be communicated using 7bit communication path.

***** Not written yet *****

4.4.2.5 UCS-2 and UCS-4 as encodings

Though I introduced UCS-2 and UCS-4 as CCS, they can also be encodings.

In the UCS-2 encoding, each UCS-2 character is expressed in two bytes. In the UCS-4 encoding, each UCS-4 character is expressed in four bytes.

4.4.3 Problems on Unicode

No standard is free from politics and compromise. Though the concept of a united single CCS for all characters in the world is very nice, Unicode had to consider compatibility with preceding international and local standards. Moreover, unlike the ideal concept, Unicode people considered efficiency too much. IMHO, the surrogate pair is a mess caused by the lack of 16bit code space. I will introduce a few problems with Unicode.

4.4.3.1 Han Unification

This is the point on which Unicode is criticized most strongly among many Japanese people.

The region 0x4e00 - 0x9fff in UCS-2 is used for East Asian ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja). There are similar characters in these four character sets. (There are two sets of Chinese characters: simplified Chinese used in P. R. China and traditional Chinese used in Taiwan.) To reduce the number of ideograms to be encoded (the region for these characters can contain only 20992 characters, while the Taiwanese CNS 11643 standard alone contains 48711 characters), these similar characters are assumed to be the same. This is Han Unification.

However these characters are not exactly the same. If fonts for these characters are made from Chinese one, Japanese people will regard them wrong characters, though they may be able to read. Unicode people think these united characters are the same character with different glyphs.

An example of Han Unification is available at U+9AA8. This is a Kanji character for 'bone'. U+8FCE is another example, a Kanji character for 'welcome'. The part from the left side to the bottom side is the 'run' radical. The 'run' radical is used in many Kanji and all of them have the same problem. U+76F4 is another example, a Kanji character for 'straight'. I, a native Japanese speaker, cannot recognize the Chinese version at all.

Unicode font vendors will hesitate over which glyphs to choose for these characters: simplified Chinese, traditional Chinese, Japanese, or Korean. One method is to supply four fonts: a simplified Chinese version, a traditional Chinese version, a Japanese version, and a Korean version. A commercial OS vendor can release localized versions of its OS --- for example, the Japanese version of MS Windows can include a Japanese version of the Unicode font (this is exactly what they are doing). However, what should XFree86 or Debian do? I don't know... [13] [14]

4.4.3.2 Cross Mapping Tables

Unicode intends to be a superset of all major encodings in the world, such as ISO-8859-*, EUC-*, KOI8-*, and so on. The aim of this is to keep round-trip compatibility and to enable smooth migration from other encodings to Unicode.

Only providing a superset is not sufficient. Reliable cross mapping tables between Unicode and other encodings are needed. They are provided by Unicode Consortium.

However, tables for East Asian encodings are not provided. They were provided but now are obsolete.

You may want to use these mapping tables even though they are obsolete, because there are no other mapping tables available. However, you will find a severe problem with these tables. There are multiple different mapping tables for Japanese encodings which include the JIS X 0208 character set. Thus, one and the same character in JIS X 0208 will be mapped to different Unicode characters depending on the mapping table. For example, Microsoft and Sun use different tables, with the result that Java on MS Windows sometimes breaks Japanese characters.

Though we Open Source people should respect interoperability, we cannot achieve sufficient interoperability because of this problem. All we can achieve is interoperability between Open Source software.

GNU libc uses JIS/JIS0208.TXT with a small modification. The modification is that

original JIS0208.TXT: 0x815F 0x2140 0x005C # REVERSE SOLIDUS

modified: 0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS

The reason for this modification is that the JIS X 0208 character set is almost always used in combination with ASCII, in the form of EUC-JP and so on. ASCII 0x5c, not JIS X 0208 0x2140, should be mapped to U+005C. This modified table is found at /usr/share/i18n/charmaps/EUC-JP.gz in a Debian system. Of course this mapping table is NOT authorized nor reliable.

I hope the Unicode Consortium will release an authorized, reliable, unique mapping table between Unicode and JIS X 0208. You can read the details of this problem.

4.4.3.3 Combining Characters

Unicode has a way to synthesize an accented character by combining an accent symbol and a base character. For example, combining 'a' and '~' makes 'a' with tilde. Two or more accent symbols can be added to a base character.

Languages such as Thai need combining characters. Combining characters are the only method to express characters in these languages.

However, a few problems arise.

Duplicate Encoding: There are multiple ways to express the same character. For example, u with umlaut can be expressed as u+00fc and also as u+0075 + U+0308. How can we implement 'grep' and so on?

Open Repertoire: Number of expressible characters grows unlimitedly. Non-existing characters can be expressed.
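The duplicate encoding problem can be observed directly. The sketch below also shows one common way for a search tool to cope with it, namely Unicode normalization; this approach is not discussed in the text above and is shown only as an illustration. The helper name same_text is hypothetical.

```python
import unicodedata

precomposed = "\u00fc"   # u with umlaut as a single code point
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS

def same_text(a, b):
    """Compare two strings after normalizing both to the composed form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A naive comparison (what a byte-oriented grep would do) fails.
print(precomposed == decomposed)
# After normalization the two spellings compare equal.
print(same_text(precomposed, decomposed))
```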

4.4.3.4 Surrogate Pair

The first version of Unicode had only a 16bit code space, though 16 bits are obviously insufficient to contain all characters in the world. [15] Thus the surrogate pair was introduced in Unicode 2.0 to expand the number of characters while keeping compatibility with the former 16bit Unicode.

However, surrogate pair breaks the principle that all characters are expressed with the same width of bits. This makes Unicode programming more difficult.

Fortunately, Debian and other UNIX-like systems will use UTF-8 (not UTF-16) as a usual encoding for UCS. Thus, we don't need to handle UTF-16 and surrogate pair very often.

4.4.3.5 ISO 646-* Problem

You will need a codeset converter between your local encodings (for example, ISO 8859-* or ISO 2022-*) and Unicode. For example, the Shift-JIS encoding [16] is built on JISX 0201 Roman (the Japanese version of ISO 646), not ASCII, and JISX 0201 Roman encodes the yen currency mark at 0x5c, where ASCII encodes the backslash.

Then which should your converter convert 0x5c in Shift-JIS into in Unicode, u+005c (backslash) or u+00a5 (yen currency mark)? You may say the yen currency mark is the right solution. However, the backslash (and therefore the yen mark) is widely used as an escape character. For example, 'new line' is expressed as 'backslash - n' in a C string literal, and Japanese people use 'yen currency mark - n'. You may say that program sources must be written in ASCII and that the mistake was trying to convert program source. However, there is a lot of source code and similar text written in the Shift-JIS encoding.

Now Windows has come to support Unicode, and the glyph at u+005c in the Japanese version of Windows is the yen currency mark. As you know, the backslash (yen currency mark in Japan) is vitally important for Windows, because it is used to separate directory names. Fortunately, EUC-JP, which is widely used for UNIX in Japan, includes ASCII, not the Japanese version of ISO 646. So this is not a problem there, because it is clear that 0x5c is the backslash.

Thus local codesets should not use character sets incompatible with ASCII, such as ISO 646-*.

Problems and Solutions for Unicode and User/Vendor Defined Characters discusses on this problem.

4.5 Other Character Sets and Encodings

Besides the ISO 2022-compliant coded character sets and encodings described in ISO 2022-compliant Character Sets, Section 4.3.2 and ISO 2022-compliant Encodings, Section 4.3.3, there are many popular encodings which cannot be classified under an international standard (i.e., they are neither ISO 2022-compliant nor Unicode). Internationalized software should support these encodings (again, you don't need to be aware of encodings if you use LOCALE and wchar_t technology). Some organizations are developing systems which go farther than the limitations of the current international standards, though these systems may not be very widespread so far.

4.5.1 Big5

Big5 is a de-facto standard encoding for Taiwan (1984) and is upward-compatible with ASCII. It is also a CCS.

In Big5, 0x21 - 0x7e means ASCII characters. 0xa1 - 0xfe makes a pair with the following byte (0x40 - 0x7e and 0xa1 - 0xfe) and means an ideogram and so on (13461 characters).
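The byte ranges above are enough to count characters in a Big5 byte string, as the following sketch shows (the function names are hypothetical). It also illustrates why byte-oriented functions such as strchr() are dangerous: the second byte of a two-byte character may fall in the ASCII range 0x40 - 0x7e.

```python
def is_big5_lead(b):
    return 0xA1 <= b <= 0xFE

def is_big5_trail(b):
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

def count_big5_chars(data):
    """Count characters (not bytes) in a Big5 byte string."""
    count = 0
    i = 0
    while i < len(data):
        if is_big5_lead(data[i]):
            if i + 1 >= len(data) or not is_big5_trail(data[i + 1]):
                raise ValueError("broken two-byte character at offset %d" % i)
            i += 2
        else:
            i += 1
        count += 1
    return count

data = "A\u4e2d\u6587".encode("big5")
print(len(data), count_big5_chars(data))

# A byte-wise search for backslash (0x5c) may hit the trail byte of a
# two-byte character; u+529f is often cited as such a character.
tricky = "\u529f".encode("big5")
print(b"\\" in tricky, count_big5_chars(tricky))
```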

Though Taiwan has ISO 2022-compliant new standard CNS 11643, Big5 seems to be more popular than CNS 11643. (CNS 11643 is a CCS and there are a few ISO 2022-derived encodings which include CNS 11643.)

4.5.2 UHC

UHC is an encoding which is upward-compatible with EUC-KR. Two-byte characters (the first byte: 0x81 - 0xfe; the second byte: 0x41 - 0x5a, 0x61 - 0x7a, and 0x81 - 0xfe) include KSX 1001 and other Hangul so that UHC can express all 11172 Hangul.

4.5.3 Johab

Johab is an encoding whose character set is identical to that of UHC, i.e., ASCII, KSX 1001, and all other Hangul characters. Johab means combination in Korean. In Johab, the code point of a Hangul can be calculated from the combination of its Hangul parts (Jamo).

4.5.4 HZ, aka HZ-GB-2312

HZ is an encoding described in RFC 1842. The CCS (coded character sets) of HZ are ASCII and GB2312. It is a 7bit encoding.

Note that HZ is not upward-compatible with ASCII, since '~{' means GB2312 mode, '~}' means ASCII mode, and '~~' means ASCII '~'.
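The mode switching of HZ can be sketched as a small converter to EUC-CN (CN-GB). The function name hz_to_euc_cn is hypothetical, and the sketch handles only the three escape sequences mentioned above; other details of the format, such as line continuation, are left out.

```python
def hz_to_euc_cn(data):
    """Convert HZ bytes to EUC-CN bytes (simplified sketch)."""
    out = bytearray()
    gb_mode = False
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if not gb_mode and pair == b"~{":
            gb_mode = True            # enter GB2312 mode
            i += 2
        elif gb_mode and pair == b"~}":
            gb_mode = False           # back to ASCII mode
            i += 2
        elif not gb_mode and pair == b"~~":
            out.append(0x7E)          # '~~' stands for ASCII '~'
            i += 2
        elif gb_mode:
            if len(pair) != 2:
                raise ValueError("truncated two-byte character")
            # GB2312 codes are 7bit in HZ; EUC-CN sets the high bit of each byte.
            out += bytes(b | 0x80 for b in pair)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"ab~{::WV~}cd~~"
print(hz_to_euc_cn(sample).hex())
```

The result can be compared with Python's own hz codec by decoding both forms to Unicode.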

4.5.5 GBK

GBK is an encoding which is upward-compatible to CN-GB. GBK covers ASCII, GB2312, other Unicode 1.0 ideograms, and a bit more. The range of two-byte characters in GBK is: 0x81 - 0xfe for the first byte and 0x40 - 0x7e and 0x80 - 0xfe for the second byte. 21886 code points out of 23940 in two-byte region are defined.
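The figure of 23940 code points follows directly from the byte ranges. The short sketch below (with a hypothetical function name) counts them by brute force.

```python
def is_gbk_two_byte(lead, trail):
    """True if (lead, trail) lies in the GBK two-byte region given above."""
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok

# 126 possible first bytes times 190 possible second bytes.
total = sum(is_gbk_two_byte(lead, trail)
            for lead in range(256) for trail in range(256))
print(total)
```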

GBK is one of the popular encodings in P. R. China.

4.5.6 GB18030

GB 18030 is an encoding which is upward-compatible with GBK and CN-GB. It is a recent national standard (released on 17 March 2000) of China. It adds four-byte characters to GBK. Their range is: 0x81 - 0xfe for the first byte, 0x30 - 0x39 for the second byte, 0x81 - 0xfe for the third byte, and 0x30 - 0x39 for the fourth byte.

It includes all characters of Unicode 3.0's Unihan Extension A. And more, GB 18030 supplies code space for all used and unused code points of Unicode's plane 0 (BMP) and 16 additional planes.
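A quick calculation shows that the four-byte region is large enough for this. The sketch below uses hypothetical names; it checks the four-byte pattern and compares the size of the region with 17 planes of 65536 code points each. Note that the second byte (0x30 - 0x39) can never be the second byte of a GBK two-byte character, which is how a decoder can tell the two forms apart.

```python
def is_gb18030_four_byte(seq):
    """True if seq matches the four-byte pattern of GB 18030."""
    return (len(seq) == 4
            and 0x81 <= seq[0] <= 0xFE
            and 0x30 <= seq[1] <= 0x39
            and 0x81 <= seq[2] <= 0xFE
            and 0x30 <= seq[3] <= 0x39)

# 126 * 10 * 126 * 10 positions in the four-byte region.
FOUR_BYTE_CODE_SPACE = 126 * 10 * 126 * 10
# Plane 0 (BMP) plus 16 additional planes, 65536 code points each.
UNICODE_CODE_SPACE = 17 * 65536

print(FOUR_BYTE_CODE_SPACE, UNICODE_CODE_SPACE)
print(is_gb18030_four_byte(chr(0x10000).encode("gb18030")))
```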

A detailed explanation on GB18030 is available.

4.5.7 GCCS

GCCS is a standard of coded character set by Hong Kong (HKSAR: Hong Kong Special Administrative Region). It includes 3049 characters. It is an abbreviation of Government Common Character Set. It is defined as an additional character set for Big5. Characters in GCCS are coded in User-Defined Area (just like Private Use Area for UCS) in Big5.

4.5.8 HKSCS

HKSCS is an expansion and amendment of GCCS. It includes 4702 characters. It means Hong Kong Supplementary Character Set.

In addition to a usage in User-Defined Area in Big5, HKSCS defines a usage in Private Use Area in Unicode.

4.5.9 Shift-JIS

Shift-JIS is one of popular encodings in Japan. Its CCS are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208.

JISX 0201 Roman is Japanese version of ISO 646. It defines yen currency mark for 0x5c, where ASCII has backslash. 0xa1 - 0xdf is one-byte character and is JISX 0201 Kana. Two-byte character (the first byte: 0x81 - 0x9f and 0xe0 - 0xef; the second byte: 0x40 - 0x7e and 0x80 - 0xfc) is JISX 0208.
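Using these byte ranges, a program can find character boundaries in Shift-JIS text. The following is a minimal sketch with a hypothetical function name; it returns the byte length of each character and does not decode anything.

```python
def sjis_char_lengths(data):
    """Return the byte length (1 or 2) of each character in Shift-JIS data."""
    lengths = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            # First byte of a two-byte (JISX 0208) character.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                raise ValueError("invalid second byte at offset %d" % (i + 1))
            lengths.append(2)
            i += 2
        else:
            # JISX 0201 Roman (7bit) or JISX 0201 Kana (0xa1 - 0xdf).
            lengths.append(1)
            i += 1
    return lengths

text = "A\uff71\u3042"  # Latin A, HALFWIDTH KATAKANA A, HIRAGANA A
print(sjis_char_lengths(text.encode("shift_jis")))
```

As with Big5, the second byte of a two-byte character can fall in 0x40 - 0x7e, so byte-oriented searching is unsafe here too.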

Japanese version of MS DOS, MS Windows and Macintosh use this encoding, though this encoding is not often used in POSIX systems.

4.5.10 VISCII

The Vietnamese language uses 186 characters (Latin alphabets with accents) and other symbols. This is a bit more than the limit of an ISO 8859-like encoding.

VISCII is a standard for Vietnamese. It is upward-compatible with ASCII. It is 8bit and stateless, like ISO 8859 series. However, it uses code points of not only 0x21 - 0x7e and 0xa0 - 0xff but also 0x02, 0x05, 0x06, 0x14, 0x19, 0x1e, and 0x80 - 0x9f. This makes VISCII not-ISO 2022-compliant.

Vietnam has a new, ISO 2022-compliant character set TCVN 5712 VN2 (aka VSCII). In TCVN 5712 VN2, accented characters are expressed as a combined character. Note that some of accented characters have their own code points.

4.5.11 TRON

TRON is a project to develop a new operating system, founded as a collaboration of industries and academics in Japan since 1984.

The most widespread member of the TRON operating system family is ITRON, a real-time OS for embedded systems. However, our interest is not in ITRON here. TRON defines its own encoding.

TRON's encoding is stateful. Each state is assigned to each language. It has already defined about 130000 characters (January 2000).

4.5.12 Mojikyo

Mojikyo is a project to develop an environment by which a user can use many characters in the world. Mojikyo project has released an application software for MS Windows to display and input about 90000 characters. You can download the software and TrueType, TeX, and CID fonts, though they are not DFSG-free.