Here major coded character sets and encodings are introduced. Note that you don't have to know the details of these character codes if you use LOCALE and wchar_t technology.
However, this knowledge will help you to understand why the numbers of bytes, characters, and columns should be counted separately, why strchr() and similar functions should not be used, why you should use LOCALE and wchar_t technology instead of hard-coded processing of existing character codes, and so on.
These varieties of character sets and encodings will tell you about the struggles of people around the world to handle their own languages with computers. CJK people especially had no choice but to work out various technologies to use large numbers of characters within ASCII-based computer systems.
If you are planning to develop text-processing software beyond the fields which the LOCALE technology covers, you will have to understand the following descriptions very well. These fields include automatic detection of the encoding used for an input file (most Japanese-capable text viewers such as jless and lv have this mechanism) and so on.
ASCII is a CCS and also an encoding at the same time. ASCII is 7bit and contains 94 printable characters which are encoded in the region of 0x21-0x7e.
ISO 646 is the international standard of ASCII. Twelve of its characters are called IRV (International Reference Version) and the other 82 (94 - 12 = 82) characters are called BCT (Basic Code Table). Characters in the IRV can differ between countries. Here are a few examples of versions of ISO 646.
As far as I know, all encodings (besides EBCDIC) in the world are compatible with ISO 646.
Characters in 0x00 - 0x1f, 0x20, and 0x7f are control characters.
Nowadays the use of encodings incompatible with ASCII is not encouraged and thus ISO 646-* (other than the US version) should not be used. One reason is that when a string is converted into Unicode, the converter doesn't know whether IRV characters should be converted into characters with the same shapes or characters with the same codes. Another reason is that source code is written in ASCII. Source code must be readable anywhere.
ISO 8859 is both a series of CCS and a series of encodings. It is an expansion of ASCII using all 8 bits. Additional 96 printable characters encoded in 0xa0 - 0xff are available besides 94 ASCII printable characters.
There are 10 variants of ISO 8859 (in 1997).
A detailed explanation is found at http://park.kiev.ua/mutliling/ml-docs/iso-8859.html.
Using ASCII and ISO 646, we can use 94 characters at most. Using ISO 8859, the number increases to 190 (= 94 + 96). However, we may want to use many more characters. Or, we may want to use several of these character sets, not just one. One answer is ISO 2022.
ISO 2022 is an international standard of CES. ISO 2022 determines a few requirements for a CCS to be a member of ISO 2022-based encodings. It also defines very extensive (and complex) rules to combine these CCS into one encoding. Many encodings such as EUC-*, ISO 2022-*, compound text, [7] and so on can be regarded as subsets of ISO 2022. ISO 2022 is so complex that you may not be able to understand it. That is OK; what is important here is the concept of ISO 2022 of building an encoding by switching between various (ISO 2022-compliant) coded character sets.
The sixth edition of ECMA-35 is fully identical with ISO 2022:1994 and you can find the official document at http://www.ecma.ch/ecma1/stand/ECMA-035.HTM.
ISO 2022 has two versions, 7bit and 8bit. The 8bit version is explained first. The 7bit version is a subset of the 8bit version.
The 8bit code space is divided into four regions: C0 (0x00 - 0x1f), GL (0x20 - 0x7f), C1 (0x80 - 0x9f), and GR (0xa0 - 0xff).
GL and GR are the spaces where (printable) character sets are mapped.
Next, all character sets, for example, ASCII, ISO 646-UK, and JIS X 0208, are classified into the following four categories: (1) 94-character sets, (2) 96-character sets, (3) multibyte 94-character sets, and (4) multibyte 96-character sets.
Characters in 94-character sets are mapped into 0x21 - 0x7e. Characters in 96-character sets are mapped into 0x20 - 0x7f.
For example, ASCII, ISO 646-UK, and JISX 0201 Katakana are classified into (1), JISX 0208 Japanese Kanji, KSX 1001 Korean, GB 2312-80 Chinese are classified into (3), and ISO 8859-* are classified to (2).
The mechanism to map these character sets into GL and GR is a bit complex. There are four buffers, G0, G1, G2, and G3. A character set is designated into one of these buffers and then a buffer is invoked into GL or GR.
Control sequences to 'designate' a character set into a buffer are determined as below.
where 'F' is determined for each character set:
The complete list of these coded character sets is found at International Register of Coded Character Sets.
Control codes to 'invoke' one of G{0123} into GL or GR are determined as below. [8]
Note that a code in a character set invoked into GR is or-ed with 0x80.
ISO 2022 also determines announcer codes. For example, 'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already invoked into GL'. This simplifies the coding system. Even this announcer can be omitted if the people who exchange data agree.
7bit version of ISO 2022 is a subset of 8bit version. It does not use C1 and GR.
Explanation on C0 and C1 is omitted here.
EUC is a CES which is a subset of 8bit version of ISO 2022 except for the usage of SS2 and SS3 code. Though these codes are used to invoke G2 and G3 into GL in ISO 2022, they are invoked into GR in EUC. EUC-JP, EUC-KR, EUC-CN, and EUC-TW are widely used encodings which use EUC as CES.
EUC is stateless.
EUC can contain 4 CCS by using G0, G1, G2, and G3. Though there is no requirement that ASCII is designated to G0, I don't know any EUC codeset in which ASCII is not designated to G0.
For EUC with G0-ASCII, all codes other than ASCII are encoded in 0x80 - 0xff and this is upward compatible to ASCII.
Expressions for characters in G0, G1, G2, and G3 character sets are described below in binary:
where SS2 is 0x8e and SS3 is 0x8f.
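As a concrete case, EUC-JP designates ASCII to G0, JIS X 0208 to G1, JIS X 0201 Katakana to G2, and JIS X 0212 to G3. The sketch below classifies one encoded character by its first byte. The function name euc_jp_buffer() is hypothetical, and Python's euc_jp codec is used only to produce sample bytes.

```python
SS2 = 0x8E
SS3 = 0x8F

def euc_jp_buffer(encoded_char):
    """Return the buffer (G0..G3) to which one EUC-JP encoded character belongs."""
    first = encoded_char[0]
    if first < 0x80:
        return "G0"  # ASCII, one byte, high bit clear
    if first == SS2:
        return "G2"  # SS2 followed by one byte
    if first == SS3:
        return "G3"  # SS3 followed by two bytes
    return "G1"      # two bytes, both with the high bit set

# 'A', Hiragana A, halfwidth Katakana A, and a Kanji found only in JIS X 0212
samples = ["A", "\u3042", "\uff71", "\u4e02"]
encoded = [ch.encode("euc_jp") for ch in samples]
buffers = [euc_jp_buffer(e) for e in encoded]
lengths = [len(e) for e in encoded]
print(buffers, lengths)
```

Since no escape sequence is involved, the buffer can be decided from the first byte alone. This is what stateless means here.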
There are many national and international standards of coded character sets (CCS). Some of them are ISO 2022-compliant and can be used in ISO 2022 encoding.
ISO 2022-compliant CCS are classified into one of them:
The most famous 94 character set is US-ASCII. Also, all ISO 646 variants are ISO 2022-compliant 94 character sets.
All ISO 8859-* character sets are ISO 2022-compliant 96 character sets.
There are many 94x94 character sets. All of them are related to CJK ideograms.
There is a 94x94x94 character set. This is CCCII. This is national standard of Taiwan. Now 73400 characters are included. (The number is increasing.)
Non-ISO 2022-compliant character sets are introduced later in Other Character Sets and Encodings, Section 4.5.
There are many ISO 2022-compliant encodings which are subsets of ISO 2022.
RFC 1468.
***** Not written yet *****
RFC 2237.
***** Not written yet *****
RFC 1554.
***** Not written yet *****
RFC 1557.
***** Not written yet *****
RFC 1922.
***** Not written yet *****
Non-ISO 2022-compliant encodings are introduced later in Other Character Sets and Encodings, Section 4.5.
ISO 10646 and Unicode are another standard, intended to let us develop international software easily. The special features of this new standard are:
ISO 10646 is an official international standard. Unicode is developed by Unicode Consortium. These two are almost identical. Indeed, these two are exactly identical at code points which are available in both standards. Unicode is sometimes updated and the newest version is 3.0.1.
ISO 10646 defines two CCS (coded character sets), UCS-2 and UCS-4. UCS-2 is a subset of UCS-4.
UCS-4 is a 31bit CCS. These 31 bits are divided into 7, 8, 8, and 8 bits, and each part has its own term: group, plane, row, and cell.
The first plane (Group = 0, Plane = 0) is called the BMP (Basic Multilingual Plane) and UCS-2 is the same as the BMP. Thus, UCS-2 is a 16bit CCS.
Code points in UCS are often expressed as u+????, where ???? is hexadecimal expression of the code point.
Characters in the range of u+0021 - u+007e are the same as ASCII and characters in the range of u+00a0 - u+00ff are the same as ISO 8859-1. Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.
Unicode (version 3.0.1) uses a 20bit subset of UCS-4 as a CCS. [10]
The unique feature of these CCS compared with other CCS is their open repertoire. They continue to be developed even after they are released. Characters will be added in the future. However, characters which are already coded will not be changed. Unicode version 3.0.1 includes 49194 distinct coded characters.
A few CES are used to construct encodings which use UCS as a CCS. They are UTF-7, UTF-8, UTF-16, UTF-16LE, and UTF-16BE. UTF means Unicode (or UCS) Transformation Format. Since these CES always take UCS as the only CCS, they are also names for encodings. [11]
UTF-8 is an encoding whose CCS is UCS-4. UTF-8 is designed to be upward-compatible to ASCII. UTF-8 is multibyte and number of bytes needed to express one character is from 1 to 6.
Conversion from UCS-4 to UTF-8 is performed using a simple conversion rule.
    UCS-4 (binary)                       UTF-8 (binary)
    00000000 00000000 00000000 0???????  0???????
    00000000 00000000 00000??? ????????  110????? 10??????
    00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
    00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
    000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
    0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????
Note that the shortest representation must be used, though a longer representation could also express smaller UCS values.
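The conversion rule can be written down directly. The following is a minimal sketch, assuming the 1 to 6 byte definition given above; ucs4_to_utf8() is a hypothetical name. Note that it does not reject surrogate code points, and that the later definition of UTF-8 in RFC 3629 allows at most 4 bytes.

```python
def ucs4_to_utf8(code_point):
    """Encode one UCS-4 value (0 - 0x7fffffff) in its shortest UTF-8 form."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("not a 31bit UCS-4 value")
    if code_point < 0x80:
        return bytes([code_point])
    # (number of trailing bytes, exclusive upper limit, marker of leading byte)
    forms = (
        (1, 0x800, 0xC0),
        (2, 0x10000, 0xE0),
        (3, 0x200000, 0xF0),
        (4, 0x4000000, 0xF8),
        (5, 0x80000000, 0xFC),
    )
    for n_trailing, limit, lead in forms:
        if code_point < limit:
            break  # the first form that fits is the shortest one
    out = []
    for _ in range(n_trailing):
        out.append(0x80 | (code_point & 0x3F))  # 10?????? takes 6 bits
        code_point >>= 6
    out.append(lead | code_point)               # the remaining high bits
    return bytes(reversed(out))

print(ucs4_to_utf8(0x20AC).hex())
```

Choosing the first form that fits is what guarantees the shortest representation.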
UTF-8 seems to be one of the major candidates for standard codesets in the future. For example, the Linux console and xterm support UTF-8. The Debian package of locales (version 2.1.97-1) contains the ko_KR.UTF-8 locale. I think the number of UTF-8 locales will increase.
UTF-16 is an encoding whose CCS is 20bit Unicode.
Characters in the BMP are expressed using the 16bit value of the code point in the Unicode CCS. There are two ways to express a 16bit value in an 8bit stream. Some of you may have heard the word endian. Big endian means an arrangement of the octets of a multi-octet datum from the most significant octet to the least significant one. Little endian is the opposite. For example, the 16bit value 0x1234 is expressed as 0x12 0x34 in big endian and 0x34 0x12 in little endian.
UTF-16 supports both endians. Thus, the Unicode character u+1234 can be expressed either as 0x12 0x34 or as 0x34 0x12. In exchange, UTF-16 texts have to have a BOM (Byte Order Mark) at the beginning. The Unicode character u+feff zero width no-break space is called BOM when it is used to indicate the byte order or endian of texts. The mechanism is easy: in big endian, u+feff will be 0xfe 0xff while it will be 0xff 0xfe in little endian. Thus you can tell the endian of the text by reading the first two bytes. [12]
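Reading the first two bytes can be sketched as follows. The function name utf16_byte_order() is hypothetical; codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are constants from Python's standard library.

```python
import codecs

def utf16_byte_order(data):
    """Return 'big', 'little', or None (no BOM found) for UTF-16 text."""
    if data.startswith(codecs.BOM_UTF16_BE):  # 0xfe 0xff
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # 0xff 0xfe
        return "little"
    return None

# u+1234 in both byte orders, each preceded by the BOM
big = codecs.BOM_UTF16_BE + "\u1234".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "\u1234".encode("utf-16-le")
print(big.hex(), utf16_byte_order(big))
print(little.hex(), utf16_byte_order(little))
```

What to do when no BOM is found is a policy decision which this sketch leaves to the caller.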
Characters not included in the BMP are expressed using surrogate pairs. Code points u+d800 - u+dfff are reserved for this purpose. First, 0x10000 is subtracted from the code point, and the resulting 20 bits are divided into two sets of 10 bits. The more significant 10 bits are mapped to the 10bit space of u+d800 - u+dbff. The less significant 10 bits are mapped to the 10bit space of u+dc00 - u+dfff. Thus UTF-16 can express 20bit Unicode characters.
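The arithmetic of a surrogate pair is short enough to show in full. This is a sketch with hypothetical function names; it assumes that 0x10000 is subtracted from the code point before the 20 bits are split.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # a 20bit value
    high = 0xD800 + (offset >> 10)      # more significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # less significant 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Rebuild the code point from a (high, low) surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```

A program which counts 16bit units instead of characters will count such a character twice, which is one reason why surrogate pairs make programming more difficult.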
UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to big and little endians, respectively.
UTF-7 is designed so that Unicode can be communicated using 7bit communication path.
***** Not written yet *****
Though I introduced UCS-2 and UCS-4 as CCS, they can also be encodings.
In the UCS-2 encoding, each UCS-2 character is expressed in two bytes. In the UCS-4 encoding, each UCS-4 character is expressed in four bytes.
No standard is free from politics and compromise. Though the concept of a single united CCS for all characters in the world is very nice, Unicode had to consider compatibility with preceding international and local standards. Moreover, unlike the ideal concept, Unicode people gave too much weight to efficiency. IMHO, the surrogate pair is a mess caused by the lack of 16bit code space. I will introduce a few problems with Unicode.
This is the point on which Unicode is criticized most strongly among many Japanese people.
A region of 0x4e00 - 0x9fff in UCS-2 is used for Eastern-Asian ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja). There are similar characters in these four character sets. (There are two sets of Chinese characters, simplified Chinese used in P. R. China and traditional Chinese used in Taiwan). To reduce the number of these ideograms to be encoded (the region for these characters can contain only 20992 characters while only Taiwan CNS 11643 standard contains 48711 characters), these similar characters are assumed to be the same. This is Han Unification.
However, these characters are not exactly the same. If fonts for these characters are made from Chinese ones, Japanese people will regard them as wrong characters, though they may be able to read them. Unicode people think these united characters are the same character with different glyphs.
An example of Han Unification is available at U+9AA8. This is a Kanji character for 'bone'. U+8FCE is another example, a Kanji character for 'welcome'. The part from the left side to the bottom side is the 'run' radical. The 'run' radical is used in many Kanji and all of them have the same problem. U+76F4 is another example, a Kanji character for 'straight'. I, a native Japanese speaker, cannot recognize the Chinese version at all.
Unicode font vendors will hesitate over which glyphs to choose for these characters: the simplified Chinese, traditional Chinese, Japanese, or Korean ones. One method is to supply four fonts, one each for the simplified Chinese, traditional Chinese, Japanese, and Korean versions. A commercial OS vendor can release localized versions of its OS --- for example, the Japanese version of MS Windows can include a Japanese version of a Unicode font (this is exactly what they are doing). However, what should XFree86 or Debian do? I don't know... [13] [14]
Unicode intends to be a superset of all major encodings in the world, such as ISO-8859-*, EUC-*, KOI8-*, and so on. The aim of this is to keep round-trip compatibility and to enable smooth migration from other encodings to Unicode.
Only providing a superset is not sufficient. Reliable cross mapping tables between Unicode and other encodings are needed. They are provided by Unicode Consortium.
However, tables for East Asian encodings are not provided. They were provided but now are obsolete.
You may want to use these mapping tables even though they are obsolete, because there are no other mapping tables available. However, you will find a severe problem with these tables. There are multiple different mapping tables for Japanese encodings which include the JIS X 0208 character set. Thus, the same character in JIS X 0208 will be mapped into different Unicode characters depending on the mapping table. For example, Microsoft and Sun use different tables, with the result that Java on MS Windows sometimes corrupts Japanese characters.
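The effect of two different tables can be reproduced easily. The sketch below uses Python's shift_jis and cp932 codecs as stand-ins for two vendors' tables (cp932 follows Microsoft's table). Which table other software uses is not something this example can tell you.

```python
raw = b"\x81\x60"  # JIS X 0208 0x2141 (wave dash) in Shift-JIS form

via_shift_jis = raw.decode("shift_jis")
via_cp932 = raw.decode("cp932")
print(hex(ord(via_shift_jis)), hex(ord(via_cp932)))

# Text decoded with one table may not be encodable with the other one.
try:
    via_cp932.encode("shift_jis")
    round_trip_ok = True
except UnicodeEncodeError:
    round_trip_ok = False
print(round_trip_ok)
```

The same two bytes become two different Unicode characters, so a string which passes through both converters can be broken.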
Though we Open Source people should respect interoperability, we cannot achieve sufficient interoperability because of this problem. All we can achieve is interoperability between Open Source software.
GNU libc uses JIS/JIS0208.TXT with a small modification. The modification is that JIS X 0208 0x2140 is not mapped into U+005C.
The reason for this modification is that the JIS X 0208 character set is almost always used in combination with ASCII in the form of EUC-JP and so on. ASCII 0x5c, not JIS X 0208 0x2140, should be mapped into U+005C. This modified table is found at /usr/share/i18n/charmaps/EUC-JP.gz in a Debian system. Of course this mapping table is NOT authorized nor reliable.
I hope Unicode Consortium will release an authorized, reliable, and unique mapping table between Unicode and JIS X 0208. You can read the detail of this problem.
Unicode has a way to synthesize an accented character by combining an accent symbol and a base character. For example, combining 'a' and '~' makes 'a' with tilde. Two or more accent symbols can be added to a base character.
Languages such as Thai need combining characters. Combining characters are the only method to express characters in these languages.
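A combined sequence and a precomposed character can be compared with Python's unicodedata module. This is only a sketch; the use of normalization form NFC is my choice for the example and is not discussed in this document.

```python
import unicodedata

combined = "a\u0303"     # 'a' followed by combining tilde
precomposed = "\u00e3"   # 'a' with tilde as a single code point

print(len(combined), len(precomposed))
print(combined == precomposed)  # different code point sequences
print(unicodedata.normalize("NFC", combined) == precomposed)
```

The two strings look the same on the display but differ in length and do not compare equal until they are normalized.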
However, a few problems arise.
The first version of Unicode had only a 16bit code space, though 16 bits are obviously insufficient to contain all characters in the world. [15] Thus the surrogate pair was introduced in Unicode 2.0 to expand the number of characters while keeping compatibility with the former 16bit Unicode.
However, surrogate pair breaks the principle that all characters are expressed with the same width of bits. This makes Unicode programming more difficult.
Fortunately, Debian and other UNIX-like systems will use UTF-8 (not UTF-16) as a usual encoding for UCS. Thus, we don't need to handle UTF-16 and surrogate pair very often.
You will need a codeset converter between your local encodings (for example, ISO 8859-* or ISO 2022-*) and Unicode. For example, the Shift-JIS encoding [16] consists of JISX 0201 Roman (the Japanese version of ISO 646), not ASCII, which encodes the yen currency mark at 0x5c where backslash is encoded in ASCII.
Then which should your converter convert 0x5c in Shift-JIS into in Unicode, u+005c (backslash) or u+00a5 (yen currency mark)? You may say the yen currency mark is the right solution. However, backslash (and thus the yen mark) is widely used as an escape character. For example, 'new line' is expressed as 'backslash - n' in a C string literal and Japanese people use 'yen currency mark - n'. You may say that program sources must be written in ASCII and that the mistake is trying to convert program source at all. However, there is a lot of source code and other text written in the Shift-JIS encoding.
Now Windows has come to support Unicode and the glyph at u+005c in the Japanese version of Windows is the yen currency mark. As you know, backslash (yen currency mark in Japan) is vitally important for Windows, because it is used to separate directory names. Fortunately, EUC-JP, which is widely used for UNIX in Japan, includes ASCII, not the Japanese version of ISO 646. So this is not a problem there, because it is clear that 0x5c is backslash.
Thus no local codeset should use character sets incompatible with ASCII, such as ISO 646-*.
Problems and Solutions for Unicode and User/Vendor Defined Characters discusses this problem.
Besides the ISO 2022-compliant coded character sets and encodings described in ISO 2022-compliant Character Sets, Section 4.3.2 and ISO 2022-compliant Encodings, Section 4.3.3, there are many popular encodings which cannot be classified under an international standard (i.e., neither ISO 2022-compliant nor Unicode). Internationalized software should support these encodings (again, you don't need to be aware of encodings if you use LOCALE and wchar_t technology). Some organizations are developing systems which go farther than the limitations of the current international standards, though these systems may not be widely used so far.
Big5 is a de-facto standard encoding for Taiwan (1984) and is upward-compatible with ASCII. It is also a CCS.
In Big5, 0x21 - 0x7e means ASCII characters. 0xa1 - 0xfe makes a pair with the following byte (0x40 - 0x7e and 0xa1 - 0xfe) and means an ideogram and so on (13461 characters).
Though Taiwan has ISO 2022-compliant new standard CNS 11643, Big5 seems to be more popular than CNS 11643. (CNS 11643 is a CCS and there are a few ISO 2022-derived encodings which include CNS 11643.)
UHC is an encoding which is upward-compatible with EUC-KR. Two-byte characters (the first byte: 0x81 - 0xfe; the second byte: 0x41 - 0x5a, 0x61 - 0x7a, and 0x81 - 0xfe) include KSX 1001 and other Hangul so that UHC can express all 11172 Hangul.
Johab is an encoding whose character set is identical with UHC, i.e., ASCII, KSX 1001, and all other Hangul character. Johab means combination in Korean. In Johab, code point of a Hangul can be calculated from combination of Hangul parts (Jamo).
HZ is an encoding described in RFC 1842. The CCS (coded character sets) of HZ are ASCII and GB2312. This is a 7bit encoding.
Note that HZ is not upward-compatible with ASCII, since '~{' means GB2312 mode, '~}' means ASCII mode, and '~~' means ASCII '~'.
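The mode switching can be sketched with Python's hz codec, which is an assumption of this example. In GB2312 mode each character is written as its two GB2312 bytes with the high bit cleared.

```python
text = "ASCII 中文"
encoded = text.encode("hz")

# The two-byte GB2312 codes with the high bit cleared, as used inside '~{' ... '~}'.
gb_bytes = "中文".encode("gb2312")
seven_bit = bytes(byte & 0x7F for byte in gb_bytes)

print(encoded)
print(seven_bit)
print(b"~~".decode("hz"))  # '~~' stands for a single ASCII '~'
```

Since '~' has a special meaning, an ASCII text which contains '~' is not valid HZ as it is, which is why HZ is not upward-compatible with ASCII.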
GBK is an encoding which is upward-compatible to CN-GB. GBK covers ASCII, GB2312, other Unicode 1.0 ideograms, and a bit more. The range of two-byte characters in GBK is: 0x81 - 0xfe for the first byte and 0x40 - 0x7e and 0x80 - 0xfe for the second byte. 21886 code points out of 23940 in two-byte region are defined.
GBK is one of the popular encodings in P. R. China.
GB 18030 is an encoding which is upward-compatible with GBK and CN-GB. It is a recent national standard (released on 17 March 2000) of China. It adds four-byte characters to GBK. Its range is: 0x81 - 0xfe for the first byte, 0x30 - 0x39 for the second byte, 0x81 - 0xfe for the third byte, and 0x30 - 0x39 for the fourth byte.
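The byte ranges of a four-byte character can be checked like this. The function name is hypothetical, and Python's gb18030 and gbk codecs are used only to produce sample bytes.

```python
def is_gb18030_four_byte(seq):
    """Check one encoded character against the four-byte ranges of GB 18030."""
    return (
        len(seq) == 4
        and 0x81 <= seq[0] <= 0xFE
        and 0x30 <= seq[1] <= 0x39
        and 0x81 <= seq[2] <= 0xFE
        and 0x30 <= seq[3] <= 0x39
    )

outside_bmp = "\U00010000".encode("gb18030")  # needs the four-byte form
in_gbk = "中".encode("gb18030")               # stays a two-byte GBK code
print(outside_bmp.hex(), is_gb18030_four_byte(outside_bmp))
print(in_gbk.hex(), in_gbk == "中".encode("gbk"))
```

The second byte of a four-byte character is an ASCII digit, which a GBK second byte never is, so a decoder can tell the two-byte and four-byte forms apart.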
It includes all characters of Unicode 3.0's Unihan Extension A. Moreover, GB 18030 supplies code space for all used and unused code points of Unicode's plane 0 (BMP) and the 16 additional planes.
A detailed explanation on GB18030 is available.
GCCS is a standard of coded character set by Hong Kong (HKSAR: Hong Kong Special Administrative Region). It includes 3049 characters. It is an abbreviation of Government Common Character Set. It is defined as an additional character set for Big5. Characters in GCCS are coded in User-Defined Area (just like Private Use Area for UCS) in Big5.
HKSCS is an expansion and amendment of GCCS. It includes 4702 characters. It means Hong Kong Supplementary Character Set.
In addition to a usage in User-Defined Area in Big5, HKSCS defines a usage in Private Use Area in Unicode.
Shift-JIS is one of popular encodings in Japan. Its CCS are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208.
JISX 0201 Roman is Japanese version of ISO 646. It defines yen currency mark for 0x5c, where ASCII has backslash. 0xa1 - 0xdf is one-byte character and is JISX 0201 Kana. Two-byte character (the first byte: 0x81 - 0x9f and 0xe0 - 0xef; the second byte: 0x40 - 0x7e and 0x80 - 0xfc) is JISX 0208.
Japanese version of MS DOS, MS Windows and Macintosh use this encoding, though this encoding is not often used in POSIX systems.
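The byte ranges above are enough to split Shift-JIS text into characters, and they also show why searching for a byte is dangerous. This is a sketch with a hypothetical function name; it does no validation of the second byte.

```python
def split_shift_jis(data):
    """Split Shift-JIS bytes into one byte string per character."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xEF:
            width = 2  # first byte of a JISX 0208 character
        else:
            width = 1  # JISX 0201 Roman or JISX 0201 Kana
        chars.append(data[i:i + width])
        i += width
    return chars

# A Kanji whose second byte is 0x5c, then 'A', then a halfwidth Katakana
data = "表A\uff71".encode("shift_jis")
chars = split_shift_jis(data)

print(len(data), len(chars))
print(0x5C in data)                # a byte search finds 0x5c ...
print(any(c == b"\x5c" for c in chars))  # ... but no character is 0x5c
```

A function like strchr() which looks for the byte 0x5c would match the second half of the Kanji, though the text contains no yen mark or backslash.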
Vietnamese language uses 186 characters (Latin alphabets with accents) and other symbols. It is a bit more than the limit of ISO 8859-like encoding.
VISCII is a standard for Vietnamese. It is upward-compatible with ASCII. It is 8bit and stateless, like ISO 8859 series. However, it uses code points of not only 0x21 - 0x7e and 0xa0 - 0xff but also 0x02, 0x05, 0x06, 0x14, 0x19, 0x1e, and 0x80 - 0x9f. This makes VISCII not-ISO 2022-compliant.
Vietnam has a new, ISO 2022-compliant character set TCVN 5712 VN2 (aka VSCII). In TCVN 5712 VN2, accented characters are expressed as a combined character. Note that some of accented characters have their own code points.
TRON is a project to develop a new operating system, founded as a collaboration of industries and academics in Japan since 1984.
The most diffused version of TRON operating system families is ITRON, a real-time OS for embedded systems. However, our interest is not on ITRON now. TRON determines a TRON encoding.
TRON's encoding is stateful. Each state is assigned to each language. It has already defined about 130000 characters (January 2000).
Mojikyo is a project to develop an environment by which a user can use many characters in the world. Mojikyo project has released an application software for MS Windows to display and input about 90000 characters. You can download the software and TrueType, TeX, and CID fonts, though they are not DFSG-free.
Introduction to i18n
14 February 2003 kubota@debian.org