Several terms relate to character codes, such as character set, character code, charset, encoding, and codeset. These words are explained later.
I recommend not implementing Unicode and UTF-8 support directly. Instead, use locale technology, and your software will support not only UTF-8 but also many other encodings used in the world. If you implement UTF-8 directly, your software can handle only UTF-8. Such software is not convenient.
Before November 2000, this document used the word codeset for what it now calls encoding. I changed the terminology because I could not find the word codeset in documents written in English (I had adopted it from a book in Japanese). Encoding seems to be the more popular word.
During I18N programming, we will frequently meet EUC-JP or EUC-KR, while we will rarely meet EUC by itself. I think it is not appropriate to stress EUC, a class of encodings, over concrete encodings such as EUC-JP and EUC-KR. That would be like regarding ISO 8859 as a concrete encoding, though ISO 8859 is a class of encodings, ISO 8859-{1,2,...,15}.
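As a rough illustration of the locale approach, the following Python sketch asks the locale which encoding to use instead of hardcoding UTF-8. The helper names `locale_encoding` and `decode_text` are hypothetical and exist only for this example; the same idea applies to `setlocale()` in C.

```python
import locale


def locale_encoding():
    """Return the encoding named by the user's locale (LANG, LC_ALL, ...)."""
    try:
        # Adopt the user's locale instead of the default "C" locale.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale is not installed; keep the current one.
        pass
    return locale.getpreferredencoding(False)


def decode_text(data, encoding=None):
    """Decode bytes with the given encoding, or with the locale's encoding."""
    if encoding is None:
        encoding = locale_encoding()
    return data.decode(encoding)


if __name__ == "__main__":
    print("Encoding selected by the locale:", locale_encoding())
```

A program written this way needs no change when the user switches from, say, an EUC-JP locale to a UTF-8 locale.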
ISBN 1-56592-224-7, O'Reilly, 1999
though no existing encoding is both stateful and non-multibyte.
Compound text is a standard for text exchange between X clients.
WHAT IS THE VALUE OF THESE CONTROL CODES?
This is obviously not true for CNS 11643, because CNS 11643 contains 48711 characters while Unicode 3.0.1 contains 49194 characters, only 483 more than CNS 11643.
Strictly speaking, U+000000 - U+10FFFF.
Compare UTF and EUC. EUC has a few variants whose CCSs differ (EUC-JP, EUC-KR, and so on). This is why we cannot call EUC an encoding. In other words, the name 'EUC' alone cannot specify an encoding. On the other hand, 'UTF-8' is the name of a specific, concrete encoding.
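The point can be seen with a two-byte example. In this small Python sketch (the helper `decode_as` is a hypothetical name), the same byte sequence is a valid character in both EUC-JP and EUC-KR, but a different one in each, because the underlying CCSs differ.

```python
RAW = b"\xb0\xa1"  # one two-byte EUC character


def decode_as(data, encoding):
    """Decode the same bytes under a named concrete encoding."""
    return data.decode(encoding)


if __name__ == "__main__":
    for name in ("euc_jp", "euc_kr"):
        print(name, decode_as(RAW, name))
```

Only the concrete name (EUC-JP or EUC-KR) tells a program which character the bytes stand for.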
I heard that the BOM is merely a suggestion by a vendor. Read Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux for details.
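If a program nevertheless wants to tolerate a BOM at the start of UTF-8 input, one possible approach is to drop it before further processing. This is only a sketch under that assumption, not something the FAQ prescribes; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs


def strip_utf8_bom(data):
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    print(strip_utf8_bom(b"\xef\xbb\xbfhello"))
```

Python's "utf-8-sig" codec does the same thing while decoding.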
XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.
I heard that Chinese and Korean people don't mind the glyphs of these characters. If this is always true, Japanese glyphs should be the default glyphs for these problematic characters on international systems such as Debian.
There are a few projects, such as Mojikyo (about 90000 characters) and the TRON project (about 130000 characters), which aim to develop a CCS that contains enough characters for professional usage in the CJK world.
The standard encoding for Macintosh and MS Windows.
I HAVE TO SHOW EXAMPLE USING GRAPHICS.
Using UCS-4 is the second-best solution to this problem. Sometimes LOCALE technology cannot be used, and then UCS-4 is the best choice. I will discuss this solution later.
There are a few exceptions. Compound text should be used for communication between X clients. UTF-8 would be the standard for file names in Linux.
Some of you may know that GNU libc uses UCS-4 for the internal representation of wchar_t. However, you should not rely on this knowledge. It may differ on other systems.
In such a case, do they think of abolishing support for 7-bit or 8-bit non-multibyte encodings? If not, it may be unfair that speakers of 8-bit languages can use both UTF-8 and conventional (local) encodings, while speakers of languages which need multibyte encodings, combining characters, and so on cannot use their popular local encodings. I think such software cannot be called "internationalized".
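One visible sign that wchar_t is platform-dependent is its size, which can be inspected without writing C. The following sketch uses Python's ctypes module; `wchar_size_bytes` is a hypothetical helper name. The size is typically 4 bytes on GNU libc systems and 2 bytes on MS Windows.

```python
import ctypes


def wchar_size_bytes():
    """Return sizeof(wchar_t) for the platform this interpreter runs on."""
    return ctypes.sizeof(ctypes.c_wchar)


if __name__ == "__main__":
    print("sizeof(wchar_t) =", wchar_size_bytes())
```

Portable code should therefore treat wchar_t values as opaque and let the locale functions interpret them.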
(Yes, there are ways to display Japanese characters correctly -- kon (in the kon2 package) for the console and kterm for X, and Japanese people are happy with gettextized Japanese messages.)
This section does not cover problems in developing a console itself; it covers problems in developing software which runs on a console.
Though UTF-8 is an encoding with a single CCS, the current version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
IMHO, all users will have to set LANG properly once UTF-8 becomes popular.
This is a field where proprietary systems such as MS Windows and Macintosh are much easier than free systems such as Debian and FreeBSD.
I HAVE TO WRITE EXPLANATION.
Read /usr/X11R6/lib/X11/locale/ja/XLC_LOCALE for detail.
In such a case, XCreateFontSet() does not fail. Instead, it returns information on the missing fonts.
I implemented a cleverer mechanism in window managers such as Blackbox and Sawfish, where I think beauty is more important than simplicity. The intended algorithm is:
There are three popular codesets for Japanese: ISO-2022-JP, EUC-JP, and SHIFT-JIS. EUC-JP should be used for perl source code because no byte of a non-ASCII character in EUC-JP has a value in 0x21 - 0x7e. However, ISO-2022-JP is the safest codeset for display, because EUC-JP and SHIFT-JIS cannot be used together. On the other hand, ISO-2022-JP is the most difficult codeset to implement, and there may be a terminal environment which does not understand ISO-2022-JP (for example, Minicom). Dotfiles, in contrast, may be written in any codeset, according to one's preference and purpose.
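The byte-range claim can be checked with a short Python sketch. The helper `has_printable_ascii_bytes` is a hypothetical name, and the sample character is chosen because its second byte in SHIFT-JIS happens to be 0x5c (the backslash), which is the kind of byte that confuses a parser of source code.

```python
def has_printable_ascii_bytes(text, encoding):
    """True if any byte of the encoded text lies in 0x21 - 0x7e."""
    return any(0x21 <= byte <= 0x7E for byte in text.encode(encoding))


if __name__ == "__main__":
    sample = "表"  # a single non-ASCII (kanji) character
    for name in ("euc_jp", "shift_jis", "iso2022_jp"):
        encoded = sample.encode(name)
        print(name, encoded, has_printable_ascii_bytes(sample, name))
```

Only EUC-JP keeps every byte of the character outside 0x21 - 0x7e; SHIFT-JIS and ISO-2022-JP both produce bytes that look like printable ASCII.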
Introduction to i18n
14 February 2003
kubota@debian.org