Introduction to i18n
Chapter 7 - Output to Display

Here 'Output to Display' does not mean translation of messages using gettext. I am concerned with whether characters are displayed correctly so that we can read them. For example, install the libcanna1g package and display /usr/doc/libcanna1g/README.jp.gz on a console or xterm (after ungzipping it, of course). This text file is written in Japanese, but even Japanese people cannot read such a row of strange characters. If you were a Japanese speaker, which would you prefer: an English message which can be read with a dictionary, or a row of strange characters produced by gettextization? [22]

Problems with displaying non-English (non-ASCII) characters are discussed below.

7.1 Console Software

In this section, problems with displaying characters on the console are discussed. [23] Here, 'console' includes the bare Linux console (both the framebuffer and the conventional versions), special consoles built by dedicated software such as kon2, jfbterm, and chdrv, and X terminal emulators such as xterm, kterm, hanterm, xiterm, rxvt, xvt, gnome-terminal, wterm, aterm, and eterm. Remote environments reached via telnet or secure shell, such as NCSA telnet for Macintosh and Tera Term for Windows, are also regarded as consoles.

The console has the following features:

All that software has to do is send correctly encoded text to standard output. Software running on the console does not need to care about fonts and so on.

Fixed-width fonts are used. The unit of font width is called a 'column'. 'Doublewidth' fonts, i.e., fonts whose width is 2 columns, are used for CJK ideograms, Japanese Hiragana and Katakana, Korean Hangul, and related symbols. Combining characters, used for Thai and so on, can be regarded as 'zero'-column characters.

7.1.1 Encoding

Software running on the console is not responsible for displaying characters; the console itself is. Some consoles can display encodings other than ASCII, for example:

kon in kon2 package: EUC-JP, Shift-JIS, and ISO-2022-JP

jfbterm: EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96, and 94x94 coded character sets whose fonts are available)

kterm: EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212, GB 2312, and KSC 5601)

krxvt in rxvt-ml package: EUC-JP

crxvt-gb in rxvt-ml package: CN-GB

crxvt-big5 in rxvt-ml package: Big5

cxtermb5 in cxterm-big5 package: Big5

xcinterm-big5 in xcin package: Big5

xcinterm-gb in xcin package: CN-GB

xcinterm-gbk in xcin package: GBK

xcinterm-big5hkscs in xcin package: Big5 with HKSCS

hanterm: EUC-KR, Johab, and ISO 2022-KR

xiterm and txiterm in xiterm+thai package: TIS 620

xterm: UTF-8

However, there is no way for software on the console to know which encodings the console supports. I think it is the user's responsibility to set the LC_CTYPE locale properly (i.e., through the LC_ALL, LC_CTYPE, or LANG environment variable). Provided the LC_CTYPE locale is set properly, software can use it to learn which encoding the console expects.
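As a minimal sketch of how software might learn the encoding from the LC_CTYPE locale, the following C function adopts the user's locale and asks for its codeset name with the POSIX function nl_langinfo(). The function name console_codeset() is made up for this illustration, and the exact strings returned depend on the C library, so treat them as hints rather than canonical encoding names.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: codeset name of the user's LC_CTYPE locale. */
const char *console_codeset(void)
{
    /* "" means: take the locale from LC_ALL, LC_CTYPE, or LANG.
     * If this fails, the process simply stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");

    /* The returned string may be overwritten by later locale calls,
     * so copy it if it has to be kept. */
    return nl_langinfo(CODESET);
}
```

A program would call console_codeset() once at start-up and compare the result against the encodings it knows how to produce.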

As for messages translated by gettext, the software does not need to do anything special. It works well if the user has set the LC_CTYPE and LC_MESSAGES locales properly.

If you handle strings in a non-ASCII encoding (using multibyte characters, UTF-8 directly, and so on), you have to take care of points that do not arise with ASCII:

8-bit cleanness. I think everyone understands this: the software must pass all eight bits of every byte through unchanged.

Continuity of multibyte characters. In multibyte encodings such as EUC-JP and UTF-8, one character may consist of two or more bytes. These bytes must be output contiguously. Inserting additional codes between them can break the character. I have seen software which outputs a cursor-positioning control code every time it outputs one byte. This breaks multibyte characters.

7.1.2 Number of Columns

Internationalized console software cannot assume that a character always occupies one column. You can get the number of columns occupied by a character or a string using wcwidth() and wcswidth(), respectively. Note that you have to use wchar_t-style programming, since these functions take wchar_t parameters.

Additional care has to be taken not to destroy multicolumn characters. For example, imagine your software has displayed a double-column character at (row, column) = (1, 1). What happens when your software then displays a single-column character at (row, column) = (1, 2) or at (1, 1)? Does the single-column character erase half of the double-column character? Nobody knows the answer. It depends on the implementation of the console. All I can tell you is that your software should avoid such cases.

If your software reads a string from the keyboard, you have to take even more care. The numbers of characters, bytes, and columns all differ. For example, in the UTF-8 encoding, the character 'a' with acute accent occupies two bytes and one column, while a CJK ideograph occupies three bytes and two columns. So if the user types 'Backspace', how many backspace codes (0x08) should the software output? How many bytes should the software erase from its internal buffer? Don't be nervous; you can use wchar_t, which ensures that one character always occupies one wchar_t, and you can use wcwidth() to find the number of columns. Note that control codes such as 'backspace' (0x08) are always column-oriented. Backspace moves back 'one' column even if the character at that position is a doublewidth character.

7.2 X Clients

The way to develop X clients differs drastically depending on the toolkit used. Xlib-style programming is discussed first, since Xlib is the foundation of all other toolkits. Then a few toolkits are discussed.

7.2.1 Xlib programming

X itself is already internationalized. X11R5 introduced the idea of a 'fontset' for internationalized text output. Thus all that X clients have to do is use the 'fontset'-related functions.

The most important part of internationalizing display in X clients is to use the internationalized XFontSet-related functions, introduced in X11R5, instead of the conventional XFontStruct-related functions.

The main feature of XFontSet is that it can handle multiple fonts at the same time. This is related to the distinction between a coded character set (CCS) and a character encoding scheme (CES), which I described in Basic Terminology, Section 3.1. Some encodings use multiple coded character sets at the same time. This is why we have to handle multiple X fonts at the same time. [24]

Another significant feature of XFontSet is that it is sensitive to the locale (LC_CTYPE). This means that you have to call setlocale() before you use XFontSet-related functions. Moreover, you have to specify the string you want to draw as a multibyte string or a wide character string.

In the conventional XFontStruct model, an X client opens a font using XLoadQueryFont(), draws a string using XDrawString(), and closes the font using XFreeFont(). In the internationalized XFontSet model, on the other hand, an X client opens a fontset using XCreateFontSet(), draws a string using XmbDrawString(), and closes the fontset using XFreeFontSet(). The following is a concise list of substitutions.

XFontStruct -> XFontSet

XLoadQueryFont() -> XCreateFontSet()

both of XDrawString() and XDrawString16() -> either of XmbDrawString() or XwcDrawString()

both of XDrawImageString() and XDrawImageString16() -> either of XmbDrawImageString() or XwcDrawImageString()

Note that XFontStruct is usually used as a pointer, while XFontSet itself is a pointer.

Some people (speakers of ISO-8859-1 languages) may think that XFontSet-related functions are not 8-bit clean. This is wrong. XFontSet-related functions work according to the LC_CTYPE locale. The default LC_CTYPE locale uses ASCII. Thus, if a user sets none of the LANG, LC_CTYPE, and LC_ALL environment variables, XFontSet-related functions will use ASCII, which is why they appear not to be 8-bit clean. The user has to set the LANG, LC_CTYPE, or LC_ALL environment variable properly (for example, LANG=en_US).

Upstream developers of X clients sometimes dislike forcing users to set such environment variables. [25] In that case, the X client should have two ways to output text: the conventional XFontStruct-related way and the internationalized XFontSet-related way. If setlocale() returns NULL, "C", or "POSIX", use the XFontStruct way. Otherwise use the XFontSet way. The author implemented this algorithm in a few window managers such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0), and sawmill (0.28).
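The decision rule just described can be sketched in C as below. The function names are hypothetical and this is not the code from the window managers mentioned; it only illustrates the test on the value returned by setlocale().

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: decide from a locale name, as returned by
 * setlocale(), whether the XFontSet way should be used (1) or the
 * conventional XFontStruct way (0). */
int locale_wants_fontset(const char *name)
{
    if (name == NULL)
        return 0;                       /* setlocale() failed */
    return strcmp(name, "C") != 0 && strcmp(name, "POSIX") != 0;
}

/* Hypothetical helper: adopt the user's locale and apply the rule. */
int should_use_fontset(void)
{
    return locale_wants_fontset(setlocale(LC_CTYPE, ""));
}
```

A client would call should_use_fontset() once at start-up and select one of its two text output paths accordingly.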

Window managers need more modifications related to inter-client communication. This topic will be described later.

7.2.2 Athena widgets

The Athena widgets are already internationalized.

***** Not written yet *****

7.2.3 Gtk and Gnome

Gtk is already internationalized.

***** Not written yet *****

7.2.4 Qt and KDE

Though an internationalized version of Qt had been available for a long time, it could not become the official version of Qt. The Qt license of those days prohibited distribution of the internationalized version. However, Troll Tech eventually changed their mind and Qt's license, and now the official version of Qt is internationalized.

***** Not written yet *****

Introduction to i18n

14 February 2003
Tomohiro KUBOTA kubota@debian.org
