
Introduction to i18n
Chapter 2 - Introduction

2.1 General Concepts

Debian includes many pieces of software. Though many of them can process, input, and output text data, some of these programs assume that text is written in English (ASCII). For people who use non-English languages, these programs are barely usable. Moreover, though many programs can handle ISO-8859-1 as well as ASCII, some of them cannot handle multibyte characters for CJK (Chinese, Japanese, and Korean) languages, nor combining characters for Thai.

So far, people who use non-English languages have given up using their native languages and have accepted computers as they were. However, we should now abandon this mistaken attitude. It is absurd that a person who wants to use a computer has to learn English in advance.

I18N is needed in the following places.

1. Displaying characters for the users' native languages.

2. Inputting characters for the users' native languages.

3. Handling files written in popular encodings [1] that are used for the users' native languages.

4. Using characters from the users' native languages for file names and other items.

5. Printing out characters from the users' native languages.

6. Displaying messages by the program in the users' native languages.

7. Formatting input and output of numbers, dates, money, etc., in a way that obeys the customs of the users' native cultures.

8. Classifying and sorting characters in a way that obeys the customs of the users' native cultures.

9. Using typesetting and hyphenation rules appropriate for the users' native languages.

This document puts emphasis on the first three items, because they are the basis for the other items. Another reason is that you cannot use software lacking the first three items at all, while you can use software lacking the other items, albeit inconveniently. This document will also mention translation of messages (item 6), which is often what people call 'I18N'. Note that the author regards using the term 'I18N' for translation and gettextization as completely wrong. The reason is reflected in the fact that the author did not include translation and gettextization in the important first three items.

Imagine a word processor that can display error and help messages in your native language but cannot process text in your native language. You will easily understand that such a word processor is not usable. On the other hand, a word processor that can process your native language but displays error and help messages only in English is usable, though it is not convenient. Before we think of developing convenient software, we have to think of developing usable software.

The following terminology is widely used.

I18N (internationalization) means modifying software or related technologies so that the software can potentially handle the many languages, customs, and so on of the world.

L10N (localization) means implementing support for a specific language in already internationalized software.

However, this terminology is valid only for one specific model out of the few models that we should consider for I18N. I will now introduce a few models other than this I18N-L10N model.

a. L10N (localization) model: This model supports two languages or character codes, English (ASCII) and one other specific language. Examples of software developed using this model are the Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual Emacs) text editor, which can input and output Japanese text files, and the Hanterm X terminal emulator, which can display and input Korean characters via a few Korean encodings. Since each programmer has his or her own mother tongue, there are numerous L10N patches and L10N programs, each written to satisfy its author's own needs.

b. I18N (internationalization) model: This model supports many languages, but only two of them, English (ASCII) and one other, at the same time. The user has to specify the other language, usually with the LANG environment variable. The I18N-L10N model above can be regarded as a part of this I18N model. Gettextization is categorized under the I18N model.
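As a rough illustration of how a C program picks up the 'other' language in this model, the sketch below asks the C library to read the locale from the environment (LC_ALL, the LC_* variables, or LANG). The function name init_locale_from_environment is hypothetical, and the fallback to the "C" locale is just one possible policy, not something this document prescribes.

```c
#include <locale.h>
#include <stddef.h>

/* Select the locale requested by the user's environment.
   If that locale is not installed, setlocale() returns NULL and the
   previous locale stays in effect; this sketch then explicitly falls
   back to the portable "C" locale (an assumed policy).
   Returns the name of the locale now in effect. */
const char *init_locale_from_environment(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}
```

A program would typically call this once at the top of main(), before any locale-dependent function is used.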

c. M17N (multilingualization) model: This model supports many languages at the same time. For example, Mule (MULtilingual Enhancement to GNU Emacs) can handle a text file which contains multiple languages - for example, a paper on differences between Korean and Chinese whose main text is written in Finnish. GNU Emacs 20 and XEmacs now include Mule. Note that the M17N model can be applied only to character-related matters. For example, it makes no sense to display a message like 'file not found' in many languages at the same time. Unicode and UTF-8 are technologies which can be used for this model. [2]
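To make the role of UTF-8 here a little more concrete, the following sketch stores one Finnish, one Korean, and one Chinese character in a single string and counts them. The names utf8_length and mixed_sample are made up for this example, and the function assumes its input is already valid UTF-8. It counts code points, which is not the same as user-visible characters when combining characters are involved.

```c
#include <stddef.h>

/* Count the code points in a UTF-8 string.
   Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
   starts a new code point. Assumes valid UTF-8 input. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* U+00E4 (Finnish a-umlaut), U+D55C (a Hangul syllable) and
   U+4E2D (a Chinese character), written as explicit UTF-8 bytes
   so that the source file itself stays pure ASCII. */
const char mixed_sample[] = "\xC3\xA4" "\xED\x95\x9C" "\xE4\xB8\xAD";
```

Here mixed_sample occupies 8 bytes but utf8_length(mixed_sample) reports 3 code points, one from each language.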

Generally speaking, the M17N model is the best and the I18N model is the second-best. The L10N model is the worst, and you should not use it except in a few fields where the I18N and M17N models are very difficult, like DTP and X terminal emulators. In other words, it is better for text-processing software to handle many languages at the same time than to handle only two (English and another language).

Now let me classify approaches for support of non-English languages from another viewpoint.

A. Implementation without knowledge of each language: This approach uses standardized methods supplied by the kernel or libraries. The most important one is locale technology, which includes locale categories, conversion between multibyte and wide characters (wchar_t), and so on. Another important technology is gettext. The advantages of this approach are (1) that when the kernel or libraries are upgraded, the software automatically supports newly added languages, (2) that programmers need not know each language, and (3) that a user can switch the behavior of software with a common method, such as the LANG variable. The disadvantage is that there are categories or fields where a standardized method is not available. For example, there are no standardized methods for text typesetting rules such as line-breaking and hyphenation.
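A small sketch of approach A: the function below counts the characters in a multibyte string without containing any knowledge of a particular encoding. It relies only on the standard mbrlen() function, whose behavior follows the LC_CTYPE category of the current locale. The name count_chars is hypothetical, and the caller is assumed to have selected a locale with setlocale() beforehand.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in a multibyte string, as defined by the
   LC_CTYPE category of the current locale.
   Returns (size_t)-1 if the string contains an invalid or
   truncated multibyte sequence. */
size_t count_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    while (remaining > 0) {
        size_t len = mbrlen(s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        if (len == 0)
            break;                     /* NUL byte: end of string */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

For plain ASCII text the result equals the byte count. If the program has called setlocale(LC_ALL, "") and the user's locale uses a multibyte encoding, the same code counts characters in that encoding, and it would keep working for encodings that the C library adds later.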

B. Implementation using knowledge of each language: This approach directly implements information about each language, based on the knowledge of programmers and contributors. L10N almost always uses this approach. The advantage of this approach is that a detailed and strict implementation is possible even beyond the fields where standardized methods are available, such as auto-detection of the encodings of text files to be read. Language-specific problems can be perfectly solved (of course, this depends on the skill of the programmer). The disadvantages are (1) that the number of supported languages is restricted by the skill or the interest of the programmers or the contributors, (2) that labor which should be united and concentrated on upgrading the kernel or libraries is dispersed into many programs, that is, the wheel is re-invented, and (3) that a user has to learn how to configure each program separately, such as with the LESSCHARSET variable, the .emacs file, and other methods. This approach can cause problems: for example, GNU roff (before version 1.16) assumed that 0xad is a hyphen character, which is valid only for ISO-8859-1. However, an impressive M17N program such as Mule can be built using this approach.
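The 0xad problem can be reproduced with a few lines of C. The sketch below is not GNU roff's code. It is a made-up illustration of a hard-coded rule and of a slightly more careful variant, and the function names and the charset parameter are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Naive approach B: treat every 0xAD byte as a hyphen character,
   no matter which encoding the text is in. */
size_t count_hyphen_bytes(const unsigned char *text, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; i++) {
        if (text[i] == 0xAD)
            count++;
    }
    return count;
}

/* More careful: apply the ISO-8859-1 rule only when the caller
   states that the text really is ISO-8859-1. For any other charset
   this sketch has no built-in knowledge and reports nothing. */
size_t count_soft_hyphens(const unsigned char *text, size_t len,
                          const char *charset)
{
    if (strcmp(charset, "ISO-8859-1") != 0)
        return 0;
    return count_hyphen_bytes(text, len);
}
```

The three bytes 0xE4 0xB8 0xAD are the UTF-8 encoding of a single Chinese character (U+4E2D). The naive function finds one 'hyphen' inside it, while the charset-aware variant called with "UTF-8" finds none. The price is that every additional encoding needs its own hand-written rule, which is exactly disadvantage (1) above.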

Using this classification, let me consider the L10N, I18N, and M17N models from the programmer's point of view.

The L10N model can be realized using only the programmer's knowledge of his or her own language (i.e. approach B). Since the motivation of L10N is usually to satisfy the programmer's own needs, extensibility to other languages is often ignored. Though L10N-ed software is primarily useful for people who speak the same language as the programmer, it is sometimes useful for other people whose coding system is similar to the programmer's. For example, software which doesn't recognize EUC-JP but doesn't break EUC-JP will not break EUC-KR either.

The main part of the I18N model is, in the case of a C program, achieved using standardized locale technology and gettext. The locale approach is classified as I18N because locale-related functions change their behavior according to the current locales for the six categories, which are set by setlocale(). That is, approach A is emphasized for I18N. For fields where standardized methods are not available, however, approach B cannot be avoided. Even in such a case, developers should take care that support for new languages can easily be added later, even by other developers.
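The sketch below lists the current setting of each locale category by calling setlocale() with a NULL locale argument, which queries without changing anything. It assumes that the six categories meant here are the POSIX ones. LC_MESSAGES is defined by POSIX rather than ISO C, so the code presumes a POSIX system. The names describe_locale and locale_category are hypothetical.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

struct locale_category {
    int id;
    const char *name;
};

/* The six POSIX locale categories (assumed to be the ones meant above). */
static const struct locale_category categories[] = {
    { LC_CTYPE,    "LC_CTYPE"    },   /* character classification, multibyte handling */
    { LC_COLLATE,  "LC_COLLATE"  },   /* sorting order */
    { LC_MONETARY, "LC_MONETARY" },   /* money formatting */
    { LC_NUMERIC,  "LC_NUMERIC"  },   /* decimal point and grouping */
    { LC_TIME,     "LC_TIME"     },   /* date and time formatting */
    { LC_MESSAGES, "LC_MESSAGES" },   /* language of messages */
};

/* Write one "NAME=value" line per category into buf.
   Returns the number of categories that fitted into the buffer. */
size_t describe_locale(char *buf, size_t size)
{
    size_t n = sizeof categories / sizeof categories[0];
    size_t used = 0;
    size_t i;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        /* A NULL locale argument only queries the current setting. */
        const char *value = setlocale(categories[i].id, NULL);
        int len = snprintf(buf + used, size - used, "%s=%s\n",
                           categories[i].name, value ? value : "?");

        if (len < 0 || (size_t)len >= size - used)
            break;                     /* buffer full or output error */
        used += (size_t)len;
    }
    return i;
}
```

At program startup every category is "C". After setlocale(LC_ALL, "") the values reflect the user's environment, and each category can also be set on its own, for example setlocale(LC_CTYPE, "") alone.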

The M17N model can be achieved using international encodings such as ISO 2022 and Unicode. Though you can hard-code these encodings in your software (i.e. approach B), I recommend using standardized locale technology. However, using international encodings is not sufficient to achieve the M17N model. You will have to provide a mechanism to switch input methods. You will also want to provide an encoding-guessing mechanism for input files, such as jless and emacs have. Mule is the best software that has achieved M17N (though it does not use locale technology).
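To give a feeling for what an encoding-guessing mechanism involves, here is a deliberately tiny sketch that distinguishes only three cases: pure ASCII, text that is structurally valid UTF-8, and everything else. It does not reflect how jless or emacs actually work, all names are hypothetical, and the UTF-8 check is simplified (for instance, it does not reject every overlong form or surrogate).

```c
#include <stddef.h>

enum encoding_guess { GUESS_ASCII, GUESS_UTF8, GUESS_UNKNOWN };

/* Guess the encoding of a byte buffer (simplified heuristic). */
enum encoding_guess guess_encoding(const unsigned char *p, size_t len)
{
    size_t i = 0;
    int saw_multibyte = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t extra;                  /* continuation bytes expected */
        size_t k;

        if (c < 0x80) {                /* plain ASCII byte */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
        } else {
            return GUESS_UNKNOWN;      /* not a valid UTF-8 lead byte */
        }

        if (len - i <= extra)
            return GUESS_UNKNOWN;      /* sequence is cut off */
        for (k = 1; k <= extra; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return GUESS_UNKNOWN;  /* expected a continuation byte */
        }
        saw_multibyte = 1;
        i += extra + 1;
    }
    return saw_multibyte ? GUESS_UTF8 : GUESS_ASCII;
}
```

A real guesser would go on to tell apart the many encodings that end up as GUESS_UNKNOWN here, such as EUC-JP, EUC-KR, or ISO-8859-1, which requires language-specific knowledge and is therefore a typical job for approach B.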

2.2 Organization

Let's preview the contents of each chapter in this document.

As I wrote, this document will put stress on correct handling of characters and character codes for users' native languages. To achieve this purpose, I will start the real contents of this document by discussing important basic concepts about characters in Important Concepts for Character Coding Systems, Chapter 3. Since this chapter introduces many terms, all of you will need to read it. The next chapter, Coded Character Sets And Encodings in the World, Chapter 4, introduces many national and international standards for coded character sets and encodings. I think almost all of you can do without reading this chapter, since LOCALE technology will enable us to develop international software without knowledge of these character sets and encodings. However, knowing about these standards will help you to understand the merit and necessity of LOCALE technology.

The following chapter, Characters in Each Country, Chapter 5, describes detailed information for each language. This information will help people who develop high-quality text processing software such as DTP software and Web browsers.

The chapter LOCALE technology, Chapter 6, describes the most important concept for I18N. Not only concepts but also many important C functions are introduced in this chapter.

The next few chapters, Output to Display, Chapter 7, Input from Keyboard, Chapter 8, Internal Processing and File I/O, Chapter 9, and the Internet, Chapter 10, cover important and frequent applications of LOCALE technology. You can find solutions for typical I18N problems in these chapters.

You may need to develop software using some special libraries or languages other than C/C++. The chapters Libraries and Components, Chapter 11, and Softwares Written in Other than C/C++, Chapter 12, are written for such purposes.

The next chapter, Examples of I18N, Chapter 13, is a collection of case studies. Both generic and special technologies will be discussed. You can also contribute by writing a section for this chapter.

You may want to study more. The last chapter, References, Chapter 14, is supplied for this purpose. Some of the references listed in the chapter are very important.


14 February 2003
Tomohiro KUBOTA kubota@debian.org