[ previous ] [ Contents ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] [ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ] [ 13 ] [ 14 ] [ next ]

Introduction to i18n
Chapter 6 - LOCALE technology

LOCALE is a basic concept introduced into ISO C (ISO/IEC 9899:1990). The standard is expanded in 1995 (ISO 9899:1990 Amendment 1:1995). In LOCALE model, the behaviors of some C functions are dependent on LOCALE environment. LOCALE environment is divided into a few categories and each of these categories can be set independently using setlocale().

POSIX also determines some standards around i18n. Almost of POSIX and ISO C standards are included in XPG4 (X/Open Portability Guide) standard and all of them are included in XPG5 standard. Note that XPG5 is included in UNIX specifications version 2. Thus support of XPG5 is mandatory to obtain Unix brand. In other words, all versions of Unix operating systems support XPG5.

The merit of using locale technology over hard-coding of Unicode is:

The software can be written encoding-independent way. This means that this software can support all encodings which the OS supports, including 7bit, 8bit, multibyte, stateful, and stateless encodings such as ASCII, ISO 8859-*, EUC-*, ISO 2022-*, Big5, VISCII, TIS 620, UTF-*, and so on.

The software will provides a common unified method to configure locale and encoding. This benefits users. Otherwise, users will have to remember the method to enable UTF-8 mode for each software. Some softwares need -u8 switch, other need X resource setting, other need .foobarrc file, other need a special environmental variable, other use UTF-8 for default. It is nonsense!

The advancement of the OS means the advancement of the software. Thus, you can use new locale without recompiling your software.

You can read the Unicode support in the Solaris Operating Environment whitepapaer and understand the merit of this model. Bruno Haible's Unicode HOWTO also recommends this model.

6.1 Locale Categories and `setlocale()`

In LOCALE model, the behaviors of some C functions are dependent on LOCALE environment. LOCALE environment is divided into six categories and each of these categories can be set independently using setlocale().

The followings are the six categories:

LC_CTYPE: Category related to encodings. Characters which are encoded by LC_CTYPE-dependent encoding is called multibyte characters. Note that multibyte character doesn't need to be multibyte.
LC_CTYPE-dependent functions are: character testing functions such as islower() and so on, multibyte character functions such as mblen() and so on, multibyte string functions such as mbstowcs() and so on, and so on.

LC_COLLATE: Category related to sorting. strcoll() and so on are LC_COLLATE-dependent.

LC_MESSAGES: Category related to the language for messages the software outputs. This category is used for gettext.

LC_MONETARY: Category related to format to show monetary numbers, for example, currency mark, comma or period, columns, and so on. localeconv() is the only function which is LC_MONETARY-dependent.

LC_NUMERIC: Category related to format to show general numbers, for example, character for decimal point.
Formatted I/O functions such as printf(), string conversion functions such as atof(), and so on are LC_NUMERIC-dependent.

LC_TIME: Category related to format to show time and date, such as name of months and weeks, order of date, month, and year, and so on.
strftime() and so on are LC_TIME-dependent.

setlocale() is a function to set LOCALE. Usage is char *setlocale(int category, const char *locale);. Header file of locale.h is needed for prototype declaration and definition of macros for category names. For example, setlocale(LC_TIME, "de_DE");.

For category, the following macros can be used: LC_CTYPE, LC_COLLATE, LC_MONETARY, LC_NUMERIC, LC_TIME, and LC_ALL. For locale, specific locale name, NULL, or "" can be specified.

Giving NULL for locale will return the current value of the specified locale category. Otherwise, setlocale() returns the newly set locale name, or NULL for error.

Given "" for locale, setlocale() will determine the locale name in the following manner:

At first, consult LC_ALL environmental variable.

If LC_ALL is not available, consult environmental variable same as the name of the locale category. For example, LC_COLLATE.

If none of them are available, consult LANG environmental variable.

This is why a user is expected to set LANG variable. In other words, all what a user has to do is to set LANG variable so that all locale-compliant softwares work well for desired way.

Thus, I recommend strongly to call setlocale(LC_ALL, ""); at the first of your softwares, if the softwares are to be international.

6.2 Locale Names

We can specify locale names for these six locale categories. Then, which name should we specify?

The syntax to build a locale name is determined as follows:

       language[_territory][.codeset][@modifier]

where language is two lowercase alphabets described in ISO639, such as en for English, eo for Esperanto, and zh for Chinese, territory is two uppercase alphabets described in ISO3166, such as GB for United Kingdom, KR for Republic of Korea (South Korea), CN for China. There are no standard for codeset and modifier. GNU libc uses ISO-8859-1, ISO-8859-13, eucJP, SJIS, UTF8, and so on for codeset, and euro for modifier.

However, it is depend on the system which locale names are valid. In other words, you have to install locale database for locale you want to use. Type locale -a to display all supported locale names on the system.

Note that locale names of "C" and "POSIX" are determined for the names for default behavior. For example, when your software need to parse the output of date(1), you'd better call setlocale(LC_TIME, "C"); before invocation of date(1).

6.3 Multibyte Characters and Wide Characters

Now we will concentrate on LC_CTYPE, which is the most important category in six locale categories.

Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*, ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world. It is inefficient and a cause of bugs, even not impossible, for every softwares to implement all these encodings. Fortunately, we can use LOCALE technology to solve this problem. [18]

Multibyte characters is a term to call characters encoded in locale-specific encoding. It is nothing special. It is mere a word to call our daily encodings. In ISO 8859-1 locale, ISO 8859-1 is multibyte character. In EUC-JP locale, EUC-JP is multibyte character. In UTF-8 locale, UTF-8 is multibyte character. In short, multibyte character is defined by LC_CTYPE locale category. Multibyte characters is used when your software inputs or outputs text data from/to everywhere out of your software, for example, standard input/output, display, keyboard, file, and so on, as you are doing everyday. [19]

You can handle multibyte characters using ordinal char or unsigned char types and ordinal character- and string-oriented functions. It is just like you used to do for ASCII and 8bit encodings.

Then why we call it with a special term of multibyte character? The answer is, ISO C specifies a set of functions which can handle multibyte characters properly. On the other hand, it is obvious that usual C functions such as strlen() cannot handle multibyte characters properly.

Then what is these functions which can handle multibyte characters properly? Please wait a minute. Multibyte character may be stateful or stateless and multibyte or non-multibyte, since it includes all encodings ever used and will be used on the earth. Thus it is not convenient for internal processing. It needs complex algorithm even for, for example, character extraction from a string, addition and division of a string, or counting of number of character in a string. Thus, wide characters should be used for internal processing. And, the main part of these C functions which can handle multibyte characters are functions for interconversion between multibyte characters and wide characters. These functions are introduced later. Note that you may be able to do without these functions, since ISO C supplies I/O functions with conversion.

Wide character is defined in ISO C

that all characters are expressed in fixed width of bits.

that it is stateless, i.e., it doesn't have shift states.

There are two types for wide characters: wchar_t and wint_t. wchar_t is a type which can contain one wide character. It is just like 'char' type can be used for contain one character. wint_t can contain one wide character or WEOF, an substitution of EOF.

A string of wide characters is achieved by an array of wchar_t, just like a string of characters is achieved by an array of char.

There are functions for wchar_t, substitute for functions for char.

strcat(), strncat() -> wcscat(), wcsncat()

strcpy(), strncpy() -> wcscpy(), wcsncpy()

strcmp(), strncmp() -> wcscmp(), wcsncmp()

strcasecmp(), strncasecmp() -> wcscasecmp(), wcsncasecmp()

strcoll(), strxfrm() -> wcscoll(), wcsxfrm()

strchr(), strrchr() -> wcschr(), wcsrchr()

strstr(), strpbrk() -> wcsstr(), wcspbrk()

strtok(), strspn(), strcspn() -> wcstok(), wcsspn(), wcscspn()

strtol(), strtoul(), strtod() -> wcstol(), wcstoul(), wcstod()

strftime() -> wcsftime()

strlen() -> wcslen()

toupper(), tolower() -> towupper(), towlower()

isalnum(), isalpha(), isblank(), iscntrl(), isdigit(), isgraph(), islower(), isprint(), ispunct(), isspace(), isupper(), isxdigit() -> iswalnum(), iswalpha(), iswblank(), iswcntrl(), iswdigit(), iswgraph(), iswlower(), iswprint(), iswpunct(), iswspace(), iswupper(), iswxdigit() (isascii() doesn't have its wide character version).

memset(), memcpy(), memmove, memmove(), memchr() -> wmemset(), wmemcpy(), wmemmove, wmemmove(), wmemchr()

There are additional functions for wchar_t.

wcwidth(), wcswidth()

wctrans(), towctrans()

You cannot assume anything on the concrete value of wchar_t, besides 0x21 - 0x7e are identical to ASCII. [20] You may feel this limitation is too strong. If you cannot do under this limitation, you can use UCS-4 as the internal encoding. In such a case, you can write your software emulating the locale-sensible behavior using setlocale(), nl_langinfo(CODESET), and iconv(). Consult the section of nl_langinfo() and iconv(), Section 6.5. Note that it is generally easier to use wide character than implement UCS-4 or UTF-8.

You can write wide character in the source code as L'a' and wide string as L"string". Since the encoding for the source code is ASCII, you can only write ASCII characters. If you'd like to use other characters, you should use gettext.

There are two ways to use wide characters:

I/O is described using multibyte characters. Inputed data are converted into wide character immediately after reading and data for output are converted from wide character to multibyte character immediately before writing. Conversion can be achieved using functions of mbstowcs(), mbsrtowcs(), wcstombs(), wcsrtombs(), mblen(), mbrlen(), mbsinit(), and so on. Please consult the manual pages for these functions.

Wide characters are directly used for I/O, using wide character functions such as getwchar(), fgetwc(), getwc(), ungetwc(), fgetws, putwchar(), fputwc(), putwc(), and fputws(), formatted I/O functions for wide characters such as fwscanf(), wscanf(), swscanf(), fwprintf(), wprintf(), swprintf(), vfwprintf(), vwprintf(), and vswprintf(), and wide character identifier of %lc, %C, %ls, %S for conventional formatted I/O functions. By using this approach, you don't need to handle multibyte characters at all. Please consult the manual pages for these functions.

Though latter functions are also determined in ISO C, these functions have became newly available since GNU libc 2.2. (Of course all UNIX operating systems have all functions described here.)

Note that very simple softwares such as echo doesn't have to care about multibyte character. and wide characters. Such software can input and output multibyte character as is. Of course you may modify these softwares using wide characters. It may be a good practice of wide character programming. Examples of a fragment of source codes will be discussed in Internal Processing and File I/O, Chapter 9.

There is an explanation of multibyte and wide characters also in Ken Lunde's "CJKV Information Processing" (p25). However, the explanation is entirely wrong.

6.4 Unicode and LOCALE technology

UTF-8 is considered as the future encoding and many softwares are coming to support UTF-8. Though some of these softwares implement UTF-8 directly, I recommend you to use LOCALE technology to support UTF-8.

How this can be achieved? It is easy! If you are a developer of a software and your software has already written using LOCALE technology, you don't have to do anything!

Using LOCALE technology benefits not only developers but also users. All a user has to do is set locale environment properly. Otherwise, a user has to remember the method to use UTF-8 mode for each software. Some softwares need -u8 switch, other need X resource setting, other need .foobarrc file, other need a special environmental variable, other use UTF-8 for default. It is nonsense!

Solaris has been already developed using this model. Please consult Unicode support in the Solaris Operating Environment whitepapaer.

However, it is likely that some of upstream developers of softwares of which you are maintaining a Debian package refuses to use wchar_t for some reasons, for example, that they are not familiar with LOCALE programming, that they think it is troublesome, that they are not keen on I18N, that it is much easier to modify the software to support UTF-8 than to modify it to use wchar_t, that the software must work even under non-internationalized OS such as MS-DOS, and so on. Some developers may think that support of UTF-8 is sufficient for I18N. [21] Even in such cases, you can rewrite such a software so that it checks LC_* and LANG environmental variables to emulate the behavior of setlocale(LC_ALL, "");. You can also rewrite the software to call setlocale(), nl_langinfo(), and iconv() so that the software supports all encodings which the OS supports, as discussed later. Consult the discussion in the Groff mailing list on the support of UTF-8 and locale-specific encodings, mainly held by Werner LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA, the author of this document.

6.5 `nl_langinfo()` and `iconv()`

Though ISO C defines extensive LOCALE-related functions, you may want more extensive support. You may also want conversion between different encodings. There are C functions which can be used for such purposes.

char *nl_langinfo(nl_item item) is an XPG5 function to get LOCALE-related informations. You can get the following informations using the following macros for item defined in langinfo.h header file:

names for days in week (DAY_1 (Sunday), DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, and DAY_7)

abbreviated names for days in week (ABDAY_1 (Sun), ABDAY_2, ABDAY_3, ABDAY_4, ABDAY_5, ABDAY_6, and ABDAY_7)

names for months in year (MON_1 (January), MON_2, MON_3, MON_4, MON_5, MON_6, MON_7, MON_8, MON_9, MON_10, MON_11, and MON_12)

abbreviated names for months in year (ABMON_1 (January), ABMON_2, ABMON_3, ABMON_4, ABMON_5, ABMON_6, ABMON_7, ABMON_8, ABMON_9, ABMON_10, ABMON_11, and ABMON_12)

name for AM (AM_STR)

name for PM (PM_STR)

name of era (ERA)

format of date and time (D_T_FMT)

format of date and time (era-based) (ERA_D_T_FMT)

format of date (D_FMT)

format of date (era-based) (ERA_D_FMT)

format of time (24-hour format) (T_FMT)

format of time (am/pm format) (T_FMT_AMPM)

format of time (era-based) (ERA_T_FMT)

radix (RADIXCHAR)

thousands separator (THOUSEP)

alternative characters for numerics (ALT_DIGITS)

affirmative word (YESSTR)

affirmative response (YESEXPR)

negative word (NOSTR)

negative response (NOEXPR)

encoding (CODESET)

For example, you can get names for months and use them for your original output algorithm. YESEXPR and NOEXPR are convenient for softwares expecting Y/N answer from users.

iconv_open(), iconv(), and iconv_close() are functions to perform conversion between encodings. Please consult manpages for them.

Combining nl_langinfo() and iconv(), you can easily modify Unicode-enabled software into locale-sensible truly internationalized software.

At first, add a line of setlocale(LC_ALL, ""); at the first of the software. If it returns non-NULL, enable UTF-8 mode of the software.

     int conversion = FALSE;
     char *locale = setlocale(LC_ALL, "");
        :
        :
     (original code to determine UTF-8 mode or not)
        :
        :
     if (locale != NULL && utf_mode == FALSE) {
         utf8_mode = TRUE;
         conversion = TRUE;
     }

Then modify input routine as following:

     #define INTERNALCODE "UTF-8"
     if (conversion == TRUE) {
         char *fromcode = nl_langinfo(CODESET);
         iconv_t conv = iconv_open(INTERNALCODE, fromcode);
         (reading and conversion...)
         iconv_close(conv);
     } else {
         (original reading routine)
     }

Finally modify the output routine as following:

     if (conversion == TRUE) {
         char *tocode = nl_langinfo(CODESET);
         iconv_t conv = iconv_open(tocode, INTERNALCODE);
         (conversion and writing...)
         iconv_close(conv);
     } else {
         (original writing routine)
     }

Note that whole reading should be done at once since otherwise you may divide multibyte character. You can consult the iconv_prog.c file in the distribution of GNU libc for usage of iconv().

Though nl_langinfo() is a standard function of XPG5 and GNU libc supports it, it is not very portable. And more, there are no standard for encoding names for nl_langinfo() and iconv_open(). If this is a problem, you can use Bruno Haible's libiconv. It has iconv(), iconv_open(), and iconv_close(). And more, it has locale_charset(), a replacement of nl_langinfo(CODESET).

6.6 Limit of Locale technology

Locale model has a limit. That is, it cannot handle two locales at the same time. Especially, it cannot handle relationship between two locales at all.

For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings in Japan. EUC-JP is the de-facto standard for UNIX systems, ISO 2022-JP is the standard for Internet, and Shift-JIS is the encoding for Windows and Macintosh. Thus, Japanese people have to handle texts with these encodings. Text viewers such as jless and lv and editors such as emacs can automatically understand the encoding to be read. You cannot write such a software using Locale technology.

[ previous ] [ Contents ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] [ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ] [ 13 ] [ 14 ] [ next ]

Introduction to i18n

14 February 2003
Tomohiro KUBOTA kubota@debian.org

Introduction to i18n Chapter 6 - LOCALE technology

6.1 Locale Categories and setlocale()