This chapter describes information specific to each language. If you are developing serious DTP software or planning to support detailed I18N, this chapter may help you. Contributions from speakers of each language are welcome. If you write a section on your language, please include these points:
Writers whose languages are written in a different direction from European languages, or need combining characters (I have heard these are used in Thai), are encouraged to explain how to treat such languages.
This section was written by Tomohiro KUBOTA (kubota@debian.org).
Japanese is the only official language used in Japan. People in the Okinawa islands and the Ainu ethnic group in the Hokkaido region have their own languages, though these are used by only a small number of people and have no writing systems of their own.
Japan is the only region where the Japanese language is widely used.
There are three kinds of characters used in Japan: Hiragana, Katakana, and Kanji. Arabic numerals (the same as in European languages) are widely used in Japanese, though we also have Kanji numerals. Though the Latin alphabet is not part of the Japanese script, Latin letters are widely used for proper nouns such as company names.
Hiragana and Katakana are phonograms derived from Kanji. Hiragana and Katakana characters have a one-to-one correspondence with each other, like the upper and lower case of the Latin alphabet. However, toupper() and tolower() should not convert between Hiragana and Katakana. Hiragana contains about 100 characters, and so does Katakana. (FYI: about 50 regular characters, 20 characters with the voiced consonant symbol, 5 characters with the semi-voiced consonant symbol, and 9 small characters.)
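The one-to-one correspondence can be sketched in a few lines of Python. This is only an illustration and assumes the text is held as Unicode strings, where the Hiragana and Katakana blocks are laid out in parallel at a fixed distance of 0x60; the function names are made up for this example. It also shows that the ordinary case conversion leaves Hiragana alone.

```python
# Distance between the Hiragana and Katakana blocks in Unicode (0x60).
KANA_OFFSET = ord("ア") - ord("あ")

def hira_to_kata(text):
    """Replace each Hiragana letter (U+3041..U+3096) with its Katakana twin."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

def kata_to_hira(text):
    """Replace each Katakana letter (U+30A1..U+30F6) with its Hiragana twin."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

print(hira_to_kata("ひらがな"))   # ヒラガナ
print(kata_to_hira("カタカナ"))   # かたかな
print("ひらがな".upper())         # case conversion does not touch kana
```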
Kanji are ideograms imported from China roughly 1,000 to 2,000 years ago. Nobody knows the total number of Kanji, and almost all adult Japanese people know several thousand Kanji characters. Though Kanji originate from Chinese characters, their shapes have changed from the original ancient Chinese forms. Almost all Kanji have several readings, depending on the word in which the Kanji appears.
JIS (Japanese Industrial Standards) is the set of national standards that covers the coded character sets (CCS) and encodings used in Japan. The major coded character sets in Japan are:
JIS X 0201 Roman is the Japanese version of ISO 646. Though JIS X 0201 is included in the SHIFT-JIS encoding (explained later) and is widely used on Windows and Macintosh, its use is not encouraged on UNIX.
JIS X 0201 Kana defines about 60 Katakana characters. It was widely used by old 8-bit computers. Indeed, the SHIFT-JIS encoding was designed to be upward-compatible with the 8-bit encoding of JIS X 0201 Roman and JIS X 0201 Kana. Note that this CCS is not included in the ISO-2022-JP encoding, which is used for e-mail and so on.
JIS X 0212 is not widely used, probably because it cannot be included in SHIFT-JIS, the standard encoding for the Japanese versions of Windows and Macintosh. Moreover, this CCS may become obsolete once JIS X 0213 becomes popular, since JIS X 0213 contains many of the characters in JIS X 0212. However, the advantage of JIS X 0212 over JIS X 0213 is that all characters in JIS X 0212 are included in the current Unicode (version 3.0.1), while not all characters in JIS X 0213 are.
JIS X 0208 (aka JIS C 6226) is the main standard for Japanese characters. Strictly speaking, it was originally defined in 1978 and revised in 1983, 1990, and 1997. Though the 1997 version has 77 more characters than the original 1978 version and the shapes of more than 200 characters have changed, most software does not have to care about the differences between them. However, be aware that the ISO-2022-JP encoding (explained below) contains both JIS X 0208-1978 and JIS X 0208-1983. The 1978 version is called 'old JIS' and the later one is called 'new JIS'. Characters in JIS X 0208 are divided into two levels, 1st and 2nd. Old 8-bit computers rarely implemented the 2nd level.
The use of the numeric characters and Latin letters in JIS X 0208 is not encouraged, because these characters are also included in ASCII and JIS X 0201 Roman, one of which is included in every encoding. When converted into Unicode, these characters are mapped to 'fullwidth forms'.
All of these coded character sets except JIS X 0213 are included in Unicode 3.0.1; some JIS X 0213 characters are not.
There are a few different tables for conversion between the non-letter characters in JIS X 0208 and Unicode. This is a problem because it may break 'round-trip compatibility'. Problems and Solutions for Unicode and User/Vendor Defined Characters discusses this problem in detail.
There are three popular encodings widely used in Japan.
ISO-2022-JP is a subset of the 7-bit version of ISO 2022, in which only G0 is used and G0 is assumed to be invoked into GL. The character sets included in ISO-2022-JP are:
Note that JIS X 0208-1978 and JIS X 0208-1983 are almost identical, and that ASCII and JIS X 0201-1976 Roman are also almost identical. A line (the stream of bytes between 'newline' control codes) must start in the ASCII state and end in the ASCII state. See ISO 2022, Section 4.3 for detail.
ISO-2022-JP-2 (RFC 1554) is a subset of the 7-bit version of ISO 2022 and a superset of ISO-2022-JP. The difference between ISO-2022-JP and ISO-2022-JP-2 is that ISO-2022-JP-2 has more coded character sets than ISO-2022-JP. The character sets included in ISO-2022-JP-2 are:
Though JIS X 0212-1990 may sometimes be used, ISO-2022-JP-2 is rarely used.
ISO-2022-INT-1 is a superset of ISO-2022-JP-2 which has CNS 11643-1986-1 and CNS 11643-1986-2 (traditional Chinese).
EUC-JP is a version of EUC, where G0 is ASCII, G1 is JIS X 0208, G2 is JIS X 0201 Kana, and G3 is JIS X 0212. G2 and G3 are sometimes not implemented. This is the most popular encoding for Linux/Unix. See EUC (Extended Unix Code), Section 4.3.1 for detail.
SHIFT-JIS is designed to be a superset of the encodings for old 8-bit computers, which comprise JIS X 0201 Roman and JIS X 0201 Kana. 0x20 - 0x7f is JIS X 0201 Roman and 0xa0 - 0xdf is JIS X 0201 Kana. The remaining ranges, 0x81 - 0x9f and 0xe0 - 0xef, are the first bytes of double-byte characters (vendor extensions also use 0xf0 - 0xfc). The second byte is 0x40 - 0x7e or 0x80 - 0xfc. This double-byte code space is used for JIS X 0208.
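Because the first byte alone tells whether a character is one or two bytes long, a SHIFT-JIS byte stream can be cut into characters without any table. The following Python sketch is only an illustration; the function names are hypothetical, and the lead-byte ranges used (0x81 - 0x9f and 0xe0 - 0xfc, which includes the vendor extension area) are an assumption about what a tolerant decoder would accept.

```python
def is_sjis_lead_byte(b):
    """True if b starts a double-byte SHIFT-JIS character (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def split_sjis(data):
    """Split SHIFT-JIS bytes into one byte string per character."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following byte as well, if there is one.
        n = 2 if is_sjis_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + n])
        i += n
    return chars

# One JIS X 0201 Kana byte, one Roman byte, and one JIS X 0208 double byte.
sample = "ｱA漢".encode("shift_jis")
print([len(c) for c in split_sjis(sample)])
```

Note that the second byte of a double-byte character may fall in the ASCII range (0x40 - 0x7e), which is why software that scans SHIFT-JIS text byte by byte for characters such as the backslash can go wrong.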
Unicode is not popular in Japan at all, probably because conversion from these codes into Unicode is somewhat difficult. However, MS Windows uses Unicode in limited areas, for example as the internal code for file names. I expect more and more software to support Unicode in the future.
You can convert files written in these encodings into one another using the nkf or kcc package. With the options -j, -s, and -e, nkf converts a file into ISO-2022-JP (aka JIS), SHIFT-JIS (aka MS-KANJI), and EUC-JP, respectively. Note that the difference between JIS X 0201 Roman and ASCII is ignored. Though nkf can guess the encoding of the input file, you can also specify the encoding with a command option. This is because there is no algorithm that can completely distinguish EUC-JP from SHIFT-JIS, though nkf usually guesses correctly. tcs can also convert between these encodings, though without guessing the input encoding. Conversion between these encodings can be done with a simple algorithm, since all of them are based on the same character sets. Conversion between these encodings and Unicode, however, requires a table.
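As a rough illustration of these conversions, here is a Python sketch. It does not use nkf; it assumes Python's standard codecs named iso2022_jp, shift_jis, and euc_jp, and the helper names are made up for this example. The last function shows the "simple algorithm" for one case: a JIS X 0208 character in EUC-JP is the same two bytes as in ISO-2022-JP with the high bit set.

```python
text = "日本語"

# The same text in the three popular encodings.
jis = text.encode("iso2022_jp")    # 7-bit, with escape sequences
sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

def convert(data, src, dst):
    """Convert bytes from encoding src to encoding dst by way of Unicode."""
    return data.decode(src).encode(dst)

def euc_kanji_to_jis_bytes(data):
    """Clear the high bit of every byte of JIS X 0208 text in EUC-JP.

    The result is the 7-bit byte pairs that appear between the escape
    sequences of ISO-2022-JP. No conversion table is needed.
    """
    return bytes(b & 0x7F for b in data)

print(convert(euc, "euc_jp", "shift_jis") == sjis)
print(euc_kanji_to_jis_bytes(euc) in jis)
```

The table-free shortcut works only between encodings of the same character set; the convert() function goes through Unicode and therefore relies on the conversion tables built into the codecs.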
Since EUC-JP is widely used on UNIX, EUC-JP should be supported. Exceptions are shown below. Of course, hard-coding knowledge of EUC-JP into your software is not encouraged. If you can implement your software without such knowledge, by using wchar_t and so on, you should do so.
In consoles which are able to display Japanese characters (kon, jfbterm, kterm, krxvt, and so on), characters in JIS X 0201 (Roman and Kana) occupy 1 column and characters in JIS X 0208, JIS X 0212, and JIS X 0213 occupy 2 columns.
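When the text is held as Unicode, the column count can be approximated from the East Asian Width property. The Python sketch below is only an approximation under that assumption: it treats "W" (wide) and "F" (fullwidth) characters as 2 columns and everything else as 1, and it ignores control and combining characters. The function name is hypothetical.

```python
import unicodedata

def display_width(text):
    """Approximate the number of terminal columns text occupies."""
    width = 0
    for ch in text:
        # Wide and fullwidth characters (Kanji, Hiragana, fullwidth Latin)
        # take 2 columns; ASCII and halfwidth Katakana take 1.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("ABC"))      # 3
print(display_width("日本語"))   # 6
print(display_width("ｱｲｳ"))      # 3 (halfwidth Katakana from JIS X 0201)
```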
Japanese can be written vertically. A line runs downward and lines are arranged from right to left. This is the traditional style. For example, most Japanese books, magazines, and newspapers are written vertically, except those in the natural sciences (or ones containing many Latin words or equations). Thus a word processor is strongly recommended to support this direction. DTP systems which don't support it are almost useless.
Japanese can also be written in the same direction as Latin-script languages. Japanese books and magazines on science and technology are written this way. For most ordinary software, it is enough to support this direction only.
A few Japanese characters need different glyphs for vertical writing. These are the ones you would expect: parentheses and the 'long syllable' symbol, whose shape is like an English dash or a mathematical minus sign. The symbols equivalent to the period and comma also have different forms for horizontal and vertical writing.
In Japan, Arabic numerals are widely used, as in European languages, though we also have Kanji (ideographic) numerals. Latin characters can also appear in Japanese text. If a run of 1 - 3 (or 4) Arabic or Latin characters appears in vertical Japanese text, these characters can be crowded into a single column. If more characters appear (large numbers or long words), the paper is rotated 90 degrees anticlockwise and the characters are written in the European way. Sometimes Latin characters that appear in vertical text are written in the same way as Japanese characters, i.e., in the vertical direction. This custom is not very strong: Arabic and Latin characters can always be written in either the normal or the rotated way in vertical text. [17] A DTP system should support all of these.
A version of Japanized TeX (developed by ASCII, a publishing company in Japan) supports vertical writing. It can also handle a page containing both vertical and horizontal text.
In Japanese, words are not separated by spaces and, unlike in European languages, a line can be broken anywhere, with a few exceptions. Thus hyphenation is not needed for Japanese.
Characters like opening parentheses cannot come at the end of a line. Characters like closing parentheses and sentence-separating marks such as the period and comma cannot come at the beginning of a line. This rule and its processing are called 'kinsoku' in Japanese.
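A very small kinsoku processor might look like the Python sketch below. It is a simplification under several assumptions: every character is taken to have the same width, the two prohibited-character sets are short samples rather than a complete list, and a forbidden break is resolved only by moving the break point to the left. Real typesetting systems have more ways to resolve it. All names are hypothetical.

```python
# Sample characters that must not begin a line, and that must not end one.
NO_LINE_START = set("）」』】、。，．")
NO_LINE_END = set("（「『【")

def break_lines(text, width):
    """Break text into lines of at most width characters, obeying kinsoku."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break to the left while the next line would start with a
        # forbidden character or this line would end with one.
        while end > start + 1 and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines

for line in break_lines("これは「例」です。", 4):
    print(line)
```

With a width of 4, a naive break would leave the opening bracket at the end of the first line; the sketch breaks one character earlier instead.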
In European languages, a line break is equivalent to a space. In Japanese, a line break should be ignored. For example, when rendering an HTML file, a line break in the HTML source should not be converted into whitespace.
Different values of LANG are used for different encodings.
The following values are used for EUC-JP.
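One possible way to join source lines is sketched below in Python. The rule used here is an assumption made for the example, not an established standard: the line break is dropped when the characters on both sides of it are wide (East Asian Width "W" or "F") and turned into a space otherwise. The function names are hypothetical.

```python
import unicodedata

def is_wide(ch):
    """True for characters such as Kanji, Hiragana, and Katakana."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

def join_lines(lines):
    """Join source lines, ignoring breaks between two wide characters."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Insert a space only if at least one neighbour is not wide.
        if result and not (is_wide(result[-1]) and is_wide(line[0])):
            result += " "
        result += line
    return result

print(join_lines(["日本語の", "文章"]))   # 日本語の文章
print(join_lines(["hello", "world"]))     # hello world
```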
LANG=ja_JP.jis is used for ISO-2022-JP (aka JIS code or JUNET code).
LANG=ja_JP.sjis is used for SHIFT-JIS (aka Microsoft Kanji Code).
Setting LANG is not sufficient for a Japanese user who has just installed Linux to get a minimal Japanese environment. There are several books on setting up a Japanese environment on Linux/BSD, and Linux magazines often carry feature articles on the subject. Nowadays many Japanized Linux distributions, which are tuned so that much of the basic software can display and input Japanese, are popular. Debian GNU/Linux has the user-ja (for potato) and language-env (for woody and later versions) packages to set up a basic Japanese environment.
Since Japanese characters cannot be entered directly from a keyboard, software is needed to convert ASCII characters into Japanese. WNN, Canna, and SKK are popular free software packages for Japanese input. Though T-Code is also available, it is difficult to use. Since these adopt a server/client model and implement their own protocols, we cannot input Japanese with wnn, canna, or skk (and the packages they depend on) alone.
In the X Window System environment, the kinput2-* and skkinput packages connect these protocols to XIM, which is the standard input protocol for X. Kinput2 also has its own protocol, and kterm and other programs can act as clients of the kinput2 protocol. The kinput2 protocol was developed before international standards such as XIM (or Ximp or Xsi) became available.
On the console, there is no standard and each program has to support the wnn and/or canna protocol itself. Examples are jvim-canna, xemacs21-mule-canna, and emacs20 with emacs-dl-canna or emacs-dl-wnn. Thus the way to operate differs from program to program. skkfep provides a general way to input Japanese on the console.
The way to input Japanese is explained next.
Since almost all Hiragana and Katakana characters represent a pair of a consonant and a vowel with a single character, we can input one Hiragana or Katakana character with two Latin letters. A few Hiragana and Katakana characters need one or three letters.
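The first step of input, from Latin letters to Hiragana, can be sketched as a table lookup that tries the longest spelling first. The Python example below is only an illustration: the table is a tiny hand-picked sample (a real input method covers every syllable and several spelling variants), and the names are hypothetical.

```python
# A tiny sample table; keys are one, two, or three Latin letters.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お", "n": "ん",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "su": "す", "ni": "に", "ho": "ほ",
    "shi": "し", "chi": "ち", "tsu": "つ",
}

def romaji_to_hiragana(text):
    """Convert Latin letters to Hiragana, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for n in (3, 2, 1):
            chunk = text[i:i + n]
            if chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += n
                break
        else:
            # Unknown letter: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(romaji_to_hiragana("nihon"))   # にほん
print(romaji_to_hiragana("sushi"))   # すし
```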
Kanji are obtained by conversion from Hiragana. Many Japanese words are written with two or more Kanji, and most recent conversion software can convert such words in one step. (Old software could convert only one Kanji at a time. You had to be patient to use it.) Software with a good grammar/context analyzer and a large dictionary can convert longer phrases or even a whole sentence at once. However, we usually have to select one Kanji or word from the candidates the software shows, because Japanese has many homophones. For example, 61 Kanji read 'KAN' and 6 words read 'KOUKOU' are registered in the canna dictionary. (Today, 2 Oct 1999, I saw a TV advertisement for a Japanese word processor which claimed that the software can correctly convert an input into 'a cafe which opened today', not 'a cafe which rotated today'. Though the Japanese word 'KAITEN' means both 'open (a shop)' and 'rotate', the software knows it is more usual for a cafe to open than to rotate.)
The conversion from Hiragana to Kanji needs a large dictionary containing the Kanji spellings and readings of the major Japanese words, along with their conjugations and inflections. Thus proprietary software tends to convert more efficiently; it usually comes with dictionaries larger than a few megabytes. Some recent proprietary products even analyze the topic or meaning of the entered Hiragana sentence and choose the most appropriate homophone, though they often choose the wrong one.
Nowadays several proprietary conversion programs for Linux, such as ATOK, WNN6, and VJE, are sold in Japan.
Since entering Japanese characters is complex and hard work for users, we don't want to input Y (for YES) or N (for NO) in Japanese. We prefer learning such basic English words to entering Japanese words, which means invoking the conversion software, typing the Latin-alphabet spelling of the Japanese, converting it into Hiragana, converting it into Kanji, choosing the correct Kanji, confirming the choice, and leaving the conversion software every time we need to input yes, no, or similar words.
Unlike European languages, Japanese characters should be written in a fixed width. An exception arises when two symbols such as parentheses, periods, and commas occur in succession. Kerning should be done in such cases if the software is a word processor. A text editor need not do so.
Ruby is small text (usually 1/2 the length and 1/4 the area of the main characters, or a bit smaller) written above the main text (in horizontal writing) or to its right (in vertical writing). It is usually used to show the reading of a difficult Kanji.
Japanized TeX can produce ruby by using an extra macro. Word processors should have a ruby facility.
Japanese characters do not have upper and lower case, although there are two sets of phonograms, Hiragana and Katakana.
Thus tolower() and toupper() should not convert between Hiragana and Katakana.
Hiragana is used for ordinary text. Katakana is used mainly to express foreign or imported words, for example, KONPYU-TA for computer, MAIKUROSOFUTO for Microsoft, and AINSYUTAIN for Einstein.
Phonograms (Hiragana and Katakana) have a sorting order. The order is the same as the one defined in JIS X 0208, with a few exceptions.
Sorting ideograms (Kanji) is difficult. They should be sorted by their readings, but almost all Kanji have several readings depending on the context. So if you want to sort Japanese text, you will need a dictionary of all Japanese words written in Kanji. Moreover, a few Japanese words written with exactly the same sequence of Kanji have different readings; this occurs especially with personal names. So address book databases usually have two 'name' columns, one for the Kanji form and the other for the Hiragana reading.
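The two-column approach can be sketched as follows in Python. The sample names and the tuple layout are invented for the example, and comparing the Hiragana readings by code point is only an approximation of the proper phonogram order (which, as noted, has a few exceptions).

```python
# Hypothetical address book entries: (name in Kanji, reading in Hiragana).
entries = [
    ("田中", "たなか"),
    ("鈴木", "すずき"),
    ("青木", "あおき"),
    ("佐藤", "さとう"),
]

# Sorting by the Kanji column follows code values, which say nothing
# about how the names are read.
by_kanji = sorted(entries, key=lambda entry: entry[0])

# Sorting by the reading column gives the order a Japanese user expects.
by_reading = sorted(entries, key=lambda entry: entry[1])

print([name for name, _ in by_reading])
```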
I know of no software, free or proprietary, that can sort Japanese words perfectly.
We have a phonetic alphabetic representation of Japanese, Ro-ma ji. It has an almost one-to-one correspondence with the Japanese phonograms. It can be used to display Japanese text on the Linux console and so on. Since Japanese has many homophones, text in this representation can be hard to read.
There are several variants of Ro-ma ji.
The first distinguishing point is on handling of long syllable. For example, long syllable of 'E' is expressed in:
The second distinguishing point is some special pairs of vowel and consonant. For example, Hiragana character for combination of 'T' and 'I' is pronounced like 'CHI'.
Section written by Eusebio C Rufian-Zilbermann (eusebio@acm.org).
Spanish is one of the official languages in Spain, the official language in most of the countries in the American continent and the official language in Equatorial Guinea. It is spoken in many other regions where it is not the official language. Other official languages in Spain are Galician, Catalan and Basque. These other languages each have their own specific issues with regards to Localization. They are not described in this section of the document.
The Spanish language derives from the variety spoken in the Castile region. The term Castilian is sometimes used to refer to the Spanish language (particularly when an author wants to stress the fact that there are other languages spoken in Spain). Castilian and Spanish refer to the same language; they are not different things.
Spanish uses a Latin alphabet. The numerical characters used in Spanish are the Arabic numerals.
The character that distinguishes Spanish from other languages using the Latin alphabet is the Ñ ('N' with tilde), which exists in uppercase and lowercase versions. Vowels in Spanish may carry a mark (the accent) on top of them to indicate stress. This accent is required for orthography (written correctness) on lowercase vowels but is optional on uppercase vowels. The letter 'u' may have a dieresis (like the German umlaut), in both uppercase and lowercase forms.
Some punctuation signs are characteristic of the Spanish language. The opening question mark and the opening exclamation sign look like the English question mark and exclamation sign rotated 180 degrees. The English question mark and exclamation sign are referred to as closing question mark and exclamation sign. The small underlined 'a' and 'o' are used mainly for ordinal numbers, similar to the small 'th' in English ordinals.
UNE (Una Norma Española) is the National Standards Organization in Spain. UNE is a member of the ISO and standards that have one-to-one correspondence are usually called by their ISO number, rather than their UNE number.
ISO 8859-1, also known as ISO Latin-1, contains the characters required for Spanish.
The codeset mostly used for Spanish is ISO 8859-1. The codepage Windows 1252 a.k.a. Windows Latin-1 is a superset of ISO 8859-1 that adds some characters in the range 128 to 159. Other codesets are Unicode, Macintosh Roman (codepage 1000), MS-DOS Latin-1 (codepage 850) or less frequently MS-DOS Latin US (codepage 437) which contains accented lowercase characters but not uppercase. Some additional Latin codesets are EBCDIC CP500 and CP 1026 (used in IBM mainframes and terminal emulators), Adobe Standard (used as default for Postscript documents), Nextstep Latin, HP Roman 8 (for HPUX and Laserjet resident printer fonts) and the Latin codepage in OS/2. They are all stateless, 8-bit codepages (with the exception of Unicode that is 16-bit).
In most cases it is safe to use ISO 8859-1 characters. Some exceptions are
On console displays, each character occupies one column. Printed text can be equally spaced (one column per character) or proportionally spaced (a character can occupy fractionally more or less than a column, depending on its shape).
Note: Even when using Traditional Sorting, ch and ll occupy two columns. See the comment on Traditional sorting in Sorting, Section 5.2.10.1.
Spanish is normally written in left to right lines arranged from top to bottom of the page. For artistic purposes it might be written in top to bottom columns arranged left to right within the page. This columnar arrangement would be expected only in graphic and charting programs (e.g., a drawing program, a spreadsheet graph or a page layout program for composing brochures) but regular text editors wouldn't be expected to implement this style.
In the Spanish language, words are separated by spaces and a line can be broken at a space, a punctuation sign or a hyphenated word.
There are several sets of paired characters in Spanish. Unlike in English, question marks and exclamation signs are also paired. Other paired characters are the same as in English (parentheses, square brackets, and so forth). Opening characters shouldn't appear at the end of a line. Closing characters and punctuation signs such as the period and comma shouldn't appear at the beginning of a line.
Words can be broken at a syllable boundary and hyphenated. Unlike in English, syllables in Spanish end in a vowel more often than in a consonant. Syllables that end in a consonant are typically at the end of a word or followed by a syllable that starts with another consonant. In any case, the rules are not completely consistent and a hyphenation dictionary has to be used.
For Bash
set meta-flag on # keep all 8 bits for keyboard input
set output-meta on # keep all 8 bits for terminal output
set convert-meta off # don't convert escape sequences
export LC_CTYPE=ISO_8859_1
For Tcsh
setenv LANG C
setenv LC_CTYPE "iso_8859_1"
For the Spanish keyboard to work correctly, you need the command loadkeys /usr/lib/kbd/keytables/es.map in the corresponding startup (rc) file.
Most of the Spanish characters are input from the keyboard with a single stroke. A two-key combination is used for accent and dieresis marks above vowels. Traditional typewriters used a 'dead key' system, with keys that would strike the paper without advancing the carriage to the next character. Typing on a computer keyboard simulates this behavior: pressing the accent or dieresis key does not produce any visible output until a vowel is typed afterwards. Usually, if the accent or dieresis key is followed by a consonant, the accent key is ignored. Accented characters and characters with a dieresis cannot be used as shortcut keys for selecting options.
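The dead-key behavior can be simulated with a short Python sketch. It is an illustration only and makes some assumptions: the dead keys are represented by the characters ´ and ¨ in the input, the combined letter is produced with Unicode NFC normalization, and a dead key followed by anything other than a vowel is silently dropped. The names are hypothetical.

```python
import unicodedata

# Dead keys and the combining mark each one stands for.
DEAD_KEYS = {"´": "\u0301", "¨": "\u0308"}
VOWELS = "aeiou"

def type_keys(keys):
    """Return the text produced by a sequence of key presses."""
    out = []
    pending = None          # combining mark waiting for a vowel
    for key in keys:
        if key in DEAD_KEYS:
            pending = DEAD_KEYS[key]    # no visible output yet
            continue
        if pending is not None and key.lower() in VOWELS:
            # Compose vowel + mark into one precomposed character.
            key = unicodedata.normalize("NFC", key + pending)
        pending = None      # a dead key before a consonant is ignored
        out.append(key)
    return "".join(out)

print(type_keys("cami´on"))     # camión
print(type_keys("ping¨uino"))   # pingüino
print(type_keys("´t"))          # t
```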
The words for Yes and No are Sí (the character next to S is 'i' with acute accent) and No. We would commonly use the S and N keys for a Sí/No choice.
Spanish keyboards usually allow for typing not only the Spanish accent signs, but also the accent signs used in French and other languages (grave accent, circumflex accent, umlaut on letters other than the u). Another character that is typically available is the cedilla C (which looks like a C with a comma underneath, used for Catalan, Portuguese, and French words, for example). There is a Latin-American keyboard layout that does not contain the grave accent and the cedilla C.
Traditional Spanish considered the combinations CH and LL individual single letters. For usage in computers, this required an additional effort for sorting and character counting algorithms. It was decided that the savings in not requiring special algorithms was significant enough and that it would be acceptable to treat them as 2 separate letters. Some software that already had incorporated the special sorting algorithms now allows for choosing between 'Traditional Spanish Sort' and 'Modern Spanish Sort'.
Accents and diereses are ignored for sorting purposes. The only exception is the rare case where two words are exactly the same and the accent is the only difference; then the word with the unaccented character should be sorted first. E.g., camión (c-a-m-i-o with acute accent-n), camionero, este, éste (e with acute accent-s-t-e).
The ñ (n with tilde) is always sorted after the n and before the o. It cannot be intermixed with the n.
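These rules can be expressed as a sort key. The Python sketch below is one possible reading of them, not a complete collation algorithm: it implements the modern sort (ch and ll as two letters), strips accents with Unicode NFD decomposition, ranks ñ as a separate letter right after n, and uses the original spelling as a tie-breaker so that the unaccented word comes first. The function name is hypothetical.

```python
import unicodedata

def spanish_sort_key(word):
    """Build a key for sorting Spanish words (modern sort)."""
    primary = []
    for ch in word.lower():
        if ch == "ñ":
            # A letter of its own: after every n, before o.
            primary.append(("n", 1))
        else:
            # Drop accent and dieresis: keep only the base letter.
            base = unicodedata.normalize("NFD", ch)[0]
            primary.append((base, 0))
    # The word itself breaks ties: "este" sorts before "éste".
    return (primary, word)

words = ["éste", "camionero", "caña", "este", "canto", "camión", "cano"]
print(sorted(words, key=spanish_sort_key))
```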
The use of the dot and the comma as a thousands separator and for decimal places is usually the opposite of US English. E.g., 1.000,00 instead of 1,000.00. Some Spanish-speaking countries, notably Mexico, follow the same standards as the US. It is desirable that programs can handle both forms as an independent setting.
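Keeping the two separators as independent settings is straightforward. The following Python sketch shows one way to do it; the function name and its defaults are assumptions made for the example, and a real program would normally take the separators from the user's locale.

```python
def format_number(value, thousands=".", decimal=","):
    """Format value with two decimals and configurable separators."""
    us_style = f"{value:,.2f}"      # e.g. "1,000.00"
    # Swap both separators in one pass so they do not clobber each other.
    return us_style.translate(str.maketrans({",": thousands, ".": decimal}))

print(format_number(1000))                                  # 1.000,00
print(format_number(1000, thousands=",", decimal="."))      # 1,000.00
```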
The usual date format is DD-MM-YYYY rather than MM-DD-YYYY, but again this depends on the specific country. It is desirable to have the date format as a configurable parameter.
The currency symbol can be prepended or appended to the number and it can be one or several characters long. E.g., 100 PTA for Spanish pesetas or N$ 100 for Mexican pesos. It is desirable that the symbol and position can be individually defined and to allow for currency symbols longer than 1-character.
Spanish is spoken by a tremendous variety of people. Academics throughout the different Spanish-speaking countries realized that this could lead to a dismemberment of the language and founded the Academy of the Spanish Language. This academy has branches in most of the Spanish-speaking countries: there is a Royal Academy of the Spanish Language of Spain, an Academy of the Spanish Language of Mexico, et cetera. The members of this Academy study the local evolution of the language in each country. They meet to maintain a body of knowledge of what should be considered the Standard Spanish Language and what should be considered local or regional terms and slang terms.
In most cases, software can use terms that are within the Standard set by the Academy. When new terms appear (e.g., when a new product is created that has no previous name in the Spanish language), each region typically starts using a new word. When one or two terms become the de facto standard, the Academy incorporates them into the Standard. This is a very slow process, and there will be temporary usages in different regions of the Spanish-speaking world that conflict with each other. Some people speak about Spain-Spanish and American-Spanish, but most of the time it doesn't really make sense to make this distinction. First of all, even within America, there are differences between the local varieties that may be greater than the differences with Spain itself. E.g., Spanish as spoken in Mexico, Colombia, and Argentina may differ among themselves as much as each of them differs from the Spanish spoken in Spain. A computer user in Ecuador may feel more comfortable overall with the terms used in Spain than with the terms used in Mexico (and of course, most comfortable with the terms used in Ecuador itself!). The options are either to produce one Spanish version of a software product that is an acceptable compromise (maybe not perfect) for all Spanish-speaking countries or to produce multiple versions to account for all the regional variations.
A plea to all the people who are localizing software into Spanish: Let's use our efforts judiciously and create one Spanish version and not many. Let's strive for a version that conforms to the Standards and that can be as widely accepted as possible for the areas not covered by the Standards. Wouldn't you rather have a new product translated, instead of two versions of a product where one matches your local variety of the language?
Section written by Alexander Voropay (a.voropay@globalone.ru).
First of all, there are a lot of languages with Cyrillic script.
Slavic languages : Russian (ru), Ukrainian (uk), Belarussian (be), Bulgarian (bg), Serbian (sr), and Macedonian (mk).
Other Slavic languages (Polish (pl), Czech (cs), Croatian (hr)) use the Latin script, mainly with ISO-8859-2 (Central European).
During the USSR period some non-Slavic languages got their own alphabets, based on modified Cyrillic characters: Azerbaijani (az), Turkmen (tk), Kurdish (ku), Uzbek (uz), Kazakh (kk), Kirghiz (ky), Tajik (tg), Mongolian (mn), Komi (kv), etc.
Unicode has a rich Cyrillic section.
Unfortunately, there are a lot of 8-bit Cyrillic charsets. There is no single universal 8-bit Cyrillic charset because, for example, there are about 260 Cyrillic characters in the Adobe Glyph List.
For an overview, see "The Cyrillic Charset Soup".
The main problem with Russian is that there are at least six live charsets:
So Russian computers really live in a "charset mix", like the Japanese with Shift-JIS, ISO-2022-JP, and EUC-JP. You can get e-mail in any charset, so your mail agent should understand all of these charsets. Shikata ga nai (it can't be helped).
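The effect of the charset mix is easy to demonstrate. The Python sketch below encodes one Russian word in several Cyrillic charsets; the codec names are Python's own and were chosen for illustration, so they are not necessarily the same set of six charsets meant above. It then shows why a mail agent must honor the declared charset: decoding with the wrong one raises no error, it just yields garbage.

```python
word = "Привет"

# The same word has a different byte representation in each charset.
CHARSETS = ("koi8_r", "cp1251", "iso8859_5", "cp866", "mac_cyrillic", "utf_8")
encoded = {name: word.encode(name) for name in CHARSETS}
for name, data in encoded.items():
    print(f"{name:12} {data.hex()}")

def decode_mail(data, declared_charset):
    """Decode a message body using the charset declared by the sender."""
    return data.decode(declared_charset)

# Right charset: the text comes back intact.
print(decode_mail(encoded["koi8_r"], "koi8_r"))
# Wrong charset: still Cyrillic letters, but not the original word.
print(decode_mail(encoded["koi8_r"], "cp1251"))
```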
In a POSIX environment you should set the FULL locale name (with the .Charset field):
LANG=ru_RU.KOI8-R
LANG=ru_RU.ISO_8859-5
LANG=ru_RU.CP1251
and so on, for proper sorting, character classification, and readable messages. Any abbreviated form ("ru", "ru_RU", etc.) is a source of misunderstanding. I hope the Unicode locale LANG=ru_RU.UTF-8 will save us in the near future...
Introduction to i18n
14 February 2003
kubota@debian.org