The Internet is a world-wide network of computers. Thus the text data exchanged via the Internet must be internationalized.
The concept of internationalization did not exist at the dawn of the Internet, since the Internet was developed in the US. Protocols used on the Internet were developed to be upward-compatible with the existing protocols.
One of the key technologies for the internationalization of Internet data exchange is MIME.
Internet mail uses the SMTP (RFC 821) and ESMTP (RFC 1869) protocols. SMTP is a 7bit protocol and ESMTP is 8bit.
The original SMTP can only send ASCII characters. Thus non-ASCII characters (ISO 8859-*, Asian characters, and so on) have to be converted into ASCII characters.
MIME (RFC 2045, 2046, 2047, 2048, and 2049) deals with this problem.
First, RFC 2045 defines three new headers: MIME-Version, Content-Type, and Content-Transfer-Encoding.
The current MIME-Version is 1.0, and thus all MIME mails have a header like this:
MIME-Version: 1.0
Content-Type describes the type of content. For example, a typical mail with Japanese text has a header like this:
Content-Type: text/plain; charset="iso-2022-jp"
Available types are described in RFC 2046.
Content-Transfer-Encoding describes the way the contents are converted. Available values are BINARY, 7bit, 8bit, BASE64, and QUOTED-PRINTABLE. Since SMTP cannot handle 8bit data, 8bit and BINARY cannot be used with it; ESMTP can use them. Base64 and quoted-printable are ways to convert 8bit data into 7bit data, and 8bit data has to be converted using either of them to be sent by SMTP.
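The two conversions can be tried with Python's standard library. The following is only a minimal sketch: the sample word and the helper name is_7bit are assumptions made for illustration and do not come from the RFCs.

```python
import base64
import quopri

# Hypothetical sample: "café" in ISO-8859-1 contains one 8bit byte (0xE9).
raw = "café".encode("iso-8859-1")

def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (illustrative helper)."""
    return all(byte < 0x80 for byte in data)

# Both encodings turn the 8bit input into pure 7bit (ASCII) output.
b64 = base64.b64encode(raw)
qp = quopri.encodestring(raw)

print("raw    :", raw, "7bit?", is_7bit(raw))
print("base64 :", b64, "7bit?", is_7bit(b64))
print("qp     :", qp, "7bit?", is_7bit(qp))

# Decoding restores the original 8bit bytes.
restored_from_b64 = base64.b64decode(b64)
restored_from_qp = quopri.decodestring(qp)
```

Quoted-printable leaves the ASCII part readable and escapes only the 8bit byte, so it suits text that is mostly ASCII. Base64 re-encodes everything and suits data that is mostly non-ASCII.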
RFC 2046 describes the media types and subtypes for the Content-Type header. Available types are text, image, audio, video, and application. Here we are interested in text, because we are discussing i18n. Subtypes for text are plain, enriched, html, and so on. A charset parameter can also be added to specify the encoding. US-ASCII, ISO-8859-1, ISO-8859-2, ..., ISO-8859-10 are defined by RFC 2046 for charset. This list can be extended by writing a new RFC:
RFC 1468: ISO-2022-JP
RFC 1554: ISO-2022-JP-2
RFC 1557: ISO-2022-KR
RFC 1922: ISO-2022-CN
RFC 1922: ISO-2022-CN-EXT
RFC 1842: HZ-GB-2312
RFC 1641: UNICODE-1-1
RFC 1642: UNICODE-1-1-UTF-7
RFC 1815: ISO-10646-1
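As an illustration of how the three headers and the charset parameter fit together, here is a minimal sketch using Python's standard email package. The sample text is an assumption made for illustration, and the header values are whatever this library chooses, not something mandated by the text above.

```python
from email.mime.text import MIMEText

# Hypothetical sample body in Japanese.
text = "日本語のテキスト"

# Build a text/plain part whose body is encoded in ISO-2022-JP.
msg = MIMEText(text, "plain", "iso-2022-jp")

# Show the three MIME headers discussed above.
for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{name}: {msg[name]}")

# The body as bytes, after undoing any content transfer encoding.
body = msg.get_payload(decode=True)
print(body)
```

ISO-2022-JP uses only 7bit bytes (it switches character sets with escape sequences), so a body in this charset can be sent without base64 or quoted-printable.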
RFC 2045 and RFC 2046 determine the way to write non-ASCII characters in the main text of mail. On the other hand, RFC 2047 describes 'encoded words', which are the way to write non-ASCII characters in headers. An encoded word looks like this:
=?encoding?conversion algorithm?data?=
where encoding is selected from the list of charset values for the Content-Type header, conversion algorithm is Q or q for quoted-printable or B or b for base64, and data is the encoded data. An encoded word, including the encoding, the conversion algorithm, and the delimiters, must not be longer than 75 bytes. Longer text must be divided into multiple encoded words. For example,
Subject: =?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?=
reads 'a subject written in Kanji' in Japanese (ISO-2022-JP, encoded with base64). Of course, a human cannot read it directly.
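The encoded word in this Subject header can be decoded mechanically. The sketch below uses decode_header from Python's standard email.header module. The variable names are illustrative, and the fallback to ASCII for parts without a charset is an assumption of this sketch.

```python
from email.header import decode_header

# The Subject value from the example above.
subject = "=?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?="

decoded_parts = []
for data, charset in decode_header(subject):
    # Encoded words come back as bytes together with their charset name.
    if isinstance(data, bytes):
        data = data.decode(charset or "ascii")
    decoded_parts.append(data)

decoded_subject = "".join(decoded_parts)
print(decoded_subject)
```

decode_header only undoes the base64 (or Q) step and reports the charset. Converting the resulting bytes to text is a separate step, which is why the charset name has to travel inside the encoded word.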
The WWW is a system in which HTML documents (mainly, along with files in other formats) are transferred using the HTTP protocol.
The HTTP protocol is defined by RFC 2068. HTTP uses headers like mail does, and the Content-Type header is used to describe the type of the contents. Though a charset parameter can be given in the header, it is rarely used.
RFC 1866 states that the default encoding for HTML is ISO-8859-1. However, many web pages are written in, for example, Japanese or Korean, using (of course) encodings other than ISO-8859-1. Sometimes the HTML document contains a declaration such as:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
which declares that the page is written in ISO-2022-JP. However, there are many pages without any declaration of encoding.
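A program that reads such pages might look for this declaration and fall back to the ISO-8859-1 default when it is missing. The following is a rough sketch only. The class MetaCharsetParser and the function declared_charset are hypothetical names, and real browsers use considerably more elaborate detection.

```python
from email.message import Message
from html.parser import HTMLParser

class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first META Content-Type declaration."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            # Reuse the mail header parser to pick out the charset parameter.
            header = Message()
            header["Content-Type"] = attributes.get("content") or ""
            self.charset = header.get_content_charset()

def declared_charset(html_bytes: bytes, default: str = "iso-8859-1") -> str:
    """Return the declared charset, or the default if none is declared."""
    parser = MetaCharsetParser()
    # The declaration itself is ASCII, so a lossy ASCII decode suffices here.
    parser.feed(html_bytes.decode("ascii", errors="replace"))
    return parser.charset or default

page = b'<html><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"></head></html>'
print(declared_charset(page))
print(declared_charset(b"<html><body>no declaration</body></html>"))
```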
Web browsers have to deal with such circumstances. Of course, web browsers have to be able to deal with every encoding in the world that is listed for MIME. However, many web browsers can only deal with ASCII or ISO-8859-1. Such web browsers are completely useless for people who use non-ASCII or non-ISO-8859-1 text.
A URL should be written in ASCII characters, though non-ASCII characters can be expressed using %nn sequences, where nn is a hexadecimal value. This is because there is no way to specify the encoding. Western European people would treat it as ISO-8859-1, while Japanese people would treat it as EUC-JP or SHIFT-JIS.
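The ambiguity can be seen by decoding one and the same %nn sequence under two different assumptions. This is a small sketch using urllib.parse from Python's standard library; the escaped string is a made-up example, not taken from any real URL.

```python
from urllib.parse import quote, unquote

# Hypothetical escaped path segment; the URL itself does not say
# which encoding these four bytes belong to.
escaped = "%C6%FC%CB%DC"

# The same bytes read under two different assumptions.
as_euc_jp = unquote(escaped, encoding="euc_jp")
as_latin1 = unquote(escaped, encoding="iso-8859-1")
print("as EUC-JP    :", as_euc_jp)
print("as ISO-8859-1:", as_latin1)

# In the other direction, one word yields different escapes per encoding.
word = "日本"
escaped_euc_jp = quote(word, encoding="euc_jp")
escaped_shift_jis = quote(word, encoding="shift_jis")
print(escaped_euc_jp)
print(escaped_shift_jis)
```

Both readings are valid from the URL's point of view, so the receiver can only guess which one the sender meant.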
Introduction to i18n
14 February 2003 kubota@debian.org