Introduction to i18n
Chapter 10 - the Internet

The Internet is a worldwide network of computers, so the text data exchanged over it must be internationalized.

The concept of internationalization did not exist at the dawn of the Internet, since the Internet was developed in the US. Internationalized protocols used on the Internet were therefore developed to be upward-compatible with the existing protocols.

One of the key technologies for internationalizing Internet data exchange is MIME.

10.1 Mail/News

Internet mail uses the SMTP (RFC 821) and ESMTP (RFC 1869) protocols. SMTP is a 7-bit protocol, while ESMTP can carry 8-bit data when the server offers the 8BITMIME extension (RFC 1652).

The original SMTP can send only ASCII characters. Thus non-ASCII characters (ISO 8859-*, Asian characters, and so on) have to be converted into ASCII characters.

MIME (RFC 2045, 2046, 2047, 2048, and 2049) deals with this problem.

First, RFC 2045 defines three new headers: MIME-Version, Content-Type, and Content-Transfer-Encoding.

The MIME version is currently 1.0, so every MIME mail has a header like this:

     MIME-Version: 1.0

Content-Type describes the type of the content. For example, a typical mail with Japanese text has a header like this:

     Content-Type: text/plain; charset="iso-2022-jp"

Available types are described in RFC 2046.

Content-Transfer-Encoding describes how the content is converted for transport. Available values are BINARY, 7bit, 8bit, BASE64, and QUOTED-PRINTABLE. Since SMTP cannot handle 8-bit data, 8bit and BINARY cannot be used with it; ESMTP can use them when the server supports the corresponding extensions. Base64 and quoted-printable are ways to convert 8-bit data into 7-bit data, and 8-bit data has to be converted with one of them to be sent by SMTP.
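The two conversions can be tried with Python's standard library. The following is a minimal sketch, not taken from any RFC; the sample text and the helper name is_7bit are assumptions chosen for illustration.

```python
import base64
import quopri

# Sample 8-bit data: "café" in ISO-8859-1 (the byte 0xE9 is non-ASCII).
raw = "café".encode("iso-8859-1")

# Base64 maps every 3 input bytes to 4 ASCII characters.
b64 = base64.b64encode(raw)

# Quoted-printable keeps most ASCII bytes as they are and writes
# the other bytes as =XX, where XX is the hexadecimal byte value.
qp = quopri.encodestring(raw)


def is_7bit(data: bytes) -> bool:
    """Return True if every byte of data fits in 7 bits."""
    return all(byte < 0x80 for byte in data)


print(is_7bit(raw), is_7bit(b64), is_7bit(qp))

# Both conversions are reversible, so the receiver can restore the data.
restored_b64 = base64.b64decode(b64)
restored_qp = quopri.decodestring(qp)
```

Quoted-printable stays mostly readable when the text is largely ASCII, while base64 is usually the more compact choice for text that is mostly non-ASCII.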

RFC 2046 describes the media types and subtypes for the Content-Type header. The available discrete types are text, image, audio, video, and application. We are interested in text here because we are discussing i18n. Subtypes of text include plain, enriched, html, and so on. A charset parameter can also be added to specify the encoding. RFC 2046 defines US-ASCII, ISO-8859-1, ISO-8859-2, ..., ISO-8859-10 as charset values. This list can be extended by writing a new RFC.
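As an illustration, Python's email package can build a message carrying these headers. This is only a sketch under the assumption that the standard library is used; the body text is a made-up sample and is not taken from this document.

```python
from email.mime.text import MIMEText

# Hypothetical body text ("Hello" in Japanese), used only for illustration.
body = "こんにちは"

# MIMEText fills in MIME-Version, Content-Type, and
# Content-Transfer-Encoding from the subtype and charset given here.
msg = MIMEText(body, "plain", "iso-2022-jp")

for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{name}: {msg[name]}")

# The stored payload is the ISO-2022-JP byte stream, which consists of
# 7-bit bytes only (escape sequences switch between character sets).
payload = msg.get_payload(decode=True)
```

Because ISO-2022-JP is itself a 7-bit encoding, such a body can travel over plain SMTP without base64 or quoted-printable.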

RFC 2045 and RFC 2046 determine how to write non-ASCII characters in the main text of a mail. RFC 2047, on the other hand, describes 'encoded words', which are the way to write non-ASCII characters in headers. An encoded word looks like this: =?encoding?conversion algorithm?data?=, where encoding is chosen from the charset values of the Content-Type header, conversion algorithm is Q or q for a variant of quoted-printable or B or b for base64, and data is the encoded data. An encoded word, including its delimiters, must not be longer than 75 characters; longer text must be divided into multiple encoded words. For example,

     Subject: =?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?=

reads 'a subject written in Kanji' in Japanese (ISO-2022-JP, encoded by base64). Of course, a human cannot read it as is.
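A mail program does this decoding on behalf of the reader. The following sketch assumes Python's standard email.header module; the variable names are illustrative only.

```python
from email.header import Header, decode_header, make_header

# The Subject value from the example above.
raw_subject = "=?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?="

# decode_header() splits the value into (bytes, charset) pairs;
# make_header() joins the pairs back into readable Unicode text.
parts = decode_header(raw_subject)
subject = str(make_header(parts))
print(subject)

# The reverse direction: Header.encode() produces encoded words and
# splits long values into several of them.
encoded = Header(subject, "iso-2022-jp").encode()
print(encoded)
```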

10.2 WWW

WWW is a system in which HTML documents (mainly, along with files in other formats) are transferred using HTTP.

HTTP is defined by RFC 2068. Like mail, HTTP uses headers, and the Content-Type header describes the type of the content. Though a charset parameter can be given in this header, it is rarely used.

RFC 1866 states that the default encoding for HTML is ISO-8859-1. However, many web pages are written in, for example, Japanese or Korean, using (of course) encodings other than ISO-8859-1. Sometimes the HTML document contains:

     <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">

which declares that the page is written in ISO-2022-JP. However, there are many pages without any declaration of encoding.
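One way a client might choose an encoding is to try the HTTP header first, then a META declaration, and finally the default. The following is a rough sketch under that assumption; guess_charset and the simplified META pattern are hypothetical and do not describe what any particular browser does.

```python
import re
from email.message import Message

# Default encoding for HTML according to RFC 1866, as described above.
DEFAULT_CHARSET = "iso-8859-1"

# Simplified pattern for charset=... inside a <META> tag. This is an
# assumption of the sketch; real pages vary in quoting and layout.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset=[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def guess_charset(content_type_header, html_bytes):
    """Pick a charset: HTTP header first, then META, then the default."""
    if content_type_header:
        # Reuse the mail header parser, since HTTP headers look like mail headers.
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset
    # Only the beginning of the page is searched for a declaration.
    match = META_CHARSET.search(html_bytes[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return DEFAULT_CHARSET
```

For a page with no declaration at all, this sketch simply falls back to ISO-8859-1, which is exactly the case that goes wrong for Japanese or Korean pages.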

Web browsers have to cope with this situation. Ideally, web browsers should be able to handle every encoding listed for MIME. However, many web browsers can handle only ASCII or ISO-8859-1. Such browsers are of no use at all to people who need other encodings.

A URL should be written in ASCII characters, though non-ASCII characters can be expressed using %nn sequences, where nn is a hexadecimal value. The restriction exists because there is no way to specify the encoding. Western European people would interpret such a sequence as ISO-8859-1, while Japanese people would interpret it as EUC-JP or SHIFT-JIS.
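The ambiguity can be demonstrated with Python's urllib.parse module. This is a small sketch; the sample text is an assumption chosen for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical path component ("Japan" in Japanese), for illustration only.
text = "日本"

# The same characters give different %nn sequences under different encodings.
as_euc = quote(text, encoding="euc-jp")
as_sjis = quote(text, encoding="shift_jis")
print(as_euc)
print(as_sjis)

# A receiver that assumes the wrong encoding gets different text back.
right = unquote(as_euc, encoding="euc-jp")
wrong = unquote(as_euc, encoding="iso-8859-1")
print(right == text, wrong == text)
```

Nothing in the %nn sequence itself tells the receiver which of these interpretations was intended.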

25 April 2014

Tomohiro KUBOTA debian at tmail dot plala dot or dot jp (retired DD)