
Re: Multibyte encoding - what should a package provide?



(Apologies to anyone who is getting this twice; I originally posted to a
linux.debian.devel newsgroup, but the setup seems to be one-way; mailing
list -> newsgroup.)


On 8 Sep 99 02:27:35 GMT, sen_ml@eccosys.com <sen_ml@eccosys.com> wrote:
>kubota> Please note, Unicode is not popular at all in Asia. I am sure
>kubota> there are very very few people using Unicode in Japan. Instead,
>kubota> EUC-JP is popular for UNIX and SHIFT-JIS is the OS's coding
>kubota> system for Windows/Macintosh in Japan.  I guess EUC-KR is popular
>kubota> in Korea (Am I right? -- I guessed from
>kubota> http://www.debian.org/index.ko.html).
>
>i think it might help if the reasons for not liking unicode were
>spelled out.
>
>would anyone care to take this up?  even a reference specifying the
>reasons in an asian language would be good for starters (someone can
>translate it :-) ).

I'll try to comment on some of the items below...

I will use the phrase "CJK character" to mean hanzi/kanji/hanja (or
whatever other term you'd like to use).


>i've spent some time looking at this issue recently and i'm still not
>certain of the reasons for the dislike.  here is my current
>understanding.  please correct any mistakes.
>
>  -there appear to be quite a few people who confuse unicode w/
>   iso 10646 (i certainly didn't know the difference until i looked into
>   it :-) ) -- if i understand correctly, unicode is pretty much
>   a subset of iso 10646 -- it's basically ucs2, a fixed-width
>   character set.  (there are different versions of unicode, so i presume
>   one needs to be careful around phrases such as 'supports unicode')

I don't think this confusion is that serious; while Unicode is technically
a subset of ISO 10646, their repertoires are currently the same, as the
two are growing in parallel.  Unicode does impose extra rules on how some
characters behave, though.

One thing that should be made clear is which version of Unicode (currently
2.1x, going on to 3.0) and which revision level of ISO 10646 (PDAMs?) one
is using.

Some people also prefer ISO 10646, thinking that Unicode is dominated
by US interests.

I will use the term "Unicode" below for ease of writing.


>  -ucs2 uses 16 bits -- that translates into about 65000 characters.
>   this is not enough characters to cover all asian languages.  perhaps
>   some actual numbers would be convincing :-)  assuming this is true,
>   i think i see a reason for disliking unicode -- it doesn't appear to
>   be enough for everybody.

This is one complaint, but the CJK characters in Unicode were compiled
from all the existing national character sets in use at the time (1993),
such as Japan's JIS X 0208 and JIS X 0212; South Korea's KS C 5601;
China's GB 2312; Taiwan's CNS 11643; etc., all of which contain far fewer
CJK characters than Unicode.  If a CJK character wasn't already in a
national standard, then it probably wasn't too important.  (And if it
really is crucial, why wasn't it fixed in the national standard(s)?)

The only exceptions are Vietnam and Hong Kong, which did not have
their CJK characters organized back in 1993 in time to be included.  Some,
but not all, are being added in CJK Extension A for Unicode 3.0.

There are also Surrogates, which extend the 16-bit space through a sort of
window, but I don't think anyone has implemented them yet.
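
The mechanics, for the curious, are simple arithmetic; here is a minimal
sketch in Python (the example codepoint U+20000 is hypothetical, since
nothing is assigned out there yet):

    def to_surrogates(cp):
        """Split a codepoint beyond U+FFFF into a high/low surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000                        # 20 bits remain
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    to_surrogates(0x20000)   # -> (0xD840, 0xDC00)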


>  -there is at least one other part of iso 10646 called ucs4 -- it uses
>   31 (yes, 31) bits per character.  this provides about 2 billion
>   characters -- has anyone heard whether this is regarded by anyone as not
>   enough?

This should be enough for East Asian use.  The count of CJK characters tops
out around 50,000 to 90,000, and a lot of those are very, very obscure.

The largest national character set in non-specialist use is Taiwan's Big5,
at ~13,000 CJK characters.  (But only 3000-4000 are needed for everyday
usage; maybe 5000-6000 if one is into literature or history.)  In
comparison, Unicode contains ~21,000 CJK characters, even after treating
similar-looking CJK characters as the same (see below).


>  -last i heard, the only part of ucs4 defined is essentially what is
>   defined in ucs2<->unicode (perhaps old info).  this (combined w/ some
>   earlier statements) means that iso 10646 is not necessarily the answer
>   for (from what i gather) a fair number of folks who must deal w/ the asian
>   locale (yet?) -- there might be enough slots for characters, but if the
>   ones you need aren't there yet, it's not very helpful.  perhaps it's
>   a matter of time?

It is a matter of time for anyone who wants their scripts and characters
to be included.  Submit your proposal...

http://www.unicode.org/pending/proposals.html


>  -the current approach in unicode and iso 10646 is to treat certain
>   characters (appearance - glyphs?) from different languages as the same
>   character (byte representation - code point?).  supposedly, 'similar-
>   looking enough' (for some definition) characters are treated as the same
>   character.
>
>   the most often cited example i hear of is for kanji (roughly,
>   ideographs) -- some kanji from different locales are treated as
>   identical.  however, this is also true of characters used in
>   european languages.  you can't tell an italian 'a' apart from an
>   english 'a' just by looking at the individual characters.  the approach
>   appears to be at least consistent in this fashion.
>
>   you might not care much in the case of 'a' because an italian
>   'a' looks the same (last i checked) as an english 'a'.  but in the case
>   of kanji, i believe this doesn't necessarily hold.  it's hard to
>   give a concrete example in ascii text -- perhaps someone who is
>   more familiar w/ the issues (or artistic?) can put up some .png images
>   somewhere to illustrate this point.

This is true, but it ranks below backward compatibility (round-trip
conversion with existing character sets) in priority.

However, this is one point that draws criticism.  E.g., the CJK character
meaning 'one' for Chinese yi, Japanese ichi, and Korean il is codepoint
U+4E00.  Some people think there should be separate characters: a
Chinese one, a Japanese one, and a Korean one.  While probably few dispute
'one', which looks like a horizontal line, there are some CJK characters
that some feel should not have been treated as the same because of
subtle differences.  They argue that Roman "A" and Cyrillic "A" were
not treated as the same, even though they look similar enough.

With the current approach of treating similar CJK characters as the same,
there are ~21,000 in Unicode, condensed from an initial set of about
120,000.
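
A minimal sketch of what unification means in practice, using Python's
codec tables purely for illustration: the 'one' character from three
different national character sets all converts to the single codepoint
U+4E00.

    # 'one' in three legacy encodings -> one unified Unicode codepoint
    for raw, enc in [(b'\xb0\xec', 'euc_jp'),    # JIS X 0208
                     (b'\xd2\xbb', 'gb2312'),    # GB 2312
                     (b'\xa4\x40', 'big5')]:     # Big5
        assert ord(raw.decode(enc)) == 0x4E00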


>   i think this has consequences for trying to display documents that
>   contain characters from multiple languages -- for each character, how
>   do you decide which font to use if a character can be from several
>   different languages (and looks different depending on language?).
>
>   you can probably come up w/ elaborate systems to deal w/ this, but it
>   is not a simple matter of choosing a font based on only looking at
>   each individual character.

Part of the problem with this (continuing from above) is that some people
still make monolithic fonts, so when it comes time to draw the CJK
characters, they must pick one East Asian country's standard for drawing
them.  For example, the unifont .deb contains a Unicode font which has
primarily Japanese-looking CJK characters.  In the hardcopy Unicode 2.0
book, the font used for printing was a Chinese font, which has led
Japanese readers to think the CJK characters in Unicode look "too Chinese"
and are therefore unusable.  (This has been rectified by the online database
at http://charts.unicode.org/unihan.html , which shows examples of
how the same CJK character looks in each East Asian country, but
the damage has unfortunately already been done.)

There are some "elaborate designs", such as marking text with a language,
and then choosing an appropriate looking font, or appropriate glyphs
from a font.
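
To make the idea concrete, here is a toy sketch (all the font names and
the render() call are hypothetical) of language-tagged text driving the
font choice for the very same codepoint:

    # The same U+4E00 is drawn with a locale-appropriate font,
    # chosen from the language tag attached to each run of text.
    FONT_FOR_LANG = {'ja': 'Mincho', 'zh-TW': 'Ming', 'ko': 'Batang'}
    runs = [('ja', '\u4e00'), ('zh-TW', '\u4e00')]
    for lang, text in runs:
        font = FONT_FOR_LANG.get(lang, 'some-default-font')
        # render(text, font)   # hypothetical rendering call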

Another fallacy in the way fonts are still being created is the assumption
of one glyph per codepoint, which does not work for any script with
combining letters, such as Thai, Arabic, IPA, decomposed Hangul, etc.,
where a font will actually have more glyphs than codepoints, for
purposes of contextual rendering.


>   note that for european languages, it appears that no extra processing
>   would be necessary for display because the characters look similar
>   enough.

I know of one exception: Cyrillic and Glagolitic are treated as the
same.  But no one seems to complain, because Cyrillic is a living script,
and the other isn't.


>it might help for someone who knows better to also explain that asian
>languages are basically getting no (or very little) backward
>compatibility w/ existing encoding methods.  (e.g. for japanese, if
>you were using euc-jp, iso-2022-jp, or shift-jis (ugh) before, you
>basically have to use tables to convert to ucs2/ucs4 -- there are no
>'nice' transformations)

This is correct.  The only way to do it is through lookup tables.

However, only US-ASCII and ISO 8859-1 are forward compatible with Unicode;
other European character sets in use today still require lookup table
conversion as well.  (Any comments from ISO 8859-2 or KOI-8 users?)
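
A small illustration (again leaning on Python's codecs) of why only
US-ASCII and ISO 8859-1 get a free ride: a Latin-1 byte value equals its
Unicode codepoint, while other 8-bit sets reuse the same byte range for
different characters, so a table is unavoidable.

    assert ord(b'\xb1'.decode('latin-1'))   == 0x00B1  # '±': byte == codepoint
    assert ord(b'\xb1'.decode('iso8859-2')) == 0x0105  # 'ą': table needed
    assert ord(b'\xc1'.decode('koi8_r'))    == 0x0430  # Cyrillic 'а': table needed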


>perhaps someone who knows better can explain utf8 (a transformation
>that can be performed on ucs2, ucs4, and utf16?) and utf16 (a way of
>using parts of ucs2 and ucs4 together?).

There's a manpage for "utf-8(7)" by Markus Kuhn.  The other Unicode-related
manpages by him are generally out of date, though (Unicode 1.1 or so).

UTF-8 encoding of the Unicode character set is the most popular,
as it doesn't get broken as badly by legacy software.  However, it
stores text as 1-3 bytes: ASCII as one byte, the ISO 8859-1 range as
two bytes, and CJK characters happen to take three bytes.  So, UTF-8
penalizes anyone who doesn't use ASCII.
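
The packing itself is easy to show; a minimal sketch of a UTF-8 encoder
for the 16-bit range (not a production implementation):

    def utf8_encode(cp):
        # 1 byte:  0xxxxxxx
        if cp < 0x80:
            return bytes([cp])
        # 2 bytes: 110xxxxx 10xxxxxx
        if cp < 0x800:
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])

    utf8_encode(0x41)    # b'A'             (1 byte, ASCII)
    utf8_encode(0xE9)    # b'\xc3\xa9'      (2 bytes, ISO 8859-1 range)
    utf8_encode(0x4E00)  # b'\xe4\xb8\x80'  (3 bytes, CJK)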

UCS-2 is a lot nicer: two bytes for every character, consistently (easier
to program, compute file sizes, move back and forth in text, etc.) --
this is what Word 97 uses.  Some European users happy with 8-bit
character sets may balk at the doubling of their text, but the same
problem exists with UTF-8, too.
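
For contrast, the UCS-2 version of the same characters, as a sketch:

    def ucs2_encode(cp):
        assert cp <= 0xFFFF                  # UCS-2 stops at the 16-bit fence
        return bytes([cp >> 8, cp & 0xFF])   # big-endian, always two bytes

    ucs2_encode(0x41)    # b'\x00A'  -- ASCII doubles in size
    ucs2_encode(0x4E00)  # b'N\x00'  -- a CJK character fits in the same two bytes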

It's not just an issue of Unicode being unpopular in Asia, but
in Europe as well.  Aren't the other localized Debian websites
in 8-bit character sets like ISO 8859-1, ISO 8859-2, KOI-8, etc.?


>here is a question for folks in-the-know, is using utf8 on utf16 seen
>as not enough to deal w/ asian locales?  even once ucs4 becomes more
>fully specified?

Unicode's FAQ:
http://www.unicode.org/unicode/faq/index-2.html

A good book (well, practically the only one in the western world)
on East Asian i18n/l10n issues is Ken Lunde's _CJKV Information
Processing_ (1998), from O'Reilly.   ISBN 1-56592-224-7.


Thomas Chan
tc31@cornell.edu



