
Re: Multibyte encoding - what should a package provide?



kubota> Please note, Unicode is not popular at all in Asia. I am sure
kubota> there are very very few people using Unicode in Japan. Instead,
kubota> EUC-JP is popular for UNIX and SHIFT-JIS is the OS's coding
kubota> system for Windows/Macintosh in Japan.  I guess EUC-KR is popular
kubota> in Korea (Am I right? -- I guessed from http://www.debian.org/index.ko.html).

i think it might help if the reasons for not liking unicode were
spelled out.

would anyone care to take this up?  even a reference specifying the
reasons in an asian language would be good for starters (someone can
translate it :-) ).

i've spent some time looking at this issue recently and i'm still not
certain of the reasons for the dislike.  here is my current
understanding.  please correct any mistakes.

  -there appear to be quite a few people who confuse unicode w/
   iso 10646 (i certainly didn't know the difference until i looked into
   it :-) ) -- if i understand correctly, unicode is pretty much
   a subset of iso 10646 -- it's basically ucs2, a fixed-width
   character set.  (there are different versions of unicode, so i presume
   one needs to be careful around phrases such as 'supports unicode')

  -ucs2 uses 16 bits -- that translates into 65536 (2^16) code points.
   this is not enough to cover all the characters used in asian
   languages.  perhaps some actual numbers would be convincing :-)
   assuming this is true, i think i see a reason for disliking unicode --
   it doesn't appear to be enough for everybody.

  -there is at least one other part of iso 10646 called ucs4 -- it uses
   31 (yes, 31) bits per character.  this provides about 2 billion 
   characters -- has anyone heard whether this is regarded by anyone as not 
   enough?

  -last i heard, the only part of ucs4 defined is essentially what is
   defined in ucs2<->unicode (perhaps old info).  this (combined w/ some 
   earlier statements) means that iso 10646 is not necessarily the answer 
   for (from what i gather) a fair number of folks who must deal w/ the asian
   locale (yet?) -- there might be enough slots for characters, but if the 
   ones you need aren't there yet, it's not very helpful.  perhaps it's
   a matter of time?

  -the current approach in unicode and iso 10646 is to treat certain
   characters (appearance - glyphs?) from different languages as the same
   character (byte representation - code point?).  supposedly, characters
   that look 'similar enough' (for some definition) are treated as the
   same character.

   the example i hear cited most often is kanji (roughly,
   ideographs) -- some kanji from different locales are treated as
   identical.  however, this is also true of characters used in
   european languages.  you can't tell an italian 'a' apart from an 
   english 'a' just by looking at the individual characters.  the approach
   appears to be at least consistent in this fashion.

   you might not care much in the case of 'a' because an italian
   'a' looks the same (last i checked) as an english 'a'.  but in the case
   of kanji, i believe this doesn't necessarily hold.  it's hard to
   give a concrete example in ascii text -- perhaps someone who is
   more familiar w/ the issues (or artistic?) can put up some .png images 
   somewhere to illustrate this point.

   i think this has consequences for trying to display documents that 
   contain characters from multiple languages -- for each character, how 
   do you decide which font to use if a character can be from several 
   different languages (and looks different depending on language?).

   you can probably come up w/ elaborate systems to deal w/ this, but it
   is not a simple matter of choosing a font based on only looking at
   each individual character.

   note that for european languages, it appears that no extra processing
   would be necessary for display because the characters look similar 
   enough.
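
to make the font question in the last item a bit more concrete, here is
a rough python sketch.  it assumes U+76F4 as the example (an ideograph
i've seen cited as being drawn differently in japanese and chinese
fonts); the FONTS table, the font names and pick_font are all made up
for illustration and don't refer to any real system.

```python
import unicodedata

# U+76F4 is an ideograph often cited as being drawn differently in
# japanese and chinese fonts; it still has exactly one code point.
ch = "\u76f4"
name = unicodedata.name(ch)
print(hex(ord(ch)), name)

# hypothetical language -> font table; the font names are placeholders
FONTS = {
    "ja": "placeholder-japanese-font",
    "zh": "placeholder-chinese-font",
}

def pick_font(char, lang, default="placeholder-fallback-font"):
    # char is deliberately unused: the code point alone cannot tell us
    # which language's glyph shape is wanted, so lang has to come from
    # outside the character (markup, locale, user preference, ...)
    return FONTS.get(lang, default)

print(pick_font(ch, "ja"))
print(pick_font(ch, "zh"))
```

the point of the sketch is only that the language has to be passed in
separately -- nothing in the character itself carries it.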

it might help for someone who knows better to also explain that asian
languages are basically getting no (or very little) backward
compatibility w/ existing encoding methods.  (e.g. for japanese, if
you were using euc-jp, iso-2022-jp, or shift-jis (ugh) before, you
basically have to use tables to convert to ucs2/ucs4 -- there are no
'nice' transformations)
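
a small python sketch of what i mean by tables vs. 'nice'
transformations.  it assumes python's built-in euc_jp, shift_jis and
iso2022_jp codecs, and uses one kanji (U+65E5, written as an escape to
stay in ascii).  the variable names and the euc_to_unicode helper are
just for illustration.

```python
ch = "\u65e5"  # a common kanji, written as an escape to keep this ascii

euc = ch.encode("euc_jp")
sjis = ch.encode("shift_jis")
iso = ch.encode("iso2022_jp")

# drop the 3-byte escape sequences that switch into and out of JIS X 0208
jis = iso[3:-3]

# euc-jp is the raw jis bytes with the high bit set: plain arithmetic
euc_from_jis = bytes(b | 0x80 for b in jis)

print("jis      ", jis.hex())
print("euc-jp   ", euc.hex())
print("shift-jis", sjis.hex())
print("unicode  ", format(ord(ch), "04x"))

def euc_to_unicode(data):
    # no arithmetic shortcut here: the codec looks each character up
    # in a mapping table
    return data.decode("euc_jp")

print(euc_to_unicode(euc) == ch)
```

going between jis and euc-jp is a bit flip, but the unicode value bears
no such relation to either, which is why a table is needed.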

perhaps someone who knows better can explain utf8 (a transformation
that can be performed on ucs2, ucs4, and utf16?) and utf16 (a way of
using parts of ucs2 and ucs4 together?).
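
in the meantime, here is my own rough python sketch of how i understand
the two to behave, so please treat it as a guess to be corrected.  it
assumes python's built-in 'utf-8' and 'utf-16-be' codecs; the constant
names and the surrogate_pair helper are made up here.  it also spells
out the 16-bit and 31-bit slot counts mentioned earlier.

```python
UCS2_SLOTS = 2 ** 16          # 16-bit code space
UCS4_SLOTS = 2 ** 31          # 31-bit code space
UTF16_SLOTS = 0x10FFFF + 1    # what utf16 can address with surrogates

def surrogate_pair(cp):
    # hypothetical helper: split a code point above 0xFFFF into the
    # two 16-bit units that utf16 uses for it
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

bmp = "\u65e5"          # fits in 16 bits
beyond = "\U00020000"   # a code point above 0xFFFF

for s in (bmp, beyond):
    print(hex(ord(s)),
          "utf8:", s.encode("utf-8").hex(),
          "utf16:", s.encode("utf-16-be").hex())

high, low = surrogate_pair(ord(beyond))
print(hex(high), hex(low))
print(UCS2_SLOTS, UCS4_SLOTS, UTF16_SLOTS)
```

both encodings start from the code point number: utf8 spends 1 to 4
bytes on the values shown here, and utf16 spends one or two 16-bit
units.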

here is a question for folks in the know: is using utf8 or utf16 seen
as not enough to deal w/ asian locales?  even once ucs4 becomes more
fully specified?

