On Sun, Jan 09, 2011 at 10:21:50PM +0000, Thorsten Glaser wrote:
> Roger Leigh dixit:
>
> >From my reading of the standards a UTF-8 C locale would be required
> >to behave identically to the existing ASCII C locale:
> >
> >• will consider all byte sequences valid
>
> I think it wouldn’t (since UTF-8 mbrtowc/wcrtomb don’t work
> this way, and it can’t be done with “just” the POSIX API
> anyway because they aren’t allowed to not read any input
> byte when outputting (in MirBSD, I’ve added a sister function
> to mbrtowc which can do that), so not everything can
> be accepted in all situations.

If you are using the multibyte functions, then I agree these are
special cases.  To function correctly, they do require valid input,
so they would of course fail on invalid sequences when run in a
UTF-8 C locale.  However, they should fail in an ASCII C locale as
well (I should test this), given that the wide character
representation is always UCS-4 on GNU/Linux and that, for example,
a latin1 sequence wouldn't be valid UTF-8.

I think the "all byte sequences valid" requirement applies mainly to
narrow character I/O, i.e. printf/puts etc. won't alter, drop or
otherwise mangle any codes outside 7-bit ASCII.  In other words, I
think the intent was to ensure 8-bit cleanliness in a 7-bit locale.
This naturally extends to UTF-8.  I'm not sure that wide character
support is implied here, given that it implicitly requires correct
byte sequences to function, where the narrow character I/O does not
(all 8-bit codes are correct).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
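
A rough sketch of the narrow versus multibyte distinction discussed in
this message, in case it helps with testing.  The helper names
count_wide_chars() and write_narrow() are made up for illustration;
only mbrtowc() and fwrite() are real library calls.  The first helper
goes through the multibyte decoder and so depends on the current
LC_CTYPE accepting the bytes; the second just hands bytes to stdio,
which does not interpret them.  How a given libc treats bytes above
0x7F in the plain C locale is exactly the open question here, so no
result is claimed for that case.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode n bytes with mbrtowc() under the current
 * LC_CTYPE.  Returns the number of wide characters decoded, or -1 at
 * the first sequence the locale rejects or that is left incomplete. */
static long count_wide_chars(const char *s, size_t n)
{
    mbstate_t st;
    long count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    while (n > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return -1;                  /* invalid or truncated sequence */
        if (r == 0)
            r = 1;                      /* treat an embedded NUL as one byte */
        s += r;
        n -= r;
        count++;
    }
    return count;
}

/* Hypothetical helper: the narrow path.  fwrite() copies the bytes to
 * the stream without decoding them, so validity never comes into it. */
static size_t write_narrow(FILE *out, const char *s, size_t n)
{
    assert(out != NULL && s != NULL);
    return fwrite(s, 1, n, out);
}
```

Calling count_wide_chars("caf\xE9", 4) after setlocale(LC_CTYPE, "")
in a UTF-8 locale should give -1, since the lone 0xE9 byte is not a
complete UTF-8 sequence, whereas write_narrow(stdout, "caf\xE9", 4)
reports all 4 bytes written.  Running the first call without
setlocale() shows what the local C locale does with the same input.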