Reading a file with unknown encoding

To: Debian-Ruby <debian-ruby@lists.debian.org>
Subject: Reading a file with unknown encoding
From: Francesco Poli <invernomuto@paranoici.org>
Date: Thu, 6 Sep 2018 00:12:57 +0200
Message-id: <20180906001257.1ca794cdb6e9e27027d02f0e@paranoici.org>

Hello Debian Ruby experts,
I have a question related to encodings in Ruby.
Maybe the question is better suited to the Ruby language mailing lists,
but, since the issue arises in apt-listbugs (which is a Debian native
package) and you are all nice, helpful and knowledgeable, I thought I
could ask here...


Description of the issue
========================

apt-listbugs reads a file ("ignore_bugs") where some bug numbers
and/or package names are written, along with comments beginning
with the '#' character.

A generic file in the same format could look like:

  $ cat ignore_bugs
  # first bug
  123456
  # secönd bug
  234567
  # a package
  my-package0+

This file is usually encoded in the same encoding used by the
environment where apt-listbugs runs, so there is normally no encoding
issue.

  $ file ignore_bugs 
  ignore_bugs: UTF-8 Unicode text

The code that reads this file is similar to the following
minimal example script (except for the "p" debug statements,
of course):

  $ cat read_ignore_bugs.rb 
  #!/usr/bin/ruby
  
  p ["Default external encoding:", Encoding.default_external]
  puts "========="
  
  noncomments = []
  
  open("ignore_bugs").each { |line|
    p [line.encoding, line]
    if /^\s*#/ =~ line
      next
    end
    if /^\s*(\S+)/ =~ line
      noncomments << $1
    end
  }
  
  puts "========="
  noncomments.each { |elem|
    p [elem.encoding, elem]
  }

Running this script in a UTF-8 locale does not pose any issues:

  $ ./read_ignore_bugs.rb 
  ["Default external encoding:", #<Encoding:UTF-8>]
  =========
  [#<Encoding:UTF-8>, "# first bug\n"]
  [#<Encoding:UTF-8>, "123456\n"]
  [#<Encoding:UTF-8>, "# secönd bug\n"]
  [#<Encoding:UTF-8>, "234567\n"]
  [#<Encoding:UTF-8>, "# a package\n"]
  [#<Encoding:UTF-8>, "my-package0+\n"]
  =========
  [#<Encoding:UTF-8>, "123456"]
  [#<Encoding:UTF-8>, "234567"]
  [#<Encoding:UTF-8>, "my-package0+"]

However, there may be unusual cases where the file is written in one
encoding, but then read by apt-listbugs in an environment with
different locale settings, and hence a different default external
encoding.
For instance, the file may be encoded in UTF-8 (either because it was
written by hand with an editor running in a UTF-8 locale, or because it
was written by apt-listbugs itself, when running in a UTF-8 locale), but
then read by a later run of apt-listbugs in a US-ASCII locale
(maybe because LC_ALL=C was set).
With this encoding mismatch, an ArgumentError is raised as soon as a
regular expression is matched against a line containing a byte
sequence that is invalid in the current default external encoding.

  $ LC_ALL=C ./read_ignore_bugs.rb 
  ["Default external encoding:", #<Encoding:US-ASCII>]
  =========
  [#<Encoding:US-ASCII>, "# first bug\n"]
  [#<Encoding:US-ASCII>, "123456\n"]
  [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n"]
  Traceback (most recent call last):
          2: from ./read_ignore_bugs.rb:8:in `<main>'
          1: from ./read_ignore_bugs.rb:8:in `each'
  ./read_ignore_bugs.rb:10:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)
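
Note that the line itself is read and printed without complaint; it is
the regexp match on line 10 of the script that raises. As a minimal
sketch (not code from apt-listbugs; the helper name comment_match_error
is made up for illustration), the same failure can be reproduced in any
locale by labelling the UTF-8 bytes as US-ASCII by hand:

```ruby
# Hypothetical helper, for illustration only: returns the message of the
# ArgumentError raised while matching the comment regexp, or nil if the
# match does not raise.
def comment_match_error(line)
  /^\s*#/ =~ line
  nil
rescue ArgumentError => e
  e.message
end

# The UTF-8 bytes of "# secönd bug\n", labelled as US-ASCII: this mimics
# what a line read from the file looks like under LC_ALL=C.
line = "# sec\xC3\xB6nd bug\n".dup.force_encoding(Encoding::US_ASCII)

p line.encoding               # the label attached to the bytes
p line.valid_encoding?        # whether the bytes are valid under that label
p comment_match_error(line)   # error message from the regexp match, if any
```

String#valid_encoding? gives a way to detect such a line before any
regexp is applied to it.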


The problem is that the actual encoding of the file is unknown and
unpredictable...


Proposed strategy
=================

I've been thinking about a way to prevent apt-listbugs from
barfing in those unusual cases.

Since any non-US-ASCII characters, if present at all, will be in
the comment lines (assuming the format of the file is valid!),
it does not really matter whether apt-listbugs is able to
represent those characters correctly.
The comment lines will be skipped as soon as they are detected as such.
 
Hence I thought I could do the following:

  $ cat read_ignore_bugs_encode.rb 
  #!/usr/bin/ruby
  
  p ["Default external encoding:", Encoding.default_external]
  puts "========="
  
  noncomments = []
  
  open("ignore_bugs").each { |line|
    enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
    p [line.encoding, line, enc.encoding, enc]
    if /^\s*#/ =~ enc
      next
    end
    if /^\s*(\S+)/ =~ enc
      noncomments << $1
    end
  }
  
  puts "========="
  noncomments.each { |elem|
    p [elem.encoding, elem]
  }


This seems to work normally, when run in the same locale where the
"ignore_bugs" file was created:

  $ ./read_ignore_bugs_encode.rb
  ["Default external encoding:", #<Encoding:UTF-8>]
  =========
  [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
  [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
  [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
  [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
  [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
  [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
  =========
  [#<Encoding:UTF-8>, "123456"]
  [#<Encoding:UTF-8>, "234567"]
  [#<Encoding:UTF-8>, "my-package0+"]

but also when run in a more limited locale:

  $ LC_ALL=C ./read_ignore_bugs_encode.rb
  ["Default external encoding:", #<Encoding:US-ASCII>]
  =========
  [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
  [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
  [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
  [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
  [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
  [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
  =========
  [#<Encoding:US-ASCII>, "123456"]
  [#<Encoding:US-ASCII>, "234567"]
  [#<Encoding:US-ASCII>, "my-package0+"]
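
For comparison, here is a sketch of the same idea using String#scrub
(available since Ruby 2.1), which replaces invalid byte sequences
without going through encode. This is only an alternative to consider,
not what apt-listbugs does; the method name noncomment_tokens and the
"?" replacement are assumptions made for the example, and the input
simulates the C locale by labelling the bytes as US-ASCII by hand:

```ruby
# Hypothetical helper, not taken from apt-listbugs: returns the first
# whitespace-delimited token of every non-comment line.
def noncomment_tokens(lines)
  tokens = []
  lines.each do |line|
    # Replace byte sequences that are invalid in the line's own encoding,
    # so that the regexp matches below cannot raise ArgumentError.
    clean = line.scrub("?")
    next if /^\s*#/ =~ clean
    tokens << $1 if /^\s*(\S+)/ =~ clean
  end
  tokens
end

# The example file, with each line labelled as US-ASCII to simulate
# reading a UTF-8 file under LC_ALL=C.
lines = [
  "# first bug\n",
  "123456\n",
  "# sec\xC3\xB6nd bug\n",
  "234567\n",
  "# a package\n",
  "my-package0+\n",
].map { |l| l.dup.force_encoding(Encoding::US_ASCII) }

p noncomment_tokens(lines)   # prints ["123456", "234567", "my-package0+"]
```

Either way, the replacement only alters lines that contain invalid
bytes, which under the assumption above are comment lines.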


What do you think?
Is the strategy described above reasonable?
Or do you see a flaw that will backfire in the future?

Thanks for reading so far and for any help you may provide!


P.S.: Please Cc me on replies, as I am not subscribed to the list.
      Thanks for your understanding! 

-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
