Feature #1106
closed
Script encoding vs. default_internal: Implicitly transcode strings/regexps
Description
=begin
If I'm not mistaken, a related issue was discussed in the past (e.g. [1]). Anyway, please take a second to consider the following scripts and input file:
FILE: test2.rb
# encoding: UTF-8
Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8
require 'test2a'
File.readlines('test2.txt').each do |line|
  p line, test2a(line)
end
FILE: test2a.rb
# encoding: ISO-8859-1
p __ENCODING__
def test2a(x)
  x =~ /[äöüÄÖÜß]/
end
FILE: test2.txt (UTF-8 byte sequences; the second line should read "weiß", the third one "Bär")
foo
weiß
Bär
bar
If I run
$ ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-cygwin]
$ ruby test2.rb
#<Encoding:ISO-8859-1>
"foo\n"
nil
/home/t/src/tmp/test2a.rb:6:in `test2a': invalid byte sequence in UTF-8 (ArgumentError)
        from test2.rb:9:in `block in <main>'
        from test2.rb:8:in `each'
        from test2.rb:8:in `<main>'
It seems that the ISO-8859-1 encoded regexp /[äöüÄÖÜß]/ in test2a.rb is not transcoded to UTF-8. But since default_internal is set to UTF-8, Ruby seems to expect a valid UTF-8 string. Please forgive me if my interpretation of that error message is wrong. It is quite possible that I missed something and that an easy solution to this problem already exists that I don't know of. If that is the case, I kindly ask you to tell me about it.
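To illustrate the mismatch without depending on two differently encoded source files, here is a minimal sketch of a possible workaround. It builds the ISO-8859-1 regexp by hand and rebuilds it in the encoding of the string being matched. The helper names match_in_string_encoding and direct_match_error are hypothetical and not part of Ruby. The exception raised by the direct match may differ between Ruby versions, so the sketch rescues both EncodingError and ArgumentError.

```ruby
# Hypothetical helper (not part of Ruby): rebuild +re+ in the encoding of +str+
# so that non-ASCII characters in the pattern and the string are comparable.
def match_in_string_encoding(re, str)
  unless re.encoding == str.encoding
    # Keep only the user-visible flags; internal encoding bits are dropped.
    flags = re.options & (Regexp::IGNORECASE | Regexp::EXTENDED | Regexp::MULTILINE)
    re = Regexp.new(re.source.encode(str.encoding), flags)
  end
  re =~ str
end

# Returns the class of the exception raised by a direct match, or nil if none.
def direct_match_error(re, str)
  re.match(str)
  nil
rescue EncodingError, ArgumentError => e
  e.class
end

# Stand-in for the literal in test2a.rb: the same character class,
# stored as ISO-8859-1 bytes.
latin1_re = Regexp.new("[äöüÄÖÜß]".encode(Encoding::ISO_8859_1))

["foo\n", "weiß\n", "Bär\n", "bar\n"].each do |line|
  p [line, direct_match_error(latin1_re, line), match_in_string_encoding(latin1_re, line)]
end
```

With UTF-8 input lines, the direct match is expected to fail only for the lines that contain non-ASCII characters, while the helper returns the character index of the first umlaut or nil. Rebuilding the regexp on every call is wasteful; a real solution would cache the transcoded regexp per encoding.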
If this is the way Ruby 1.9.1 is currently supposed to work, I would humbly suggest silently transcoding all strings found in scripts to default_internal if it is non-nil. IMHO, not transcoding strings doesn't make any sense and drives users who work with heterogeneous files to madness. If a string cannot be transcoded to default_internal, an error should be raised. Thanks.
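The behaviour suggested here can be approximated in user code today. The following is only a sketch of the idea, not how Ruby handles literals: to_internal is a hypothetical helper, and the target encoding is passed explicitly in the examples so that the global default_internal setting is left alone.

```ruby
# Hypothetical helper (not an existing Ruby API): transcode +str+ to +target+,
# which defaults to Encoding.default_internal. Returns +str+ unchanged when no
# target is set or when the encodings already agree.
def to_internal(str, target = Encoding.default_internal)
  return str if target.nil? || str.encoding == target
  str.encode(target) # raises an EncodingError subclass if transcoding fails
end

# A string as it would come from an ISO-8859-1 encoded script.
latin1 = "weiß".encode(Encoding::ISO_8859_1)
utf8 = to_internal(latin1, Encoding::UTF_8)
p utf8.encoding
p utf8 == "weiß"

# Characters that the target encoding cannot represent raise an error.
begin
  to_internal("\u65E5\u672C", Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  puts "cannot transcode: #{e.message}"
end
```

Applying such a conversion implicitly to every literal would change the byte content of strings behind the programmer's back, which is presumably the main trade-off to weigh against the convenience.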
[1] http://groups.google.com/group/ruby-core-google/browse_frm/thread/d6474429dd112926?hl=en
=end