Bug #18899 (Closed): Inconsistent argument handling in IO#set_encoding
Description
IO#set_encoding behaves differently when processing a single String argument than when processing two arguments (whether Strings or Encodings) in the case where the external encoding is being set to binary and the internal encoding is being set to any other encoding.
This script demonstrates the resulting external and internal encodings of an IO instance for different, nominally equivalent ways of calling #set_encoding:
#!/usr/bin/env ruby

def show(io, args)
  printf(
    "args: %-50s external encoding: %-25s internal encoding: %-25s\n",
    args.inspect,
    io.external_encoding.inspect,
    io.internal_encoding.inspect
  )
end

File.open('/dev/null') do |f|
  args = ['binary:utf-8']
  f.set_encoding(*args)
  show(f, args)

  args = ['binary', 'utf-8']
  f.set_encoding(*args)
  show(f, args)

  args = [Encoding.find('binary'), Encoding.find('utf-8')]
  f.set_encoding(*args)
  show(f, args)
end
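The output is:
args: ["binary:utf-8"]                                   external encoding: #<Encoding:ASCII-8BIT>   internal encoding: nil
args: ["binary", "utf-8"]                                external encoding: #<Encoding:ASCII-8BIT>   internal encoding: #<Encoding:UTF-8>
args: [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]        external encoding: #<Encoding:ASCII-8BIT>   internal encoding: #<Encoding:UTF-8>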
This behavior is the same from Ruby 2.7.0 to 3.1.2.
Updated by javanthropus (Jeremy Bopp) about 2 years ago
Can anyone confirm whether this is a bug or intended behavior? I've taken a look at the code that implements this, and there are two largely independent code paths for handling the single string argument case and the multiple argument case. If this is confirmed to be a bug, I would like to write a patch to unify the behavior.
Updated by jeremyevans0 (Jeremy Evans) about 2 years ago
I think it is a bug. I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/6280. Not sure if the approach taken is the best way, though.
Updated by javanthropus (Jeremy Bopp) about 2 years ago
I ran my test against your branch, and it addresses this issue. I hope it can be incorporated soon. Thanks!
Updated by naruse (Yui NARUSE) almost 2 years ago
I think your example needs to be as follows:
#!/usr/bin/env ruby

def show(io, args)
  printf(
    "args: %-50s external encoding: %-25s internal encoding: %-25s\n",
    args.inspect,
    io.external_encoding.inspect,
    io.internal_encoding.inspect
  )
end

File.open('/dev/null', 'r:binary:utf-8') do |f|
  args = ['r:binary:utf-8']
  show(f, args)

  args = ['binary:utf-8']
  f.set_encoding(*args)
  show(f, args)

  args = ['binary', 'utf-8']
  f.set_encoding(*args)
  show(f, args)

  args = [Encoding.find('binary'), Encoding.find('utf-8')]
  f.set_encoding(*args)
  show(f, args)
end
The result will be:
args: ["r:binary:utf-8"]                                 external encoding: #<Encoding:ASCII-8BIT>   internal encoding: nil
args: ["binary:utf-8"]                                   external encoding: #<Encoding:ASCII-8BIT>   internal encoding: nil
args: ["binary", "utf-8"]                                external encoding: #<Encoding:ASCII-8BIT>   internal encoding: #<Encoding:UTF-8>
args: [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]        external encoding: #<Encoding:ASCII-8BIT>   internal encoding: #<Encoding:UTF-8>
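Note that "binary" is simply an alias for ASCII-8BIT, which is why every external encoding above reports as #<Encoding:ASCII-8BIT>. This is easy to confirm:

Encoding.find('binary')        # => #<Encoding:ASCII-8BIT>
Encoding.find('binary').names  # => ["ASCII-8BIT", "BINARY"]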
Updated by javanthropus (Jeremy Bopp) almost 2 years ago
Thank you for your response. How do the changes to the example make a difference? The results with the original example are:
args: ["binary:utf-8"] external encoding: #<Encoding:ASCII-8BIT> internal encoding: nil
args: ["binary", "utf-8"] external encoding: #<Encoding:ASCII-8BIT> internal encoding: #<Encoding:UTF-8>
args: [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>] external encoding: #<Encoding:ASCII-8BIT> internal encoding: #<Encoding:UTF-8>
Unless I'm mistaken, these are exactly the same as the last 3 lines of the modified example's output. The question remains as to why the single string argument case results in a nil internal encoding while the 2-argument cases do not.
Before investigating this, I thought that the logic would first split "binary:utf-8" into "binary" and "utf-8" and then proceed as in the 2 string argument case. In other words, I expected that all cases would result in the internal encoding being set to the same value, either nil or Encoding::UTF_8.
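In pseudocode, the splitting I expected would look something like this (a hypothetical sketch with a made-up parse_encoding_spec helper, not the actual io.c logic):

# Hypothetical helper illustrating the expected parsing; not a real Ruby method.
def parse_encoding_spec(spec)
  ext_name, int_name = spec.split(':', 2)
  [Encoding.find(ext_name), int_name && Encoding.find(int_name)]
end

parse_encoding_spec('binary:utf-8')
# => [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]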
Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago
After more research, it appears the current behavior is expected. Parsing the single string with an embedded colon is already handled correctly. However, if the external encoding is binary/ASCII-8BIT, then the internal encoding is deliberately set to nil:
// in rb_io_ext_int_to_encs
if (ext == rb_ascii8bit_encoding()) {
    /* If external is ASCII-8BIT, no transcoding */
    intern = NULL;
}
Basically, the 'binary:utf-8' encoding doesn't make sense. Providing two encodings is done to transcode from one encoding to the other. There is no transcoding if the external encoding is binary. If you want the internal encoding to be UTF-8, then just use 'utf-8'.
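For example (a minimal illustration of that suggestion; a single encoding argument sets the external encoding and leaves the internal encoding nil, so reads come back tagged UTF-8 without transcoding):

File.open('/dev/null') do |f|
  f.set_encoding('utf-8')  # single encoding: reads are tagged UTF-8, no transcoding
  p f.external_encoding    # => #<Encoding:UTF-8>
  p f.internal_encoding    # => nil
end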
That still leaves us with inconsistent behavior between 'binary:utf-8' and 'binary', 'utf-8'. So I propose to make the 'binary', 'utf-8' behavior the same as 'binary:utf-8'. I updated my pull request to do that: https://github.com/ruby/ruby/pull/6280
An alternative approach would be to remove the above code so that the external encoding is no longer treated specially.
Updated by Eregon (Benoit Daloze) almost 2 years ago
I've taken a look at IO#set_encoding recently, and it's such an unreadable mess that I think nobody would be able to explain its full semantics. So anything that simplifies it would IMHO be welcome.
I think IO#set_encoding should simply set the internal/external encodings for that IO, with no special cases and without caring about the default external/internal encodings. If some cases don't make any sense, they should raise an exception.
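A hypothetical sketch of those simplified semantics (my reading of the proposal; set_encoding_simplified and the state hash are made up for illustration, and this is not how io.c actually behaves):

# Hypothetical simplified IO#set_encoding semantics: resolve both arguments,
# reject combinations that make no sense, apply no implicit defaults.
def set_encoding_simplified(state, ext, int = nil)
  ext = Encoding.find(ext) if ext.is_a?(String)
  int = Encoding.find(int) if int.is_a?(String)
  if ext == Encoding::BINARY && int
    raise ArgumentError, 'transcoding from a binary external encoding is meaningless'
  end
  state[:external] = ext
  state[:internal] = int
end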
Updated by javanthropus (Jeremy Bopp) almost 2 years ago
Please also see #18995 for another example of the intricate implementation behaving unexpectedly. During my own investigation, I discovered that using "-" for the internal encoding name is silently ignored. According to the comments in the code, "-" is used to indicate no conversion, but it's completely undocumented for the method. If you use "-" for the external encoding name, you get similarly divergent behavior as reported for this issue if you pass "-:utf-8" vs. "-", "utf-8".
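A quick probe makes that divergence observable (my sketch; I intentionally don't assert the resulting values here, since they are exactly what differs between the two forms):

File.open('/dev/null') do |f|
  f.set_encoding('-:utf-8')    # "-" as the external name inside a combined specifier
  p [f.external_encoding, f.internal_encoding]
  f.set_encoding('-', 'utf-8') # "-" as a separate external encoding argument
  p [f.external_encoding, f.internal_encoding]
end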
Updated by Dan0042 (Daniel DeLorme) almost 2 years ago
Naively, I would have expected "binary:utf-8" to take arbitrary input and force the encoding to UTF-8, and "utf-8:utf-8" to read and validate the input as UTF-8.
Neither does what I expected. ¯\_(ツ)_/¯
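For comparison, the first expectation is essentially what String#force_encoding does at the string level (my illustration; IO#set_encoding does not currently offer this):

bytes = "caf\xC3\xA9".b        # UTF-8 bytes held in an ASCII-8BIT (binary) string
bytes.force_encoding('UTF-8')  # => "café": same bytes, relabeled as UTF-8, no validation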
Updated by jeremyevans (Jeremy Evans) almost 2 years ago
- Status changed from Open to Closed
Applied in changeset git|0903a251796c2b4086804a94420c231c04e3cea1.
Make IO#set_encoding with binary external encoding use nil internal encoding
This was already the behavior when a single 'external:internal'
encoding specifier string was passed. This makes the behavior
consistent for the case where separate external and internal
encoding specifiers are provided.
While here, fix the IO#set_encoding method documentation to
state that either the first or second argument can be a string
with an encoding name, and describe the behavior when the
external encoding is binary.
Fixes [Bug #18899]
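With this change, the separate-argument forms from the original report now match the combined-string form; a minimal check (my sketch):

File.open('/dev/null') do |f|
  f.set_encoding('binary', 'utf-8')
  p f.external_encoding  # => #<Encoding:ASCII-8BIT>
  p f.internal_encoding  # => nil (previously #<Encoding:UTF-8>)
end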