IO#set_encoding behaves differently when given a single String argument than when given two arguments (whether Strings or Encodings) in the case where the external encoding is being set to binary and the internal encoding is being set to any other encoding.
This script demonstrates the resulting values of the external and internal encodings for an IO instance given different ways to equivalently call #set_encoding:
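A minimal sketch along those lines (using the read end of an IO.pipe; not necessarily the original script) shows the divergence on a Ruby without the fix discussed below:

    r, _w = IO.pipe

    # Single-string form: external and internal separated by a colon.
    r.set_encoding("binary:utf-8")
    p [r.external_encoding, r.internal_encoding]
    # => [#<Encoding:ASCII-8BIT>, nil]

    # Two-argument form with encoding name strings.
    r.set_encoding("binary", "utf-8")
    p [r.external_encoding, r.internal_encoding]
    # => [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]

    # Two-argument form with Encoding objects.
    r.set_encoding(Encoding::BINARY, Encoding::UTF_8)
    p [r.external_encoding, r.internal_encoding]
    # => [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]

Only the single-string form ends up with a nil internal encoding; both two-argument forms keep UTF-8 as the internal encoding.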
Can anyone confirm if this is a bug or intended behavior? I've taken a look at the code that implements this, and there are 2 pretty independent code paths for handling the single string argument case and the multiple argument case. If this is confirmed to be a bug, I would like to write a patch to unify the behavior.
I think it is a bug. I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/6280. Not sure if the approach taken is the best way, though.
Unless I'm mistaken, these are exactly the same as the last 3 lines of the modified example's output. The question remains as to why the single string argument case results in a nil internal encoding while the 2 argument cases do not.
Before investigating this, I thought that the logic would first split "binary:utf-8" into "binary" and "utf-8" and then proceed as in the two-string-argument case. In other words, I expected all cases to result in the internal encoding being set to the same value, either nil or Encoding::UTF_8.
After more research, it appears the current behavior is expected. Parsing the single string with an embedded colon is already handled correctly. However, if the external encoding is binary/ASCII-8BIT, then the internal encoding is deliberately set to nil:
    // in rb_io_ext_int_to_encs
    if (ext == rb_ascii8bit_encoding()) {
        /* If external is ASCII-8BIT, no transcoding */
        intern = NULL;
    }
Basically, the 'binary:utf-8' encoding doesn't make sense. Providing two encodings is done to transcode from one encoding to the other. There is no transcoding if the external encoding is binary. If you want the internal encoding to be UTF-8, then just use 'utf-8'.
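For example, a sketch on a pipe's read end, assuming the default internal encoding is not set:

    r, _w = IO.pipe
    r.set_encoding("utf-8")
    p [r.external_encoding, r.internal_encoding]  # => [#<Encoding:UTF-8>, nil]
    # Strings read from r are tagged as UTF-8; no transcoding is involved.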
That still leaves us with inconsistent behavior between 'binary:utf-8' and 'binary', 'utf-8'. So I propose to make the 'binary', 'utf-8' behavior the same as 'binary:utf-8'. I updated my pull request to do that: https://github.com/ruby/ruby/pull/6280
An alternative approach would be to remove the code above that treats the external encoding specially.
I've taken a look at IO#set_encoding recently, and it's such an unreadable mess that I think nobody would be able to explain its full semantics.
So anything to simplify it would IMHO be welcome.
I think IO#set_encoding should simply set the internal/external encodings for that IO, with no special cases and not caring about the default external/internal encodings.
If some cases don't make any sense they should raise an exception.
Please also see #18995 for another example of the intricate implementation behaving unexpectedly. During my own investigation, I discovered that using "-" for the internal encoding name is silently ignored. According to the comments in the code, "-" is used to indicate no conversion, but it's completely undocumented for the method. If you use "-" for the external encoding name, you get divergent behavior similar to what's reported for this issue, depending on whether you pass "-:utf-8" or "-", "utf-8".
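For example (a sketch; the exact results depend on the Ruby version, so none are shown here):

    r, _w = IO.pipe

    r.set_encoding("utf-8", "-")    # "-" as the internal encoding name: silently ignored
    p [r.external_encoding, r.internal_encoding]

    r.set_encoding("-:utf-8")       # "-" as the external encoding name, single-string form
    p [r.external_encoding, r.internal_encoding]

    r.set_encoding("-", "utf-8")    # "-" as the external encoding name, two-argument form
    p [r.external_encoding, r.internal_encoding]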
Naively, I would have expected "binary:utf-8" to take arbitrary input and force the encoding to UTF-8, and "utf-8:utf-8" to read and validate the input as UTF-8.
Neither does what I expected. ¯\_(ツ)_/¯
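What the two specifiers actually report, in a sketch (the first result is the one reported in this issue; the second assumes that an internal encoding equal to the external one is treated as "no transcoding"):

    r, _w = IO.pipe

    r.set_encoding("binary:utf-8")
    p [r.external_encoding, r.internal_encoding]   # => [#<Encoding:ASCII-8BIT>, nil]

    r.set_encoding("utf-8:utf-8")
    p [r.external_encoding, r.internal_encoding]   # => [#<Encoding:UTF-8>, nil]

    # Neither form sets up a conversion or validation step.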
Make IO#set_encoding with binary external encoding use nil internal encoding
This was already the behavior when a single 'external:internal'
encoding specifier string was passed. This makes the behavior
consistent for the case where separate external and internal
encoding specifiers are provided.
While here, fix the IO#set_encoding method documentation to
state that either the first or second argument can be a string
with an encoding name, and describe the behavior when the
external encoding is binary.
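With this change, both spellings should report the same pair of encodings, e.g. (a sketch of the intended post-patch behavior):

    r, _w = IO.pipe

    r.set_encoding("binary:utf-8")
    p [r.external_encoding, r.internal_encoding]   # => [#<Encoding:ASCII-8BIT>, nil]

    r.set_encoding("binary", "utf-8")
    p [r.external_encoding, r.internal_encoding]   # => [#<Encoding:ASCII-8BIT>, nil]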