Feature #19191

Implicit console input transcoding is more desirable

Added by YO4 (Yoshinao Muramatsu) about 2 years ago. Updated about 1 month ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:111247]

Description

In response to Bug #18353, STDIN.internal_encoding is now set and the encoding is converted explicitly on the Windows platform.
For example, [STDIN.external_encoding, STDIN.internal_encoding] # => [Encoding::Windows-31J, Encoding::UTF-8] when STDIN is a console.

I feel that internal_encoding should be reserved for specific applications, and I think existing code does not expect internal_encoding to be set on STDIN.

Today I found that irb breaks the STDIN encoding, for example:

>ruby -rirb -e "p [$stdin.external_encoding, $stdin.internal_encoding]; IRB.setup(''); IRB::Irb.new(); p [$stdin.external_encoding, $stdin.internal_encoding]"
[#<Encoding:Windows-31J>, #<Encoding:UTF-8>]
[#<Encoding:UTF-8>, nil]

We know that console input is encoded in the console code page, so we can always convert from the console code page to io_input_encoding().

proposal

When reading from the console on Windows, the input encoding is forced to the console code page, and encoding conversion is applied implicitly.

set_encoding("UTF-8") implicitly converts from the console code page to UTF-8.
set_encoding("CP437", "UTF-8") implicitly converts from the console code page to UTF-8; the external_encoding is ignored.

binmode and binary input methods are not affected by this specification.
set_encoding and related methods continue to work as before; this specification only takes effect when encoding conversion is performed on read (NEED_READCONV() and make_readconv()).
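
The rule above can be modelled in plain Ruby. This is only a sketch of the proposed behaviour, not the actual io.c implementation: the method name read_console_model and its keyword arguments are hypothetical, and the console code page is passed in explicitly so the example runs on any platform.

```ruby
# Hypothetical model of the proposal (not Ruby's real implementation):
# bytes coming from the console are always interpreted in the console
# code page, then converted to the encoding the caller asked for.
def read_console_model(raw_bytes, console_code_page:, external: nil, internal: nil)
  # If an internal encoding is given, the external encoding is ignored,
  # because the source encoding is fixed by the device (the console).
  target = internal || external || Encoding.default_external
  raw_bytes.dup.force_encoding(console_code_page).encode(target)
end

raw = "\x82\xA0\n".b  # the bytes of "あ\n" in Windows-31J (code page 932)

# Corresponds to set_encoding("UTF-8")
a = read_console_model(raw, console_code_page: "Windows-31J", external: "UTF-8")

# Corresponds to set_encoding("CP437", "UTF-8"); "CP437" is ignored
b = read_console_model(raw, console_code_page: "Windows-31J",
                       external: "CP437", internal: "UTF-8")

p [a.encoding, a == b]
```

In both calls the result is UTF-8 text, which is the behaviour the proposal asks for regardless of the external encoding passed to set_encoding.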

Updated by Eregon (Benoit Daloze) about 2 years ago

YO4 (Yoshinao Muramatsu) wrote:

when set_encoding("UTF-8") implicitly converts console code page to UTF-8.

I'm against more inconsistent corner cases like this for set_encoding.
Probably IRB should be fixed here to inherit the original $stdin external and internal encodings?
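
A minimal sketch of what "inheriting the original encodings" could look like, assuming the fix is a save-and-restore around the code that touches the stream. The helper name preserve_encodings is hypothetical and is not part of IRB; a pipe stands in for the console so the example runs anywhere.

```ruby
# Hypothetical helper (not an IRB API): remember the external/internal
# encodings of an IO, run the block, then restore them.
def preserve_encodings(io)
  saved = [io.external_encoding, io.internal_encoding]
  yield io
ensure
  io.set_encoding(*saved)
end

reader, writer = IO.pipe
reader.set_encoding("Windows-31J", "UTF-8")  # what a Windows console STDIN reports

preserve_encodings(reader) do |io|
  io.set_encoding("UTF-8")                   # simulates the change seen after IRB setup
end

p [reader.external_encoding, reader.internal_encoding]
reader.close
writer.close
```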

Updated by YO4 (Yoshinao Muramatsu) about 2 years ago

I agree that the IRB issue should be fixed on the IRB side.

My point was that for certain devices, external_encoding on read can be fixed to the device's specification.
In that case, external_encoding is not used when internal_encoding is specified,
and if only external_encoding is specified, it is treated as a conversion from device_encoding to external_encoding.

Input from the console will be treated as locale encoding.

Updated by YO4 (Yoshinao Muramatsu) about 2 years ago

I am not sure if this is appropriate for this topic, but
consider the case where reading UTF-16 from the console is supported in the future.

For explicit encoding

p [STDIN.external_encoding, STDIN.internal_encoding]
=> ["UTF-16LE", "UTF-8"].

For implicit encoding

p [STDIN.external_encoding, STDIN.internal_encoding]
=> ["UTF-8", nil].

And I think the console output implicitly uses UTF-16LE as device encoding.

Updated by YO4 (Yoshinao Muramatsu) about 1 month ago

irb changes $stdin.{external,internal}_encoding.
This causes gets() to no longer return the correct content in irb.

C:\>chcp
現在のコード ページ: 932

C:\>ruby -e "p [STDIN.external_encoding, STDIN.internal_encoding]"
[#<Encoding:Windows-31J>, #<Encoding:UTF-8>]

C:\>ruby -e "gets.then { p [_1, _1.encoding] }"
あ
["あ\n", #<Encoding:UTF-8>]

C:\>irb
irb(main):001> p [STDIN.external_encoding, STDIN.internal_encoding];
[#<Encoding:UTF-8>, nil]
irb(main):002> gets.then { p [_1, _1.encoding] };
あ
["\x82\xA0\n", #<Encoding:UTF-8>]
irb(main):003>

It seems that making changes on the irb side would have a negative impact on tests and so on.
I think it is more reliable to handle this on the ruby.exe side.
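
The broken result in the irb session above can be reproduced on any platform, because it comes down to code page 932 bytes being labelled as UTF-8 without conversion. The following is a sketch of that diagnosis; the method name decode_report is hypothetical.

```ruby
# Hypothetical helper: compare "label the raw bytes as UTF-8" (what the
# irb session shows) with "convert from the console code page to UTF-8"
# (what plain ruby does when internal_encoding is set).
def decode_report(raw_bytes, console_code_page)
  mislabeled = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  converted  = raw_bytes.dup.force_encoding(console_code_page).encode(Encoding::UTF_8)
  {
    mislabeled_valid: mislabeled.valid_encoding?,  # are the bytes valid UTF-8?
    converted_valid:  converted.valid_encoding?,
    converted:        converted
  }
end

# "\x82\xA0\n" is what gets returned inside irb in the session above.
report = decode_report("\x82\xA0\n".b, "Windows-31J")
p report[:mislabeled_valid]
p report[:converted_valid]
```

The string returned in irb carries the UTF-8 label but is not valid UTF-8, while converting the same bytes from Windows-31J yields the text that was typed.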

Updated by YO4 (Yoshinao Muramatsu) about 1 month ago

PoC code is here:
https://github.com/ruby/ruby/pull/12055

However, for an actual implementation of Unicode input, I recommend the approach larskanis takes in https://github.com/ruby/ruby/pull/11799 .
