Feature #16604

Updated by larskanis (Lars Kanis) about 4 years ago

This issue is related to https://bugs.ruby-lang.org/issues/13488 where we already discussed the topic and postponed the change to ruby-3. Patch is here: https://github.com/ruby/ruby/pull/2877 

 Currently `Encoding.default_external` is initialized to the local console encoding of the Windows installation unless changed with the `-E` option. This is e.g. cp850 for Western Europe. It should be changed to UTF-8. 

 RubyInstaller has provided a checkbox for `RUBYOPT=-Eutf-8` since version 2.4. 
 This checkbox was disabled by default, but I noticed from bug reports that many people enabled it. 
 With RubyInstaller-2.7.0 this checkbox is [enabled by default](https://rubyinstaller.org/2020/01/05/rubyinstaller-2.7.0-1-released.html). 
 So we already have a steady migration towards UTF-8 on Windows. 

 Changing to UTF-8 fixes various inconsistencies within ruby and with external tools. 
 A very annoying case is that writing text to a file writes the file content in UTF-8, since this is the default ruby source encoding. 
 But reading the content back tags the string with the wrong encoding. 
 This doesn't happen in `irb`, since it already sets `Encoding.default_external = "utf-8"` on its own. 

 ``` 
 s = "äöü" 
 File.write("x", s)     # => 6 bytes 
 File.read("x") == s    # => true in irb but false in .rb file 
 ``` 
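 The mismatch can be reproduced on any platform by setting the default by hand. This is only a minimal sketch: the helper name `round_trip_equal?` is hypothetical, `CP850` merely stands in for a Western European console encoding, and `Encoding.default_internal` is assumed to be unset.

```ruby
require "tmpdir"

# Hypothetical helper: writes a UTF-8 string, reads it back while the given
# default external encoding is in effect, and reports whether both compare equal.
def round_trip_equal?(external)
  saved_external = Encoding.default_external
  saved_verbose = $VERBOSE
  $VERBOSE = nil                        # avoid the warning about changing the default
  Encoding.default_external = external
  s = "\u00E4\u00F6\u00FC"              # "äöü", always a UTF-8 string
  Dir.mktmpdir do |dir|
    path = File.join(dir, "x")
    File.write(path, s)                 # bytes are written as-is
    File.read(path) == s                # result is tagged with default_external
  end
ensure
  Encoding.default_external = saved_external
  $VERBOSE = saved_verbose
end

p round_trip_equal?("UTF-8")   # => true
p round_trip_equal?("CP850")   # => false
```

 The bytes on disk are identical in both cases; only the encoding tag of the string returned by `File.read` differs, and that is enough to make the comparison fail.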

 Another issue is that many non-Asian regions have distinct legacy encodings for the OEM code page (aka `Encoding.find('locale')`) and the ANSI code page (aka `Encoding.find('filesystem')`), so that a file written in the current default external encoding `Encoding.find('locale')` is not properly interpreted by Windows GUI tools like Notepad. It is therefore uncommon to store files in the OEM encoding, and doing so is almost certainly wrong. 
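 To see which encodings are involved on a given machine, something like the following can be used. The helper name `encoding_report` is hypothetical, and the values depend on the platform and its configuration (on non-Windows systems the entries often coincide).

```ruby
# Hypothetical helper: collects the encodings ruby derives from the environment.
def encoding_report
  {
    locale:     Encoding.find("locale"),       # console (OEM) code page on Windows
    filesystem: Encoding.find("filesystem"),   # ANSI code page on Windows
    external:   Encoding.default_external,     # default tag for data that is read
  }
end

encoding_report.each { |name, enc| puts "#{name}: #{enc}" }
```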

 RubyInstaller ships the MSYS2 environment, which defaults to UTF-8 as well. 

 Powershell made the switch to UTF-8 (without BOM) in [Powershell-6.0](https://docs.microsoft.com/en-us/powershell/scripting/whats-new/what-s-new-in-powershell-core-60?view=powershell-7#default-encoding-is-utf-8-without-a-bom-except-for-new-modulemanifest) and even more in 6.1. 

 Changing the default of `Encoding.default_external` to UTF-8 is a trade-off. 
 It doesn't fit every case, but in my experience this is the best overall option. And it's just the default for the default, so it can be overridden in many ways. 
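 A few of these ways, as a rough sketch. The `-E` option and `RUBYOPT` are the ones mentioned above; the helper name `read_utf8` is hypothetical and only illustrates the per-call `encoding:` option.

```ruby
# Per process, from outside the script (shown as comments, not executed here):
#   ruby -Eutf-8 script.rb
#   RUBYOPT=-Eutf-8          (set as an environment variable)
#
# Per process, at the top of a script:
#   Encoding.default_external = Encoding::UTF_8

# Per call: name the encoding explicitly, so the default is never consulted.
def read_utf8(path)
  File.read(path, encoding: Encoding::UTF_8)
end
```

 The per-call form is the most robust one for libraries, since it doesn't depend on how the process was started.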

 There are some alternatives to it: 

 Changing the Windows console to code page 65001: 
  * The Windows implementation of 65001 is buggy in the console. I didn't verify it lately, but `chcp 65001` didn't work reliably years ago. 
  * It is not the default and input methods like IME are incompatible. 
  * It sets `locale` to UTF-8, so that the native console encoding isn't easily available. 

 Setting `Encoding.default_internal` in addition: 
  * This triggers transcoding of output strings, which is not enabled on other systems, causing unexpected results and incompatibilities. 

 Changing ruby to use `Encoding.find("filesystem")` as encoding for file operations: 
  * That would fix the compatibility with some builtin Windows tools, but doesn't fix encoding issues due to increased use of UTF-8. 

 Please note that changing `Encoding.default_external` doesn't affect file or IO output, unless `Encoding.default_internal` is set as well (which is not the default). So inspecting ruby's output with Windows builtin `more` will most likely result in garbage (since strings are usually UTF-8 in ruby) regardless of the particular `default_external` setting. On the other hand output inspected with MSYS2 `less` is most likely correct, since it expects UTF-8 input. 
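 The write side can be checked with a small sketch like this one. The helper name `bytes_on_disk` is hypothetical, and the result assumes that `Encoding.default_internal` is unset and that no explicit encoding is passed to `File.write`.

```ruby
require "tmpdir"

# Hypothetical helper: returns the raw bytes that File.write puts on disk
# while the given default external encoding is in effect.
def bytes_on_disk(text, external)
  saved_external = Encoding.default_external
  saved_verbose = $VERBOSE
  $VERBOSE = nil                        # avoid the warning about changing the default
  Encoding.default_external = external
  Dir.mktmpdir do |dir|
    path = File.join(dir, "out.txt")
    File.write(path, text)
    File.binread(path).bytes
  end
ensure
  Encoding.default_external = saved_external
  $VERBOSE = saved_verbose
end

s = "\u00E4\u00F6\u00FC"                # "äöü" as a UTF-8 string, 6 bytes
p bytes_on_disk(s, "UTF-8") == bytes_on_disk(s, "CP850")   # => true
p bytes_on_disk(s, "CP850") == s.bytes                     # => true
```

 In both cases the UTF-8 bytes of the string end up in the file unchanged, which is why the `default_external` setting makes no difference to tools that read ruby's output.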

 Another thing that the external encoding doesn't change is ruby's `locale` and `filesystem` encoding. Both can still be used explicitly in cases where the legacy encoding is required. 

 The patch is currently about Windows only, because I would like to focus on that question for now. 
 A possible follow-up question is whether `Encoding.default_external` should default to UTF-8 on all operating systems, or at least in case of the `LANG=C` locale (which currently triggers US-ASCII). 
