Bug #19383

open

Time.now.zone encoding for German display language in Windows is incorrect

Added by stringsn88keys (Thomas Powell) almost 2 years ago. Updated 16 days ago.

Status:
Assigned
Assignee:
Target version:
-
[ruby-core:112058]

Description

OS:
Verified on Windows 10 and Windows Server 2022 and Ruby 2.7.7 through 3.1.3

Display language:
Verified on German, but may impact other languages in which Time.now.zone returns characters that aren't [A-Za-z].

Time zone:
CET (UTC +01:00) Amsterdam, Berlin, ...

Time.now.zone # => "Mitteleurop\xE4ische Zeit"
Time.now.zone.encoding # => #<Encoding:IBM437>
puts Time.now.zone # => "Mitteleurop∑ische Zeit" (should be "Mitteleuropäische Zeit")
Time.now.zone.encode(Encoding::UTF_8) # => "Mitteleurop∑ische Zeit"

Doing a force_encoding on all encodings in Encoding.list reveals that ISO-8859-(1..16) and Windows-125(0,2,4,7) work to coerce the ä out of the time zone string:
Time.now.zone.force_encoding(Encoding::WINDOWS_1252) # => "Mitteleurop\xE4ische Zeit"
... but ...
Time.now.zone.force_encoding(Encoding::WINDOWS_1252).encode(Encoding::UTF_8) #=> "Mitteleuropäische Zeit"

Related issue: This improper encoding/rendering caused Ohai's JSON output to be unparseable. Workaround was forcing to Windows-1252.
https://github.com/chef/ohai/pull/1781
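
A minimal sketch of the workaround described above, assuming the bytes that Time.now.zone returned on the affected system. They are hard-coded here so the example runs on any platform; the variable names are illustrative only.

```ruby
# Bytes as reported for the German zone name (0xE4 is "ä" in Windows-1252).
# Hard-coded so the sketch does not depend on a German Windows install.
raw = "Mitteleurop\xE4ische Zeit".b

# What Ruby does today on the affected system: the bytes are labeled IBM437,
# so transcoding to UTF-8 maps 0xE4 through the CP437 table.
mislabeled = raw.dup.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)

# The workaround: relabel the same bytes as Windows-1252, then transcode.
fixed = raw.dup.force_encoding(Encoding::WINDOWS_1252).encode(Encoding::UTF_8)

puts mislabeled
puts fixed # => Mitteleuropäische Zeit
```

Note that force_encoding only changes the label on the bytes, while encode actually converts them, which is why the two calls are chained.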

Updated by austin (Austin Ziegler) almost 2 years ago

It’s been a long time since I’ve used Windows, but the Windows console is notoriously stuck in 1980s encodings and using codepage 65001 should fix this in general. Otherwise, you’re going to get Windows 1252 encoding as your default input/output encoding even if Ruby is otherwise using UTF-8.

I believe that since Ruby 3.0, Ruby by default uses UTF-8 but the boundaries caused by your console codepage may be a confounding factor.

Updated by stringsn88keys (Thomas Powell) almost 2 years ago

By "console", do you mean irb, or are you referring to PowerShell or cmd.exe/Command Prompt? Windows Terminal produces the same results as well.

Also, the source for this is from one process to another without user interactivity.

Looking at the Code Page 437 vs. Windows-1252, 0xE4 would be ∑ in Code Page 437 and ä in Windows-1252

The byte sequence of "Mitteleuropäische Zeit" as returned by Time.now.zone (which reports itself as "IBM437") is (hex values):
=> ["4d", "69", "74", "74", "65", "6c", "65", "75", "72", "6f", "70", "e4", "69", "73", "63", "68", "65", "20", "5a", "65", "69", "74"]

70 e4 69 would be "päi" in Windows-1252, but "p∑i" in IBM437 as reported. If UTF-8 is assumed, then e4 is the lead byte of a three-byte sequence (the range that includes CJK characters), but the bytes that follow (69 73) are not continuation bytes, so the sequence is invalid. This is confirmed by occasional invalid byte sequence errors depending on how the string is picked up.
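
The byte-level observations above can be reproduced with a short sketch. It assumes the same hard-coded bytes rather than a live Time.now.zone call, and the variable names are illustrative.

```ruby
# Same bytes as in the hex dump above, as a binary (ASCII-8BIT) string.
raw = "Mitteleurop\xE4ische Zeit".b

# Hex dump of each byte; index 11 is the 0xE4 in question.
hex = raw.bytes.map { |b| format("%02x", b) }
puts hex.inspect

# Labeled as UTF-8, the string is invalid: 0xE4 announces a three-byte
# sequence, but the next byte (0x69, "i") is not a continuation byte.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding? # => false

# Transcoding the mislabeled string raises, which matches the
# "invalid byte sequence" errors mentioned above.
transcode_error = nil
begin
  as_utf8.encode(Encoding::UTF_16LE)
rescue Encoding::InvalidByteSequenceError => e
  transcode_error = e
end
puts transcode_error.class
```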

Updated by austin (Austin Ziegler) almost 2 years ago

Yes, I mean cmd.exe or any other Windows command line. I repeat that it has been years since I have used Windows in any serious manner, but this was the absolute bane of my existence, and process boundaries on Windows were nightmarish when I last dealt with Windows, primarily because of the emphasis on backwards compatibility at any cost. There does appear to be a bug if Time.now.zone is not returning the data in the code page expected for where it’s used (e.g., UTF-8).

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

  • Status changed from Open to Feedback

What are:

  • the output from chcp.com command
  • Encoding.locale_charmap

?

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Maybe msvcrt converts timezone names to ACP, not ConsoleCP.
If so, this patch may work, but I have no idea how to test this in a CI.

diff --git i/time.c w/time.c
index 9c4c93939e0..2e1a2dca29b 100644
--- i/time.c
+++ w/time.c
@@ -929,7 +929,7 @@ timegmw_noleapsecond(struct vtm *vtm)
 }
 
 static VALUE
-zone_str(const char *zone)
+zone_str_enc(const char *zone, rb_encoding *enc)
 {
     const char *p;
     int ascii_only = 1;
@@ -950,11 +950,18 @@ zone_str(const char *zone)
         str = rb_usascii_str_new(zone, len);
     }
     else {
-        str = rb_enc_str_new(zone, len, rb_locale_encoding());
+        if (!enc) enc = rb_locale_encoding();
+        str = rb_enc_str_new(zone, len, enc);
     }
     return rb_fstring(str);
 }
 
+static VALUE
+zone_str(const char *zone)
+{
+    return zone_str_enc(zone, NULL);
+}
+
 static void
 gmtimew_noleapsecond(wideval_t timew, struct vtm *vtm)
 {
@@ -1653,12 +1660,18 @@ localtime_with_gmtoff_zone(const time_t *t, struct tm *result, long *gmtoff, VAL
 #if defined(HAVE_TM_ZONE)
             *zone = zone_str(tm.tm_zone);
 #elif defined(HAVE_TZNAME) && defined(HAVE_DAYLIGHT)
+            rb_encoding *enc = NULL;
+# if defined(_WIN32)
+            char cp[(sizeof(UINT) * 8 / 3) + 4];
+            snprintf(cp, sizeof(cp), "CP%u", GetACP());
+            enc = rb_enc_find(cp);
+# endif
 # if defined(RUBY_MSVCRT_VERSION) && RUBY_MSVCRT_VERSION >= 140
 #  define tzname _tzname
 #  define daylight _daylight
 # endif
             /* this needs tzset or localtime, instead of localtime_r */
-            *zone = zone_str(tzname[daylight && tm.tm_isdst]);
+            *zone = zone_str_enc(tzname[daylight && tm.tm_isdst], enc);
 #else
             {
                 char buf[64];
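
For readers who do not follow the C, here is a rough Ruby rendering of the idea behind the patch above. It is a sketch under assumptions, not what time.c does verbatim: the method below is a hypothetical stand-in for zone_str_enc, and the acp keyword stands in for the value GetACP() would return on Windows.

```ruby
# Hypothetical Ruby counterpart of zone_str_enc from the patch.
# zone_bytes: the raw zone name; acp: the ANSI code page number, or nil.
def zone_str_enc(zone_bytes, acp: nil)
  raw = zone_bytes.b

  # ASCII-only names are tagged US-ASCII, as in the existing C code.
  return raw.force_encoding(Encoding::US_ASCII) if raw.ascii_only?

  # Otherwise tag with "CP<acp>" when known, else fall back to the locale
  # encoding. Note: Encoding.find raises ArgumentError for unknown names.
  enc = acp ? Encoding.find("CP#{acp}") : Encoding.find("locale")
  raw.force_encoding(enc)
end

zone = zone_str_enc("Mitteleurop\xE4ische Zeit", acp: 1252)
puts zone.encoding                  # => Windows-1252
puts zone.encode(Encoding::UTF_8)   # => Mitteleuropäische Zeit
```

With the bytes tagged by the ANSI code page instead of the console code page, a plain encode to UTF-8 yields the expected name without any force_encoding on the caller's side.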

Updated by stringsn88keys (Thomas Powell) almost 2 years ago

nobu (Nobuyoshi Nakada) wrote in #note-4:

What are:

  • the output from chcp.com command
  • Encoding.locale_charmap

?

chcp.com output:
"Aktive Codepage: 437." ("Active code page: 437" on English display language.)

Encoding.locale_charmap # => "CP437" (German and English)
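
A small sketch for collecting the values requested above in one go. The report hash and its keys are illustrative, and the chcp.com call is only attempted on Windows; output will differ per system.

```ruby
# Gather the encoding-related settings relevant to this issue.
report = {
  locale_charmap: Encoding.locale_charmap,
  default_external: Encoding.default_external.name,
  default_internal: Encoding.default_internal&.name,
  zone_encoding: Time.now.zone&.encoding&.name
}

# chcp.com exists only on Windows; read its output as raw bytes because it
# is written in the console code page.
report[:chcp] = `chcp.com`.b.strip if Gem.win_platform?

report.each { |key, value| puts "#{key}: #{value.inspect}" }
```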

Updated by stringsn88keys (Thomas Powell) almost 2 years ago

The top level status of this bug says "Closed" but last updated status says "Feedback". Can anyone clarify?

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

  • Status changed from Feedback to Assigned
  • Assignee set to windows

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

How about the patch at #note-5?

Updated by YO4 (Yoshinao Muramatsu) 16 days ago

There is another remaining locale-related issue.
ref: https://bugs.ruby-lang.org/issues/20774

As an alternative solution, using setlocale(LC_CTYPE, ".65001") would solve the problem, but it does not seem to work in all environments.
Also, could it have an impact on extension libraries?

Another idea is to use SetProcessPreferredUILanguages or SetThreadPreferredUILanguages, which make it possible to lock in a fallback language.
Its results may or may not be desirable. Similarly, the result will affect other code running in the same process.

Regarding the #note-5 patch:
It seems to work fine, except for corner cases such as Unicode-specific characters set in TZ.
I also think that in today's environment there are many situations where we want UTF-8 results.
