Bug #11410

Win32 Registry enumeration performs unnecessary string re-encoding which causes UndefinedConversionError exceptions

Added by Iristyle (Ethan Brown) about 4 years ago. Updated about 4 years ago.

Target version:
ruby -v:
ruby 2.1.5p273 (2014-11-13 revision 48405) [x64-mingw32]


When enumerating keys with Win32::Registry#each_key / Win32::Registry#keys or values with Win32::Registry#each_value / Win32::Registry#values, Ruby will take a UTF-16LE string returned from the Windows API and convert it to the local codepage. In the case of each_value, the string is then immediately converted back to UTF-16LE before being used in subsequent Windows API calls. Not only is this conversion unnecessary, but it may cause encoding exceptions when the local codepage does not support all of the characters present in the original Unicode string.

One such example of this is when a Unicode en-dash U+2013 appears in a string, and the local codepage is IBM437, which has no equivalent character. But this is just one of many examples that may trigger this behavior.

[1] pry(main)> RUBY_VERSION
=> "2.1.5"
[2] pry(main)> ENDASH_UTF_16 = [0x2013]
=> [8211]
[3] pry(main)> utf_16_str = ENDASH_UTF_16.pack('s*').force_encoding(Encoding::UTF_16LE)
=> "\u2013"
[4] pry(main)> utf_16_str.encode(Encoding::IBM437)
Encoding::UndefinedConversionError: U+2013 to IBM437 in conversion from UTF-16LE to UTF-8 to IBM437
from (pry):4:in `encode'
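
The same check can be repeated against several codepages. The following is only a minimal sketch and not part of the original session: `representable?` is a hypothetical helper name, and `pack('v*')` is used so that the 16-bit code units are little-endian on any platform.

```ruby
# Hypothetical helper (not part of Win32::Registry): true when every
# character in +str+ can be transcoded to +encoding+.
def representable?(str, encoding)
  str.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

# 'v*' packs 16-bit little-endian code units, matching UTF-16LE.
endash = [0x2013].pack('v*').force_encoding(Encoding::UTF_16LE)

[Encoding::UTF_8, Encoding::WINDOWS_1252, Encoding::IBM437].each do |enc|
  puts "#{enc.name}: #{representable?(endash, enc)}"
end
```

Windows-1252 happens to contain an en-dash, so the failure only shows up on systems whose codepage lacks it.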

NOTE: Normal registry reads of a value at a particular key are not problematic - the bad behavior is triggered specifically during enumeration.

This is primarily a result of the export_string function, which re-encodes strings

It is used by each_value and each_key, which return UTF-16LE strings:

In the each_value method, this LOCALE re-encoded string is then passed to the read method, where it is turned back into a UTF-16LE string to be passed to RegQueryValueExW.
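
To illustrate why the round trip itself is the problem, the sketch below simulates it in plain Ruby without touching the registry. `LOCALE`, `WCHAR`, `round_trip`, and the sample names are assumptions made for illustration only; the real code path lives inside Win32::Registry.

```ruby
# Assumed locale codepage for this illustration.
LOCALE = Encoding::IBM437
WCHAR  = Encoding::UTF_16LE

# Simulates what enumeration does to a name: UTF-16LE -> locale -> UTF-16LE.
def round_trip(wide_name)
  wide_name.encode(LOCALE).encode(WCHAR)
end

ascii_name  = 'foo'.encode(WCHAR)
endash_name = "foo \u2013 bar".encode(WCHAR)

# An ASCII-only name survives the round trip unchanged.
puts round_trip(ascii_name) == ascii_name

# A name containing U+2013 fails on the first conversion, before any
# second API call could be made with it.
begin
  round_trip(endash_name)
rescue Encoding::UndefinedConversionError => e
  puts "round trip failed: #{e.message}"
end

# The original string is already in the encoding the wide API expects.
puts endash_name.encoding
```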

Inside Puppet, we employed a solution that avoids Ruby's Win32::Registry when performing enumeration, and relies on internal helpers instead (avoiding unnecessary string encodings). This was unfortunate, but necessary:

Note also that we typically convert UTF-16LE strings to UTF-8 internally (since this is almost always guaranteed to be a lossless conversion), until we reach an end-user boundary where they absolutely need a specific encoding rendered. For instance, our version of read converts to UTF-8:
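
Puppet's actual helper is not reproduced here. As a rough sketch of the convention only, a wide-character buffer can be converted to UTF-8 once and converted back without loss. `wide_to_utf8` and `utf8_to_wide` are hypothetical names, and the NUL-terminated buffer is an assumption about what a wide API hands back.

```ruby
# Hypothetical helper: decode a UTF-16LE buffer, as a wide-character API
# might return it, and drop everything from the first NUL terminator on.
def wide_to_utf8(buffer)
  utf8 = buffer.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  utf8[/\A[^\x00]*/]
end

# Hypothetical helper: convert back only when calling a wide API again.
def utf8_to_wide(str)
  str.encode(Encoding::UTF_16LE)
end

original = "foo \u2013 \u2122"

# Simulated API result: UTF-16LE bytes followed by a 16-bit NUL terminator.
buffer = original.encode(Encoding::UTF_16LE).b + "\0\0".b

utf8 = wide_to_utf8(buffer)
puts utf8 == original
puts utf8_to_wide(utf8) == original.encode(Encoding::UTF_16LE)
```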

I suggest that other locations where strings are re-encoded be examined for potential issues, as locale codepage conversions are generally considered dangerous given Win32 APIs use UTF-16LE.


Updated by Iristyle (Ethan Brown) about 4 years ago

I realized that I should have included some sample code demonstrating the problem:

require 'win32/registry'

ENDASH_UTF_16 = [0x2013]
TM_UTF_16 = [0x2122]

endash_utf_16_str = ENDASH_UTF_16.pack('s*').force_encoding(Encoding::UTF_16LE)
tm_utf_16_str = TM_UTF_16.pack('s*').force_encoding(Encoding::UTF_16LE)

def test_with_encoding(root, key_name, encoding)
  Encoding::default_internal = encoding

  puts "\n\nTesting with #{encoding.to_s}"

  # read a single value with []
  puts "- Reading value #{root.parent.keyname}\\#{root.keyname}\\#{key_name}"
  begin
    value = root[key_name]
    puts "    - read value #{key_name} as #{value}"
  rescue Exception => e
    puts "    x failed to read from #{key_name}\n\t\t#{e}\n"
  end

  # read a single value (type and data) with read
  puts " - Reading value #{root.parent.keyname}\\#{root.keyname}\\#{key_name}"
  begin
    type, value = root.read(key_name)
    puts "    - read value #{key_name} as type: #{type}, value: #{value}"
  rescue Exception => e
    puts "    x failed to read from #{key_name}\n\t\t#{e}\n"
  end

  # enumerate subkeys
  puts " - Enumerating Keys for #{root.parent.keyname}\\#{root.keyname}"
  begin
    root.each_key do |key, wtime|
      puts "    - read each_key #{key}"
    end
  rescue Exception => e
    puts "    x failed to each_key from #{root.parent.keyname}\\#{root.keyname}\n\t\t#{e}\n"
  end

  # enumerate values
  puts " - Enumerating Values for #{root.parent.keyname}\\#{root.keyname}"
  begin
    root.each_value do |name, type, value|
      puts "    - read each_value #{name} as type: #{type}, value: #{value}"
    end
  rescue Exception => e
    puts "    x failed to each_value from #{root.parent.keyname}\\#{root.keyname}\n\t\t#{e}\n"
  end
end

root = Win32::Registry::HKEY_CURRENT_USER
root.create('SOFTWARE\rubyfail') do |reg|
  # create subkey with trademark symbol
  reg.create(tm_utf_16_str)

  # create endash value named foo
  reg.write('foo', Win32::Registry::REG_SZ, endash_utf_16_str)

  test_with_encoding(reg, 'foo', Encoding::WINDOWS_1252)

  # failures with both enumeration of keys and values
  test_with_encoding(reg, 'foo', Encoding::IBM437)
end

The important part is that you will see failures when calling each_key and each_value if either encounters characters that cannot be converted to the current codepage.

Updated by nobu (Nobuyoshi Nakada) about 4 years ago

  • Status changed from Open to Feedback

I agree that unnecessary conversions should be removed, but your code won't work yet, since the results will be expected in the locale encoding.

What do you want?

  1. it's OK
  2. return everything in UTF-8
  3. add optional parameter to specify the result encoding
  4. or others
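
For discussion, options 2 and 3 could be combined roughly as follows. This is only a sketch with a stand-in data source: `FakeKey`, `each_value_name`, `value_names`, and the `encoding:` keyword are all hypothetical and do not exist in Win32::Registry.

```ruby
# Stand-in for a registry key: holds value names as UTF-16LE, the form in
# which the wide-character Windows APIs return them. Purely illustrative.
class FakeKey
  def initialize(names)
    @wide_names = names.map { |n| n.encode(Encoding::UTF_16LE) }
  end

  # Yields each name in +encoding+ (UTF-8 unless the caller asks otherwise).
  def each_value_name(encoding: Encoding::UTF_8)
    @wide_names.each { |wide| yield wide.encode(encoding) }
  end

  # Array-returning variant that forwards the same optional encoding.
  def value_names(encoding: Encoding::UTF_8)
    names = []
    each_value_name(encoding: encoding) { |n| names << n }
    names
  end
end

key = FakeKey.new(['foo', "dash \u2013 name"])

# Default: UTF-8, which can represent every name.
p key.value_names.map(&:encoding)

# Explicit request: the caller opts in to another encoding and its risks.
p key.value_names(encoding: Encoding::UTF_16LE).map(&:encoding)
```

With a UTF-8 default, callers that never pass `encoding:` cannot hit a conversion error from an unmappable character.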

Updated by Iristyle (Ethan Brown) about 4 years ago

I think the best solution here is to use UTF-8 strings wherever possible. If a program needs to use locale, then let the program decide to do that. I don't think Ruby should be making encoding decisions for a user like this, given Ruby is using wide character APIs and UTF-16LE strings.
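
Under that approach, a program that really needs the locale codepage could convert at its own output boundary and decide how to handle unmappable characters. A minimal sketch, assuming IBM437 as the console codepage and using a hypothetical `for_console` helper:

```ruby
# Hypothetical helper: re-encode a UTF-8 name for display only.
# Unmappable characters are replaced instead of raising. That is acceptable
# for output, but the result must not be used as a name in later API calls.
def for_console(utf8_name, codepage)
  utf8_name.encode(codepage, undef: :replace, replace: '?')
end

name  = "dash \u2013 name"
shown = for_console(name, Encoding::IBM437)

puts shown.encoding
puts shown.valid_encoding?
```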

While your proposed solution should work, I don't think the burden should be put on the calling code to always set an encoding value everywhere #each_value or #each_key is used, to prevent an exception. Keep in mind that #keys and #values call #each_key and #each_value, but with your solution, there is no way to override the encoding when using those methods.
