Bug #15635
closedInconsistent handling of dummy encodings and code range
Description
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts
s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts
s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
The output on Ruby 2.6.1p33 for me is:
UTF-8
Dummy: false
Valid: true
[97, 98, 99]
UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling force_encoding
on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect force_encoding
using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.
The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from rb_enc_str_coderange
:
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
cr = ENC_CODERANGE_BROKEN;
}
This unconditionally sets the code range to CR_BROKEN
, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.
It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from String#encode
.
The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like String#valid_encoding?
for branching decisions. The latter because the runtime generally takes a slower path for operations on CR_BROKEN
strings.