Project

General

Profile

Actions

Bug #15635

closed

Inconsistent handling of dummy encodings and code range

Added by nirvdrum (Kevin Menard) about 5 years ago. Updated about 5 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
[ruby-core:91662]

Description

It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:

s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect

The output on Ruby 2.6.1p33 for me is:

UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling force_encoding on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect force_encoding using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from rb_enc_str_coderange:

if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}

This unconditionally sets the code range to CR_BROKEN, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from String#encode.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like String#valid_encoding? for branching decisions. The latter because the runtime generally takes a slower path for operations on CR_BROKEN strings.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0