Bug #20189
Status: open
`rb_str_resize` does not clear coderange when expanding
Description
Expanding a string in some encodings (UTF-16, UTF-32) can change its coderange to either valid or broken, but `rb_str_resize` does not clear the coderange.
This causes bugs in C extension libraries that use `rb_str_resize`.
# Example for stringio
require 'stringio'
s = StringIO.new("\0".encode('UTF-16LE'))
s.truncate(1); s.truncate(2); s.string.valid_encoding?
#=> true
s.truncate(1); s.string.valid_encoding?; s.truncate(2); s.string.valid_encoding?
#=> false (expected true)
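For comparison, here is a minimal sketch of the same shrink-then-grow sequence done with plain String methods instead of StringIO. It assumes that `force_encoding` discards the cached coderange, so validity is recomputed after the string grows; the variable names are only illustrative.

```ruby
# A 1-byte UTF-16LE string is broken; growing it back to 2 bytes makes it valid.
original = "\0".encode("UTF-16LE")        # 2 bytes: U+0000
short = original.byteslice(0, 1)          # 1 byte, still tagged UTF-16LE
broken_when_short = short.valid_encoding? # scans the string and caches the result

# Grow the string back to 2 bytes using Ruby-level methods only.
short.force_encoding(Encoding::BINARY)
short << "\0".b
short.force_encoding(Encoding::UTF_16LE)  # assumption: this resets the cached coderange
valid_after_growing = short.valid_encoding?

puts broken_when_short    # false
puts valid_after_growing  # true
```

This is the result the StringIO example above is expected to give as well once the coderange is cleared on expansion.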
Updated by nobu (Nobuyoshi Nakada) 10 months ago
Does this happen only with wide-char encoding?
Updated by tompng (tomoya ishida) 10 months ago
I think so. Shift_JIS characters do not end with a null byte, and the other encodings seem to be the same.
Encoding.list.select {|e|
  256.times.any? do |first_byte|
    a = first_byte.chr
    b = a + "\0"
    # only one of \x??\x00 and \x?? is valid
    a.force_encoding(e).valid_encoding? != b.force_encoding(e).valid_encoding?
  end
}
# => [#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>]
It looks like there is no character (analogous to "表" in Shift_JIS, which is "\x95\x5c") for which "\x??\x00" is valid and "\x??" is not.
I opened a pull request https://github.com/ruby/ruby/pull/9552
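The search above appends a single null byte to a one-byte string, so it can only detect encodings with 2-byte units. A rough variant that also covers 4-byte units might look like the sketch below. The helper name `nul_sensitive_encodings` is hypothetical, and dummy encodings are skipped here as a simplifying assumption.

```ruby
# For each non-dummy encoding, check whether appending one NUL byte to a run
# of 1..max_prefix NUL bytes flips the result of valid_encoding?.
def nul_sensitive_encodings(max_prefix = 3)
  Encoding.list.reject(&:dummy?).select do |enc|
    (1..max_prefix).any? do |k|
      shorter = ("\0".b * k).force_encoding(enc)
      longer  = ("\0".b * (k + 1)).force_encoding(enc)
      shorter.valid_encoding? != longer.valid_encoding?
    end
  end
end

sensitive = nul_sensitive_encodings
p sensitive
```

The result should include the UTF-16 and UTF-32 LE/BE variants; the full list may depend on the Ruby version.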
Updated by nobu (Nobuyoshi Nakada) 10 months ago
- Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: REQUIRED, 3.3: REQUIRED
Updated by byroot (Jean Boussier) 10 months ago
> Expanding string in some encoding (utf16 utf32) can change coderange to either valid or broken,
I must admit I'm not very familiar with wide char encodings, but this surprises me a bit. Ruby strings should always have their terminator, so I don't see how expanding a string would change their interpretation.
Updated by Eregon (Benoit Daloze) 10 months ago
byroot (Jean Boussier) wrote in #note-5:
> I must admit I'm not very familiar with wide char encodings, but this surprises me a bit. Ruby strings should always have their terminator, so I don't see how expanding a string would change their interpretation.
It's because in UTF-16, if the number of bytes is not a multiple of 2, the string is CR_BROKEN. The same goes for UTF-32 if the number of bytes is not a multiple of 4.
And since `rb_str_resize()` changes the `String#bytesize`, that condition can change:
irb(main):002:0> "a".force_encoding(Encoding::UTF_16LE).valid_encoding?
=> false
irb(main):003:0> "a\x00".force_encoding(Encoding::UTF_16LE).valid_encoding?
=> true
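To make the multiple-of-2 and multiple-of-4 rule concrete, here is a small sketch that lists which byte lengths give a valid string when every byte is NUL. The helper name `valid_lengths` is made up for this illustration.

```ruby
# Return the byte lengths (1..max_bytes) at which a string of NUL bytes
# is valid in the given encoding.
def valid_lengths(encoding, max_bytes)
  (1..max_bytes).select do |n|
    ("\0".b * n).force_encoding(encoding).valid_encoding?
  end
end

utf16 = valid_lengths(Encoding::UTF_16LE, 8)
utf32 = valid_lengths(Encoding::UTF_32LE, 8)

p utf16  # [2, 4, 6, 8]
p utf32  # [4, 8]
```

Any resize that moves the byte length across one of these boundaries changes the validity, which is why a cached coderange goes stale.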