Bug #2636

Incorrect UTF-16 string length

Added by scritch (Vincent Isambart) almost 11 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
ruby -v:
ruby 1.9.2dev (2010-01-22 trunk 26370) [x86_64-darwin10.2.0]
Backport:
[ruby-core:27748]

Description

=begin
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3

This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2 because it's made of two (unpaired) surrogates and not 3.

The strangest part is that the length agrees with how the string is displayed by #inspect ("\xDC\u0BD8\x40"), but not with what #[] does. If the length is 3, then why does str[2] return nil?
=end
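For anyone trying to reproduce the input, here is a minimal sketch of how such a string can be built by swapping the two 16-bit words of a non-BMP character. U+2000B is an assumption on my part: the report does not name the character, but its UTF-16BE encoding happens to be the bytes D8 40 DC 0B used above.

```ruby
# Assumed example character: U+2000B encodes to D8 40 DC 0B in UTF-16BE.
original = "\u{2000B}".encode(Encoding::UTF_16BE)

# Split the four bytes into two 16-bit words and swap them.
words   = original.bytes.each_slice(2).to_a
swapped = words.reverse.flatten.pack("C*").force_encoding(Encoding::UTF_16BE)

p original.valid_encoding?                     # the properly ordered pair is valid
p swapped.valid_encoding?                      # the swapped pair is not
p swapped.bytes.map { |b| format("%02X", b) }  # hex dump of the swapped bytes
```

The swapped string holds the same four bytes as the literal in the report, so it can be used interchangeably with it.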

#1

Updated by naruse (Yui NARUSE) almost 11 years ago

  • Status changed from Open to Rejected

=begin
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.
=end
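To spell out why the byte order matters, here is a small sketch that classifies each 16-bit code unit by the UTF-16 surrogate ranges (high: D800..DBFF, low: DC00..DFFF). The helper name `classify_units` is made up for illustration; it is not part of Ruby.

```ruby
# Hypothetical helper: label each big-endian 16-bit unit of a byte string.
def classify_units(bytes)
  high = 0xD800..0xDBFF  # high (leading) surrogates
  low  = 0xDC00..0xDFFF  # low (trailing) surrogates
  bytes.unpack("n*").map do |unit|
    if high.cover?(unit)
      :high
    elsif low.cover?(unit)
      :low
    else
      :bmp
    end
  end
end

p classify_units("\xD8\x40\xDC\x0B".b)  # => [:high, :low]
p classify_units("\xDC\x0B\xD8\x40".b)  # => [:low, :high]
```

A valid pair is a high surrogate followed by a low one; the string in the report has them the other way round, so both units are unpaired.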

#2

Updated by naruse (Yui NARUSE) almost 11 years ago

=begin
Or the following will explain this:

"\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)
=> "\xDC\u0BD8\x40"

=end

#3

Updated by duerst (Martin Dürst) over 10 years ago

=begin
What needs to be fixed here is the data, nothing else:

irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false

Returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases, it's basically garbage in, garbage out.
Single (unpaired) surrogates are not characters in UTF-16. The most
correct answer might be "nil", in the sense of "sorry, wrong question".

The only reason #length just returns something, rather than throwing an
error, for the above case, is efficiency.

Regards, Martin.

On 2010/01/24 14:36, Tanaka Akira wrote:

2010/1/24 Vincent Isambart <redmine@ruby-lang.org>:

Bug #2636: Incorrect UTF-16 string length
http://redmine.ruby-lang.org/issues/show/2636

str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3

This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2 because it's made of two (unpaired) surrogates and not 3.

Fixed.

% ./ruby -ve '
s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
p s
p s.length'
ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
"\xDC\x0B\xD8\x40"
2

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
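The "sorry, wrong question" idea above can be sketched in user code by checking valid_encoding? before asking for a length. `checked_length` is a hypothetical helper name, not something Ruby provides, and this is only one possible way to treat invalid input.

```ruby
# Hypothetical helper: answer nil instead of a number for invalid data.
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

valid   = "\xD8\x40\xDC\x0B".dup.force_encoding(Encoding::UTF_16BE)
invalid = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

p checked_length(valid)    # => 1
p checked_length(invalid)  # => nil
```

Note that valid_encoding? has to scan the string, which is the efficiency cost mentioned above for doing this inside #length itself.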

#4

Updated by scritch (Vincent Isambart) over 10 years ago

=begin

What needs to be fixed here is the data, nothing else:

irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false

Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).

My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.

And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

=end
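The consistency being asked for here can be written down as a small check: every index below #length should return something from #[], and the index equal to #length should return nil. This is a sketch with a made-up helper name, `length_matches_indexing?`. What it prints for the invalid string depends on the Ruby build, so no expected output is given for that case.

```ruby
# Hypothetical helper: do #length and #[] agree on where the string ends?
def length_matches_indexing?(str)
  in_range = (0...str.length).all? { |i| !str[i].nil? }
  in_range && str[str.length].nil?
end

valid   = "\u{2000B}".encode(Encoding::UTF_16BE)
invalid = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

p length_matches_indexing?(valid)    # => true
p length_matches_indexing?(invalid)  # result depends on the Ruby version
```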

#5

Updated by naruse (Yui NARUSE) over 10 years ago

  • Status changed from Rejected to Closed

=begin

My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.

Ah, I see. Current behavior seems correct.
=end

#6

Updated by duerst (Martin Dürst) over 10 years ago

=begin
On 2010/01/25 16:37, Vincent Isambart wrote:

What needs to be fixed here is the data, nothing else:

irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false

Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).

My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.

And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

I don't oppose Akira's fix, but expecting consistent output from
inconsistent input is essentially futile. I sincerely hope nobody will
add this case to a test suite or will claim that this is THE right way
to do things.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
