Bug #2636: Incorrect UTF-16 string length - Ruby - Ruby Issue Tracking System

Bug #2636

closed

Incorrect UTF-16 string length

Added by scritch (Vincent Isambart) over 15 years ago. Updated over 14 years ago.

Status:

Closed

Assignee:

Target version:

1.9.2

ruby -v:

ruby 1.9.2dev (2010-01-22 trunk 26370) [x86_64-darwin10.2.0]

Backport:

[ruby-core:27748]

Description

=begin
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3

This string is made by swapping the two 16-bit words of a UTF-16 character outside the BMP.
The length should be 2, not 3, because the string consists of two unpaired surrogates.

The strangest part is that the length agrees with how the string is displayed by #inspect ("\xDC\u0BD8\x40") but not with what #[] does. If the length is 3, then why does str[2] return nil?
=end
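For readers following along, here is a minimal sketch of the two byte sequences under discussion: the well-formed surrogate pair and the swapped version from the report. The constant names `PAIRED` and `SWAPPED` and the helper `decode_pair` are illustrative only and do not come from the report. Only results that do not depend on the fix discussed in this thread are annotated.

```ruby
# Well-formed: high surrogate 0xD840 followed by low surrogate 0xDC0B.
PAIRED = "\xD8\x40\xDC\x0B".dup.force_encoding(Encoding::UTF_16BE)

# The report's string: the same two 16-bit words in swapped order,
# i.e. a lone low surrogate followed by a lone high surrogate.
SWAPPED = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

# Hypothetical helper: standard UTF-16 surrogate-pair arithmetic.
def decode_pair(high, low)
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

p PAIRED.valid_encoding?                      # true
p PAIRED.length                               # 1 (one character outside the BMP)
p PAIRED.ord == decode_pair(0xD840, 0xDC0B)   # true, both are 0x2000B
p SWAPPED.valid_encoding?                     # false
p SWAPPED.bytesize                            # 4 (two 16-bit code units)
```

The swapped string holds two 16-bit code units, which is why the reporter expects a length of 2.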

History

Updated by naruse (Yui NARUSE) over 15 years ago

Status changed from Open to Rejected

=begin
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.
=end


Updated by naruse (Yui NARUSE) over 15 years ago

=begin
Or the following will explain this:

"\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)
=> "\xDC\u0BD8\x40"

=end


Updated by duerst (Martin Dürst) over 15 years ago

=begin
What needs to be fixed here is the data, nothing else:

irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false

Returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases, it's basically garbage in, garbage out.
Single (unpaired) surrogates are not characters in UTF-16. The most
correct answer might be "nil", in the sense of "sorry, wrong question".

The only reason #length just returns something, rather than throwing an
error, for the above case, is efficiency.

Regards, Martin.

On 2010/01/24 14:36, Tanaka Akira wrote:

> 2010/1/24 Vincent Isambart <redmine@ruby-lang.org>:
>
>> Bug #2636: Incorrect UTF-16 string length
>> http://redmine.ruby-lang.org/issues/show/2636
>>
>> str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
>> str.length #=> 3
>>
>> This string is made by inverting 2 words of a UTF-16 character not in the BMP.
>> The length should be 2 because it's made of two (unpaired) surrogates and not 3.
>
> Fixed.
>
> % ./ruby -ve '
> s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
> p s
> p s.length'
> ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
> "\xDC\x0B\xD8\x40"
> 2

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
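Martin's suggestion that the most honest answer for malformed input might be "nil" can be approximated in user code by checking validity before asking for a length. The sketch below is only an illustration of that idea; `checked_length` is a hypothetical helper and not part of Ruby's String API.

```ruby
# Hypothetical helper: return the character count only when the bytes
# are valid in the string's encoding, otherwise nil ("wrong question").
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

valid   = "\xD8\x40\xDC\x0B".dup.force_encoding(Encoding::UTF_16BE)
invalid = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

p checked_length(valid)    # 1
p checked_length(invalid)  # nil
```

Calling `valid_encoding?` scans the string, which is the cost Martin alludes to when he says #length skips the check for efficiency.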


Updated by scritch (Vincent Isambart) over 15 years ago

=begin

> What needs to be fixed here is the data, nothing else:
>
> irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
> => "\xDC\x{BD8}\x40"
> irb(main):002:> s.valid_encoding?
> => false

Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).

My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.

And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

=end
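The consistency property described above (every index below #length yields a character, and the index equal to #length yields nil) can be written as a small check. This is a sketch under the assumption that it runs on a Ruby that includes the fix mentioned in this thread; `length_matches_indexing?` is a hypothetical name.

```ruby
# Hypothetical helper: true when #length and #[] agree, i.e. every
# index in 0...length returns a character and str[length] is nil.
def length_matches_indexing?(str)
  n = str.length
  (0...n).none? { |i| str[i].nil? } && str[n].nil?
end

s = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

p s.length                      # reported as 3 before the fix, 2 after it
p length_matches_indexing?(s)   # the property the report asks for
p s.chars.length == s.length    # iteration should agree with #length too
```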


Updated by naruse (Yui NARUSE) over 15 years ago

Status changed from Rejected to Closed

=begin

> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.

Ah, I see. Current behavior seems correct.
=end


Updated by duerst (Martin Dürst) over 15 years ago

=begin
On 2010/01/25 16:37, Vincent Isambart wrote:

>> What needs to be fixed here is the data, nothing else:
>>
>> irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
>> => "\xDC\x{BD8}\x40"
>> irb(main):002:> s.valid_encoding?
>> => false
>
> Yes I know the data is invalid UTF-16. I created it on purpose (to
> test code I'm working on for MacRuby).
>
> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.
>
> And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

I don't oppose Akira's fix, but expecting consistent output from
inconsistent input is essentially futile. I sincerely hope nobody will
add this case to a test suite or will claim that this is THE right way
to do things.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
