=begin
What needs to be fixed here is the data, nothing else:
irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false
Returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases it's basically garbage in, garbage out.
Single (unpaired) surrogates are not characters in UTF-16. The most
correct answer might be "nil", in the sense of "sorry, wrong question".
The only reason #length just returns something, rather than throwing an
error, for the above case, is efficiency.
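As a rough sketch of the "sorry, wrong question" idea, a caller can check the data before asking for its length. The helper name checked_length is hypothetical and is not part of Ruby; it only wraps the existing String#valid_encoding? and String#length.

```ruby
# Hypothetical helper (not part of Ruby): answer nil for invalid data
# instead of returning some character count.
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

# Unpaired surrogates: low surrogate DC0B followed by high surrogate D840.
swapped = "\xDC\x0B\xD8\x40".b.force_encoding(Encoding::UTF_16BE)
# The same two words in the correct order form one valid surrogate pair.
paired  = "\xD8\x40\xDC\x0B".b.force_encoding(Encoding::UTF_16BE)

p checked_length(swapped)  # => nil
p checked_length(paired)   # => 1
```

This costs one extra pass over the string, which is the efficiency trade-off mentioned above.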
Regards, Martin.
On 2010/01/24 14:36, Tanaka Akira wrote:
2010/1/24 Vincent Isambart <redmine@ruby-lang.org>:
Bug #2636: Incorrect UTF-16 string length
http://redmine.ruby-lang.org/issues/show/2636
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3
This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2, not 3, because the string is made of two (unpaired) surrogates.
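For illustration, the same bytes can be produced by encoding a non-BMP character and swapping its two 16-bit words. This is only a sketch; U+2000B is chosen here because its surrogate pair is D840 DC0B, which matches the bytes in the report.

```ruby
# Encode a character outside the BMP as UTF-16BE: one surrogate pair.
char  = "\u{2000B}".encode(Encoding::UTF_16BE)
words = char.unpack("n*")   # 16-bit big-endian words: [0xD840, 0xDC0B]

# Swap the two words, leaving a low surrogate followed by a high surrogate.
swapped = words.reverse.pack("n*").force_encoding(Encoding::UTF_16BE)

p char.length              # => 1
p swapped.bytes            # => [220, 11, 216, 64], i.e. DC 0B D8 40
p swapped.valid_encoding?  # => false
```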
Fixed.
% ./ruby -ve '
s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
p s
p s.length'
ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
"\xDC\x0B\xD8\x40"
2
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
=end