Bug #7090 (closed)
UTF-16LE String#<< append 0x0 for certain codepoints
Description
$ irb193 -r unicode_utils/u
irb(main):001:0> RUBY_VERSION
=> "1.9.3"
irb(main):002:0> s1 = "".force_encoding('utf-16le')
=> ""
irb(main):003:0> s1 << 0x20
=> " "
irb(main):004:0> s1 << 0x300
=> " \u0000"
irb(main):005:0> U.debug s1
Char | Ordinal | Sid | General Category | UTF-8
------+---------+-------+------------------+-------
" " | 20 | SPACE | Space_Separator | 20
N/A | 0 | NULL | Control | 00
=> nil
irb(main):006:0> s2 = "".force_encoding('utf-8')
=> ""
irb(main):007:0> s2 << 0x20
=> " "
irb(main):008:0> s2 << 0x300
=> " ̀"
irb(main):009:0> U.debug s2
Char | Ordinal | Sid | General Category | UTF-8
------+---------+------------------------+------------------+-------
" " | 20 | SPACE | Space_Separator | 20
N/A | 300 | COMBINING GRAVE ACCENT | Nonspacing_Mark | CC 80
=> nil
IMO, the behaviour with the UTF-8 string is correct.
$ ri193 'String#<<'
= String#<<
(from ruby core)
str << integer -> str
str.concat(integer) -> str
str << obj -> str
str.concat(obj) -> str
Append---Concatenates the given object to str. If the object is a
Integer, it is considered as a codepoint, and is converted to a character
before concatenation.
a = "hello "
a << "world" #=> "hello world"
a.concat(33) #=> "hello world!"
AFAIK, a Ruby 1.9 string can be viewed as either 1) a sequence of raw bytes,
or 2) a sequence of codepoints.
Except for maybe regexes, Ruby has no higher-level concept of a "character"
than a codepoint, so I don't know what "and is converted to
a character before concatenation" means.
If we take the sequence-of-codepoints view, then "str << integer" simply
appends a codepoint.
If we take the sequence-of-bytes view, then "str << integer" converts
the codepoint into the sequence of bytes that corresponds to the codepoint
in str.encoding and appends that sequence of bytes.
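The two views can be compared directly. The following is a minimal sketch of the expected result, assuming a Ruby build that includes the fix for this bug. `String.new` and `pack('U*')` are used only for illustration and are not part of the original report.

```ruby
# Build the string as in the report, starting from a mutable empty string
# (assumption: a Ruby where this bug is fixed).
s = String.new.force_encoding(Encoding::UTF_16LE)
s << 0x20
s << 0x300

# Codepoint view: the integers that were appended.
codepoints = s.codepoints

# Byte view: each codepoint encoded in s.encoding (little-endian 16-bit units).
bytes = s.bytes

# Reference bytes, obtained by transcoding the same codepoints from UTF-8.
reference = [0x20, 0x300].pack('U*').encode(Encoding::UTF_16LE).bytes

puts codepoints.map { |c| format('U+%04X', c) }.join(' ')
puts bytes.map { |b| format('%02X', b) }.join(' ')
puts reference.map { |b| format('%02X', b) }.join(' ')
```

In both views the result should describe the same two codepoints; only the representation differs.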
Updated by stefan (Stefan Lang) about 12 years ago
UTF-16BE
irb(main):003:0> s = "".force_encoding('utf-16be')
=> ""
irb(main):004:0> s << 0x20
=> "\u0000"
irb(main):005:0> s << 0x300
=> "\u0000\u0300"
Updated by stefan (Stefan Lang) about 12 years ago
With an older Ruby version, ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux],
the string correctly contains 0x20, 0x300 for UTF-8, UTF-16LE and UTF-16BE.
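A small check along these lines could be used to compare Ruby builds. This is only a sketch: `append_codepoints` is a hypothetical helper introduced here, not something from the report or from Ruby itself.

```ruby
# Hypothetical helper: append each codepoint with String#<< and return
# the codepoints that the resulting string actually contains.
def append_codepoints(encoding, codepoints)
  str = String.new.force_encoding(encoding)
  codepoints.each { |cp| str << cp }
  str.codepoints
end

input = [0x20, 0x300]
results = %w[UTF-8 UTF-16LE UTF-16BE].map do |name|
  [name, append_codepoints(name, input)]
end.to_h

# Report, per encoding, whether the appended codepoints round-trip.
results.each do |name, got|
  status = got == input ? 'ok' : 'MISMATCH'
  puts format('%-8s %-8s %s', name, status, got.inspect)
end
```

On an affected build, the UTF-16LE and UTF-16BE rows would be expected to show a mismatch, as in the transcripts above.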
Updated by naruse (Yui NARUSE) about 12 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r37058.
Stefan, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
- string.c (rb_str_concat): use memcpy to copy a string which contains
NUL characters. [ruby-core:47751] [Bug #7090]
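The log entry suggests that the encoded bytes were previously copied by a routine that stops at the first NUL byte. That is an inference from the commit message, not something stated in this thread. Under that assumption, the sketch below simulates such a copy in Ruby; `encoded_bytes` and `nul_terminated_copy` are hypothetical helpers written for this illustration and do not reflect the actual C code in string.c.

```ruby
# Hypothetical helper: the bytes of one codepoint in the given encoding.
def encoded_bytes(codepoint, encoding)
  [codepoint].pack('U').encode(encoding).bytes
end

# Hypothetical simulation of a NUL-terminated copy into a zero-filled
# buffer of the same length: everything from the first NUL on is lost.
def nul_terminated_copy(bytes)
  kept = bytes.take_while { |b| b != 0 }
  kept + [0] * (bytes.length - kept.length)
end

[['UTF-16LE', 0x20], ['UTF-16LE', 0x300],
 ['UTF-16BE', 0x20], ['UTF-16BE', 0x300]].each do |encoding, cp|
  original = encoded_bytes(cp, encoding)
  copied = nul_terminated_copy(original)
  puts format('%-8s U+%04X %-8s -> %s', encoding, cp,
              original.inspect, copied.inspect)
end
```

Under this assumption, U+0300 in UTF-16LE and U+0020 in UTF-16BE both begin with a NUL byte and come out as two zero bytes, which is consistent with the "\u0000" seen in the transcripts above. A length-based copy such as memcpy is not affected by NUL bytes.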