Feature #10770
chr and ord behavior for ill-formed byte sequences and surrogate code points
Status:
Open
Assignee:
-
Target version:
-
Description
ord raises an error when it meets an ill-formed byte sequence, so each_char and each_codepoint behave differently on the same string.
str = "a\x80bc"
str.each_char {|c| puts c }
# no error
str.each_codepoint {|c| puts c }
# invalid byte sequence in UTF-8 (ArgumentError)
One way to keep them consistent is to change ord to return a substitute code point such as 0xFFFD, the replacement character that scrub already uses.
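As a rough illustration of that substitution, the sketch below gets the proposed result with today's API by scrubbing the string first. The helper name lenient_codepoints is hypothetical and is not part of Ruby; this only approximates what a changed ord might return.

```ruby
# Hypothetical helper (not a Ruby API): returns code points, with U+FFFD
# standing in for each ill-formed byte sequence instead of raising.
def lenient_codepoints(str)
  # String#scrub without arguments replaces invalid bytes with U+FFFD
  # when the string's encoding is a Unicode encoding.
  str.scrub.codepoints
end

str = "a\x80bc"
p lenient_codepoints(str)  # => [97, 65533, 98, 99]
```

Here 65533 is 0xFFFD, so the lone \x80 byte no longer aborts the iteration.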
Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode escape literals, ord and chr reject them.
"\uD800".ord
# invalid byte sequence in UTF-8 (ArgumentError)
0xD800.chr('UTF-8')
# invalid codepoint 0xD800 in UTF-8 (RangeError)
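For comparison, a surrogate can already be read back by decoding the bytes by hand. This is only a sketch: ord_allowing_surrogates is a hypothetical helper, and the string is built from explicit bytes (the three-byte form U+D800 takes when encoded the UTF-8 way) rather than from a \u literal.

```ruby
# Hypothetical helper (not a Ruby API): like String#ord, but also accepts
# a 3-byte sequence that encodes a surrogate (U+D800..U+DFFF).
def ord_allowing_surrogates(ch)
  b = ch.bytes
  # Surrogates encoded the UTF-8 way always look like ED A0..BF 80..BF.
  if b.size == 3 && b[0] == 0xED &&
     (0xA0..0xBF).cover?(b[1]) && (0x80..0xBF).cover?(b[2])
    ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
  else
    ch.ord
  end
end

s = "\xED\xA0\x80".dup.force_encoding(Encoding::UTF_8)
p s.valid_encoding?                       # => false
p ord_allowing_surrogates(s).to_s(16)     # => "d800"
```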
How about removing the restriction? One use case for surrogate code points is converting a 4-byte character into a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.
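str = "\u{1F436}" # DOG FACE
cp = str.ord
if cp >= 0x10000
  # Split the supplementary code point into a UTF-16 surrogate pair.
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # Needs the proposed change: chr currently raises RangeError here.
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end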
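Until chr accepts surrogates, the same pair can be assembled from raw bytes. The following is a minimal sketch of that workaround, assuming the caller really wants each surrogate encoded as a 3-byte sequence; utf8_3byte and to_surrogate_pair_bytes are hypothetical names, not Ruby APIs.

```ruby
# Hypothetical helper: encode a code point in 0x0800..0xFFFF as three
# UTF-8-style bytes, without the surrogate check that Integer#chr applies.
def utf8_3byte(cp)
  bytes = [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  bytes.pack('C*').force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: turn one supplementary character into a pair of
# 3-byte surrogate sequences; BMP characters are returned unchanged.
def to_surrogate_pair_bytes(str)
  cp = str.ord
  return str if cp < 0x10000
  lead  = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  utf8_3byte(lead) + utf8_3byte(trail)
end

pair = to_surrogate_pair_bytes("\u{1F436}")  # DOG FACE
p pair.bytes.map { |b| b.to_s(16) }          # => ["ed", "a0", "bd", "ed", "b0", "b6"]
```

The result is tagged UTF-8 but is not valid UTF-8, so valid_encoding? returns false for it.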