Feature #10770: chr and ord behavior for ill-formed byte sequences and surrogate code points - Ruby - Ruby Issue Tracking System

Feature #10770

open

chr and ord behavior for ill-formed byte sequences and surrogate code points

Added by masakielastic (Masaki Kagaya) about 11 years ago. Updated about 11 years ago.

Status: Open

Assignee:

Target version:

[ruby-dev:48836]

Description

ord raises an error when it encounters an ill-formed byte sequence, so each_char and each_codepoint behave differently on the same input.

str = "a\x80bc"
str.each_char {|c| puts c }
 # no error
str.each_codepoint {|c| puts c }
 # invalid byte sequence in UTF-8 (ArgumentError)

One way to keep them consistent is to change ord to return a substitute code point such as 0xFFFD, which scrub already uses.
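For comparison, here is a minimal sketch of what explicit substitution looks like today, assuming the caller is willing to call scrub before iterating. The force_encoding call is only there so the sketch does not depend on the source file's encoding; it is not part of the proposal.

```ruby
# "\x80" is a lone continuation byte, which is ill-formed in UTF-8.
# force_encoding keeps the example independent of the source encoding.
str = "a\x80bc".dup.force_encoding(Encoding::UTF_8)

# each_char yields the ill-formed byte as its own one-byte string.
chars = str.each_char.to_a

# each_codepoint raises ArgumentError on the same input.
raised = begin
  str.each_codepoint.to_a
  false
rescue ArgumentError
  true
end

# Explicit substitution: scrub replaces ill-formed bytes with U+FFFD
# (the default replacement for Unicode encodings) before iteration.
scrubbed_codepoints = str.scrub.each_codepoint.to_a
# => [97, 65533, 98, 99]
```

This keeps the substitution visible at the call site, whereas the proposal would make ord and each_codepoint do it implicitly.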

Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode escape literals, ord and chr do not accept them.

"\uD800".ord
 # invalid byte sequence in UTF-8 (ArgumentError)

0xD800.chr('UTF-8')
 # invalid codepoint 0xD800 in UTF-8 (RangeError)

How about removing the restriction? One use case for surrogate code points is converting a 4-byte character into a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

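str = "\u{1F436}" # DOG FACE
cp = str.ord

# Supplementary code points start at 0x10000, so the comparison is inclusive.
if cp >= 0x10000 then
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # With the current restriction, both chr calls raise RangeError.
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end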
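The surrogate arithmetic itself can be checked without chr, since it only involves integers. The following is a rough sketch; surrogate_pair is a hypothetical helper name introduced for illustration, not an existing Ruby method. It cross-checks the result against Ruby's own UTF-16 encoder.

```ruby
# Hypothetical helper (not part of Ruby): split a supplementary code point
# into its UTF-16 lead and trail surrogate values.
def surrogate_pair(cp)
  unless (0x10000..0x10FFFF).cover?(cp)
    raise ArgumentError, "not a supplementary code point"
  end
  lead  = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  [lead, trail]
end

cp = 0x1F436 # DOG FACE
lead, trail = surrogate_pair(cp)
# lead  => 0xD83D
# trail => 0xDC36

# Cross-check: encode the character as UTF-16BE and read the
# big-endian 16-bit code units back with "n*".
utf16_units = [cp].pack("U").encode("UTF-16BE").unpack("n*")
```

Both paths should give the same two values, which are the code points the proposal would like chr('UTF-8') to accept.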

Updated by masakielastic (Masaki Kagaya) about 11 years ago
#1 [ruby-dev:48837]

This issue comes from discussion about mruby's behavior (https://github.com/mruby/mruby/issues/2708).

Updated by nobu (Nobuyoshi Nakada) about 11 years ago
#2 [ruby-dev:48839]

Description updated (diff)

Masaki Kagaya wrote:

str = "a\x80bc"
str.each_char {|c| puts c }
 # no error

Sounds like a bug in String#each_char, but it may be intentional.

One way to keep them consistent is to change ord to return a substitute code point such as 0xFFFD, which scrub already uses.

Implicit substitution doesn't feel like a nice idea to me.

How about removing the restriction? One use case for surrogate code points is converting a 4-byte character into a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

Primarily, that is the responsibility of those bindings.

# "n*" reads big-endian 16-bit units, matching UTF-16BE ("v*" is little-endian).
str.encode("UTF-16BE").unpack("n*").pack("U*")
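As a worked example of the pack/unpack approach, the sketch below runs it on DOG FACE and inspects the bytes. It assumes "n*" for reading the UTF-16BE code units as big-endian values, and relies on CRuby's Array#pack("U*") not rejecting surrogate code points, which is how this workaround avoids the RangeError from chr.

```ruby
str = "\u{1F436}" # DOG FACE, 4 bytes in UTF-8

# Split into UTF-16 code units, then write each unit out as if it were
# a code point of its own. Each surrogate becomes a 3-byte sequence.
converted = str.encode("UTF-16BE").unpack("n*").pack("U*")

original_size  = str.bytesize        # 4
converted_size = converted.bytesize  # 6 (two 3-byte sequences)

# Hex dump of the result: "ed a0 bd ed b0 b6"
hex = converted.bytes.map { |b| format("%02x", b) }.join(" ")

# The result is tagged UTF-8 but is not well-formed UTF-8.
well_formed = converted.valid_encoding?
```

Because the result is not valid UTF-8, most String methods that decode characters will raise or misbehave on it, so it is best treated as opaque bytes handed to the database binding.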
