Feature #10770

Updated by nobu (Nobuyoshi Nakada) almost 10 years ago

`ord` raises an error when it meets an ill-formed byte sequence, so `each_char` and `each_codepoint` behave differently on the same string.

~~~ruby
str = "a\x80bc"
str.each_char {|c| puts c }
# no error
str.each_codepoint {|c| puts c }
# invalid byte sequence in UTF-8 (ArgumentError)
~~~

One way to keep them consistent is to change `ord` to return a substitute code point such as 0xFFFD, the one `scrub` adopts.
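
For comparison, here is a minimal sketch of what can be done today without changing `ord`: scrubbing the string first makes `each_codepoint` yield U+FFFD for the ill-formed byte. The variable names are only for illustration, and the sketch assumes a UTF-8 source file.

```ruby
str = "a\x80bc"

# Today the two iterators disagree on ill-formed input.
chars = str.each_char.to_a            # no error
error = nil
begin
  str.each_codepoint.to_a
rescue ArgumentError => e
  error = e.message                   # invalid byte sequence in UTF-8
end

# Workaround: scrub first, so the ill-formed byte is replaced by U+FFFD.
codepoints = str.scrub.each_codepoint.to_a
```

With the proposed change, `each_codepoint` would presumably return the same code points without the explicit `scrub` call.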

Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode escape literals, `ord` and `chr` don't allow them.

~~~ruby
"\uD800".ord
# invalid byte sequence in UTF-8 (ArgumentError)

0xD800.chr('UTF-8')
# invalid codepoint 0xD800 in UTF-8 (RangeError)
~~~

How about removing the restriction? One use of surrogate code points is converting a 4-byte character into a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

~~~ruby
str = "\u{1F436}" # DOG FACE
cp = str.ord

# Supplementary code points start at U+10000, so the bound is inclusive.
if cp >= 0x10000
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # Raises RangeError today; this is the call the proposal would allow.
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end
~~~
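
Until `chr` accepts surrogates, the same bytes can be built by hand. The following is only a sketch: `surrogate_pair` and `three_byte_sequence` are hypothetical helper names, not part of Ruby, and the result is a binary string because the bytes are not well-formed UTF-8.

```ruby
# Hypothetical helpers for illustration; not part of Ruby's core API.

# Split a supplementary code point into its UTF-16 surrogate pair.
def surrogate_pair(cp)
  unless (0x10000..0x10FFFF).cover?(cp)
    raise ArgumentError, "not a supplementary code point"
  end
  lead  = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  [lead, trail]
end

# Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 bit layout,
# without the surrogate check that chr('UTF-8') applies.
def three_byte_sequence(cp)
  [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)].pack("C*")
end

lead, trail = surrogate_pair("\u{1F436}".ord) # DOG FACE
ret = three_byte_sequence(lead) + three_byte_sequence(trail)
```

Here `lead` and `trail` are 0xD83D and 0xDC36, and `ret` holds the six bytes ED A0 BD ED B0 B6. Whether a given database accepts such bytes depends on its configuration.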
