Bug #5855
Closed
inconsistent treatment of 8 bit characters in US-ASCII
Description
=begin
Does Ruby allow 8 bit characters (128-255) in a US-ASCII encoded string, or not?
"\u{80}".encode("US-ASCII")      #=> Encoding::UndefinedConversionError
0x80.chr("US-ASCII")             #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128     #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
=end
        
           Updated by naruse (Yui NARUSE) almost 14 years ago
      - Status changed from Open to Rejected
U+0080 of Unicode can't be mapped to 0x80 of US-ASCII.
In US-ASCII, the codepoint 0x80 exists, but doesn't define any character.
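The point about U+0080 having no US-ASCII mapping can be illustrated with a small sketch. It is not part of the original report, and the helper name `to_us_ascii` is hypothetical, introduced only for illustration.

```ruby
# Hypothetical helper: transcode a string to US-ASCII, returning nil
# when some character has no US-ASCII mapping.
def to_us_ascii(str)
  str.encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError
  nil
end

c1 = "\u0080"              # U+0080, stored as two bytes in UTF-8
p c1.bytes                 # [194, 128]
p to_us_ascii("abc")       # "abc"
p to_us_ascii(c1)          # nil, because U+0080 has no US-ASCII mapping

# Transcoding can also substitute a replacement character instead of raising.
p c1.encode(Encoding::US_ASCII, undef: :replace)   # "?"
```

Transcoding works on characters, so the question of whether the byte 0x80 may appear in a US-ASCII string never arises on this path.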
        
           Updated by john_firebaugh (John Firebaugh) almost 14 years ago
      Unless MRI has some non-standard definition of the term "codepoint", your second statement is incorrect. In US-ASCII, the codepoint 0x80 does not exist.
IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
        
           Updated by naruse (Yui NARUSE) almost 14 years ago
      - Tracker changed from Bug to Feature
- Status changed from Rejected to Assigned
- Assignee set to naruse (Yui NARUSE)
> IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
In other words,
"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
For example, US-ASCII clearly doesn't include \u00A3 (Pound Sign).
So it must raise Encoding::UndefinedConversionError.
0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
In Ruby, a string is a sequence of 8-bit bytes.
So a US-ASCII string, although US-ASCII is a 7-bit encoding, is stored as an 8-bit string in Ruby.
So the byte 0x80 can appear in it, even though the resulting string is invalid.
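A minimal sketch of this byte-level view, assuming nothing beyond core Ruby. The helper name `raw_us_ascii` is hypothetical; it relabels a raw byte as US-ASCII without transcoding, so it does not depend on how `Integer#chr` behaves in a given version.

```ruby
# Hypothetical helper: build a one-byte string and label it US-ASCII
# without any transcoding or validity check.
def raw_us_ascii(byte)
  [byte].pack("C").force_encoding(Encoding::US_ASCII)
end

s = raw_us_ascii(0x80)
p s.bytes                             # [128]
p s.encoding == Encoding::US_ASCII    # true
p s.valid_encoding?                   # false: the byte is stored, but it is not valid US-ASCII
p raw_us_ascii(0x41).valid_encoding?  # true
```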
"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
Maybe both of them should be ASCII-8BIT.
        
           Updated by john_firebaugh (John Firebaugh) almost 14 years ago
      =begin
> Maybe both of them should be ASCII-8BIT.
I would prefer not, as then String#<< with an Integer ((|i|)) can't be defined as (({self << i.chr(self.encoding)})).
I think it would make much more sense for (({"".encode("US-ASCII") << 128})) and (({128.chr("US-ASCII")})) both to raise RangeError. The current behavior is just weird:
a = "".encode("US-ASCII") << 128
b = 128.chr("US-ASCII")
a == b #=> true
a.valid_encoding? #=> true
b.valid_encoding? #=> false
=end
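The definition proposed in this comment, appending an Integer by way of `i.chr(self.encoding)`, can be sketched as an ordinary method. The name `append_codepoint` is hypothetical; this is an illustration of the proposal, not the implementation in string.c.

```ruby
# Hypothetical helper mirroring the proposed definition of String#<< for
# an Integer argument: convert the codepoint using the receiver's
# encoding, then append the resulting one-character string.
def append_codepoint(str, codepoint)
  str << codepoint.chr(str.encoding)
end

utf8 = String.new(encoding: Encoding::UTF_8)
append_codepoint(utf8, 0x3042)
p utf8 == "\u3042"   # true

ascii = String.new(encoding: Encoding::US_ASCII)
append_codepoint(ascii, 65)
p ascii              # "A"
```

Under this definition, whatever `Integer#chr` does with a codepoint that is out of range for the encoding (such as raising RangeError, as suggested above) carries over to the append automatically.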
        
           Updated by naruse (Yui NARUSE) almost 14 years ago
      - Tracker changed from Feature to Bug
        
           Updated by naruse (Yui NARUSE) almost 14 years ago
      - Status changed from Assigned to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r34236.
John, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
- numeric.c (rb_enc_uint_char): raise RangeError when added codepoint is invalid. [Feature #5855] [Bug #5863] [Bug #5864]
- string.c (rb_str_concat): ditto.
- string.c (rb_str_concat): set encoding as ASCII-8BIT when the string is US-ASCII and the argument is an integer greater than 127.
- regenc.c (onigenc_mb2_code_to_mbclen): rearrange error code.
- enc/euc_jp.c (code_to_mbclen): ditto.
- enc/shift_jis.c (code_to_mbclen): ditto.
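The promotion described in the second string.c entry can be checked with a short sketch, assuming a Ruby that includes this changeset. The helper name `encoding_after_append` is hypothetical.

```ruby
# Hypothetical helper: append an integer to an empty US-ASCII string
# and report the encoding of the result.
def encoding_after_append(codepoint)
  s = String.new(encoding: Encoding::US_ASCII)
  s << codepoint
  s.encoding
end

# A 7-bit value leaves the string as US-ASCII.
p encoding_after_append(65) == Encoding::US_ASCII      # true

# A value above 127 is expected to promote the string to ASCII-8BIT,
# per the changelog entry above.
p encoding_after_append(128) == Encoding::ASCII_8BIT   # true
```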