Bug #5855: inconsistent treatment of 8 bit characters in US-ASCII - Ruby - Ruby Issue Tracking System


Added by john_firebaugh (John Firebaugh) almost 14 years ago. Updated almost 14 years ago.

Status: Closed
Assignee: naruse (Yui NARUSE)


[ruby-core:41949]

Description

Does Ruby allow 8-bit characters (128-255) in a US-ASCII encoded string, or not?

"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)

Related issues 2 (0 open — 2 closed)

Updated by naruse (Yui NARUSE) almost 14 years ago
#1 [ruby-core:41956]

Status changed from Open to Rejected

U+0080 of Unicode can't be mapped to 0x80 of US-ASCII.
In US-ASCII, the codepoint 0x80 exists, but doesn't define any character.
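
The transcoding case can be reproduced in isolation. The following is a minimal sketch, assuming a current MRI; the helper name to_us_ascii is hypothetical and only exists for this illustration. It shows that String#encode refuses U+0080 because there is no US-ASCII character to map it to, and that the undef: :replace option substitutes a replacement instead of raising.

```ruby
# Hypothetical helper (not part of Ruby): transcode to US-ASCII and
# report a failed conversion as a message instead of raising.
def to_us_ascii(str)
  str.encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  # U+0080 and above have no US-ASCII counterpart, so transcoding fails.
  "conversion failed: #{e.class}"
end

puts to_us_ascii("abc")     # plain ASCII transcodes without trouble
puts to_us_ascii("\u{80}")  # there is no mapping for U+0080

# String#encode can also substitute a replacement instead of raising.
# For a non-Unicode target encoding the default replacement is "?".
puts "\u{80}".encode(Encoding::US_ASCII, undef: :replace)
```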

Updated by john_firebaugh (John Firebaugh) almost 14 years ago
#2 [ruby-core:41966]

Unless MRI has some non-standard definition of the term "codepoint", your second statement is incorrect. In US-ASCII, the codepoint 0x80 does not exist.

IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.

Updated by naruse (Yui NARUSE) almost 14 years ago
#3

Tracker changed from Bug to Feature
Status changed from Rejected to Assigned
Assignee set to naruse (Yui NARUSE)

> IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.

In other words,

"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError

For example, US-ASCII clearly doesn't include \u00A3 (Pound Sign).
So the conversion must raise Encoding::UndefinedConversionError.

0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)

In Ruby, a string is a sequence of 8-bit bytes.
So US-ASCII, a 7-bit encoding, is stored as an 8-bit string in Ruby.
So the byte 0x80 can be present, even though the resulting string is invalid.
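
The point about byte storage can be checked without Integer#chr at all. This is a small sketch, assuming a current MRI: it builds the byte with Array#pack and relabels it with String#force_encoding, which changes only the encoding tag and leaves the bytes alone.

```ruby
# Build a one-byte binary string holding 0x80, then tag it as US-ASCII.
# force_encoding does not transcode or validate; it only relabels the bytes.
s = [0x80].pack("C").force_encoding(Encoding::US_ASCII)

p s.bytes            # the raw byte is still stored in the string
p s.encoding         # the string is tagged as US-ASCII
p s.valid_encoding?  # false: 0x80 is not a valid US-ASCII byte
p s.ascii_only?      # false: the string contains a byte above 127
```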

"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)

Maybe both of them should be ASCII-8BIT.

Updated by john_firebaugh (John Firebaugh) almost 14 years ago
#4 [ruby-core:41974]

> Maybe both of them should be ASCII-8BIT.

I would prefer not, as then String#<< with an Integer i can't be defined as self << i.chr(self.encoding).

I think it would make much more sense for "".encode("US-ASCII") << 128 and 128.chr("US-ASCII") both to raise RangeError. The current behavior is just weird:

a = "".encode("US-ASCII") << 128
b = 128.chr("US-ASCII")
a == b #=> true
a.valid_encoding? #=> true
b.valid_encoding? #=> false
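
The proposed definition of String#<< for an Integer argument can be sketched for a case where both paths are uncontroversial. This assumes a current MRI and a UTF-8 receiver; append_via_chr is a hypothetical helper written for this illustration, not an existing method.

```ruby
# Hypothetical helper: append codepoint i by first turning it into a
# one-character string in the receiver's own encoding.
def append_via_chr(str, i)
  str << i.chr(str.encoding)
end

# Path 1: String#<< with an Integer argument.
a = String.new("x", encoding: Encoding::UTF_8) << 0x3042

# Path 2: the definition proposed in the comment above.
b = append_via_chr(String.new("x", encoding: Encoding::UTF_8), 0x3042)

p a == b                     # both paths append U+3042
p a.encoding == b.encoding   # and both results stay UTF-8
```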

Updated by naruse (Yui NARUSE) almost 14 years ago
#5

Tracker changed from Feature to Bug

Updated by naruse (Yui NARUSE) almost 14 years ago
#6

Status changed from Assigned to Closed
% Done changed from 0 to 100

This issue was solved with changeset r34236.
John, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

numeric.c (rb_enc_uint_char): raise RangeError when added codepoint is invalid. [Feature #5855] [Bug #5863] [Bug #5864]
string.c (rb_str_concat): ditto.
string.c (rb_str_concat): set encoding as ASCII-8BIT when the string is US-ASCII and the argument is an integer greater than 127.
regenc.c (onigenc_mb2_code_to_mbclen): rearrange error code.
enc/euc_jp.c (code_to_mbclen): ditto.
enc/shift_jis.c (code_to_mbclen): ditto.
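
The behavior described by this changeset can be checked with a short script. This is a sketch, assuming an MRI version that includes the change; raised_class is a hypothetical helper used only to capture the exception class.

```ruby
# Hypothetical helper: return the class of the exception raised by the
# block, or nil if the block completes normally.
def raised_class
  yield
  nil
rescue StandardError => e
  e.class
end

# Integer#chr with an explicit US-ASCII encoding rejects 128.
p raised_class { 128.chr(Encoding::US_ASCII) }

# Appending an integer above 127 to a US-ASCII string retags the
# receiver as ASCII-8BIT instead of leaving an invalid US-ASCII string.
s = String.new(encoding: Encoding::US_ASCII) << 128
p s.encoding
p s.bytes
```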
