Bug #18973
closedKernel#sprintf: %c allows codepoints above 127 for 7-bits ASCII encoding
Description
I've noticed the following behavior:
sprintf("%c".encode("US-ASCII"), 128)
=> "\x80"
sprintf("%c".encode("US-ASCII"), 128).valid_encoding?
=> false
Specifying codepoints 128-255 for ASCII encoded formatting sequence leads to a broken string.
sprintf("%c".encode("US-ASCII"), 255)
=> "\xFF"
sprintf("%c".encode("US-ASCII"), 256)
(irb):17:in `sprintf': 256 out of char range (RangeError)
Specifying codepoint greater that 255 causes the expected exception out of char range.
I suppose this exception should be raised for codepoints 128-255 as well (for ASCII encoding).
        
          
          Updated by Eregon (Benoit Daloze) about 3 years ago
          
          
        
        
      
      I noticed https://github.com/ruby/ruby/blob/master/benchmark/app_aobench.rb seems to rely on this behavior.
But that is easily fixed by using # coding: BINARY instead of # coding: US-ASCII.
I think it would be good to fix this issue, so sprintf("%c".encode("US-ASCII"), 128) is out of char range (RangeError), just like it is an exception for:
> 128.chr(Encoding::US_ASCII)
(irb):2:in `chr': invalid codepoint 0x80 in US-ASCII (RangeError)
        
          
          Updated by jeremyevans0 (Jeremy Evans) about 3 years ago
          
          
        
        
      
      I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/6276
        
          
          Updated by nobu (Nobuyoshi Nakada) about 3 years ago
          
          
        
        
      
      diff --git a/regenc.c b/regenc.c
index 16d62fdf409..5cc3b778351 100644
--- a/regenc.c
+++ b/regenc.c
@@ -627,6 +627,10 @@ onigenc_single_byte_mbc_to_code(const UChar* p, const UChar* end ARG_UNUSED,
 extern int
 onigenc_single_byte_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
 {
+#ifdef RUBY
+  if (code > 0xff)
+    return 0;
+#endif
   return 1;
 }
 
        
          
          Updated by nobu (Nobuyoshi Nakada) about 3 years ago
          
          
        
        
      
      Sorry, this.
diff --git a/enc/us_ascii.c b/enc/us_ascii.c
index 08f9072c435..9d854b12245 100644
--- a/enc/us_ascii.c
+++ b/enc/us_ascii.c
@@ -7,6 +7,12 @@
 # define ENCINDEX_US_ASCII 0
 #endif
 
+static int
+us_ascii_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
+{
+  return !(code & 0x80);
+}
+
 static int
 us_ascii_mbc_enc_len(const UChar* p, const UChar* e, OnigEncoding enc)
 {
@@ -22,7 +28,7 @@ OnigEncodingDefine(us_ascii, US_ASCII) = {
   1,           /* min byte length */
   onigenc_is_mbc_newline_0x0a,
   onigenc_single_byte_mbc_to_code,
-  onigenc_single_byte_code_to_mbclen,
+  us_ascii_code_to_mbclen,
   onigenc_single_byte_code_to_mbc,
   onigenc_ascii_mbc_case_fold,
   onigenc_ascii_apply_all_case_fold,
        
          
          Updated by Eregon (Benoit Daloze) about 3 years ago
          
          
        
        
      
      @nobu (Nobuyoshi Nakada) Looks good to me, could you commit it?
(ARG_UNUSED is not needed on code I think)
        
          
          Updated by mame (Yusuke Endoh) about 3 years ago
          
          
        
        
      
      I am concerned about compatibility to change this. @naruse (Yui NARUSE) proposed to return an ASCII-8BIT string instead of raising an exception.
        
          
          Updated by mame (Yusuke Endoh) about 3 years ago
          
          
        
        
      
      We brielfly discussed this issue at the dev meeting. @naruse (Yui NARUSE) said it should behave like String#<< as follows. @nobu (Nobuyoshi Nakada) said he would try to implement this.
s = "".force_encoding("US-ASCII")
s << 128
p s          #=> "\x80"
p s.encoding #=> #<Encoding:ASCII-8BIT>
s = "".force_encoding("US-ASCII")
s << 256 #=> 256 out of char range (RangeError)
        
          
          Updated by nobu (Nobuyoshi Nakada) about 3 years ago
          
          
        
        
      
      - Status changed from Open to Closed
 
Applied in changeset git|576bdec03f0d58847690a0607c788ada433ce60f.
[Bug #18973] Promote US-ASCII to ASCII-8BIT when adding 8-bit char
        
          
          Updated by nobu (Nobuyoshi Nakada) over 1 year ago
          
          
        
        
      
      - Related to Bug #20566: string << 0xC2 should raise a RangeError if the string encoding is Encoding::ASCII added