Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string. - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #8653

closed

Unexpected result of String#succ with utf-16 and utf-32 string.

Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string.

Added by phasis68 (Heesob Park) over 12 years ago. Updated over 12 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 2.1.0dev (2013-07-17 trunk 42011) [i386-mingw32]

Backport:

1.9.3: UNKNOWN, 2.0.0: UNKNOWN

[ruby-core:56071]

Description

I found the result of String#succ of UTF-16LE encoded string is incorrect.

As a result, Range of UTF-16LE encoded string show some unexpected behavior.

C:\work>irb
irb(main):001:0> a = 'A'.encode('UTF-16LE')
=> "A"
irb(main):002:0> b = 'B'.encode('UTF-16LE')
=> "B"
irb(main):003:0> a.succ
=> "\u0141"
irb(main):004:0> r = a..b
=> "A".."B"
irb(main):005:0> r.to_s
=> "A\u2E2EB"
irb(main):006:0> r.count
=> 3
irb(main):007:0> r.to_a
=> ["A", "\u0141", "\u0241"]
irb(main):008:0> r.include?(b)
=> false
irb(main):009:0> a = 'A'.encode('UTF-32LE')
=> "A"
irb(main):010:0> b = 'B'.encode('UTF-32LE')
=> "B"
irb(main):011:0> a.succ
=> "\u{1000041}"
irb(main):012:0> r = a..b
=> "A".."B"
irb(main):013:0> r.to_s
=> "A\u{422E2E}\x00\x00"
irb(main):014:0> r.count
=> 16777217
irb(main):015:0> r.to_a
[FATAL] failed to allocate memory

C:\work>

Updated by akr (Akira Tanaka) over 12 years ago Actions
Copy link
#1 [ruby-core:56073]

2013/7/18 phasis68 (Heesob Park) phasis@gmail.com:

Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string.
https://bugs.ruby-lang.org/issues/8653

I found the result of String#succ of UTF-16LE encoded string is incorrect.

As a result, Range of UTF-16LE encoded string show some unexpected behavior.

I don't say the bahavior is incorrect.

% ruby -e 'p "A".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"A\x00"
% ruby -e 'p "A".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"A\x01"
% ruby -e 'p "B".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"B\x00"
% ruby -e 'p "B".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"B\x01"

String#succ generates the bytewise lexicographicaly next characters
successfully.

I agree that is not intuitive.
But it is very difficult to define String#succ in encoding neutral way.¶

Tanaka Akira

Updated by phasis68 (Heesob Park) over 12 years ago Actions
Copy link
#2 [ruby-core:56075]

I understand String#succ is not easy for UTF-16LE encoded string.

In case of UTF-16 or UTF-32 string, it is possible to convert it to UTF-8 string and get succ value and revert it to the original encoding.

Here is a draft patch for rb_str_succ

diff --git a/string.c b/string.c.new
index f7a12e0..f933052 100644
--- a/string.c
+++ b/string.c.new
@@ -3032,6 +3032,7 @@ enc_succ_alnum_char(char *p, long len, rb_encoding *enc, char *carry)
VALUE
rb_str_succ(VALUE orig)
{

int idx;
rb_encoding *enc;
VALUE str;
char *sbeg, *s, *e, *last_alnum = 0;
@@ -3041,12 +3042,26 @@ rb_str_succ(VALUE orig)
long carry_pos = 0, carry_len = 1;
enum neighbor_char neighbor = NEIGHBOR_FOUND;

str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
rb_enc_cr_str_copy_for_substr(str, orig);

idx = ENCODING_GET(orig);
switch(idx) {
```
   case ENCINDEX_UTF_16BE:
```
```
   case ENCINDEX_UTF_16LE:
```
```
   case ENCINDEX_UTF_32BE:
```
```
   case ENCINDEX_UTF_32LE:
```
```
   case ENCINDEX_UTF_16:
```
```
   case ENCINDEX_UTF_32:
```

       str = rb_str_encode(orig, rb_enc_from_encoding(rb_utf8_encoding()), 0, Qnil);

```
       break;
```
```
   default:
```

       str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));

       rb_enc_cr_str_copy_for_substr(str, orig);

}
OBJ_INFECT(str, orig);
if (RSTRING_LEN(str) == 0) return str;

enc = STR_ENC_GET(orig);

enc = STR_ENC_GET(str);
sbeg = RSTRING_PTR(str);
s = e = sbeg + RSTRING_LEN(str);

@@ -3066,6 +3081,15 @@ rb_str_succ(VALUE orig)
case NEIGHBOR_NOT_CHAR:
continue;
case NEIGHBOR_FOUND:

```
   switch(idx) {
```
```
       case ENCINDEX_UTF_16BE:
```
```
       case ENCINDEX_UTF_16LE:
```
```
       case ENCINDEX_UTF_32BE:
```
```
       case ENCINDEX_UTF_32LE:
```
```
       case ENCINDEX_UTF_16:
```
```
       case ENCINDEX_UTF_32:
```

           str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);

   }      
  return str;
case NEIGHBOR_WRAPPED:
  last_alnum = s;

@@ -3103,6 +3127,17 @@ rb_str_succ(VALUE orig)
STR_SET_LEN(str, RSTRING_LEN(str) + carry_len);
RSTRING_PTR(str)[RSTRING_LEN(str)] = '\0';
rb_enc_str_coderange(str);
+

switch(idx) {
```
   case ENCINDEX_UTF_16BE:
```
```
   case ENCINDEX_UTF_16LE:
```
```
   case ENCINDEX_UTF_32BE:
```
```
   case ENCINDEX_UTF_32LE:
```
```
   case ENCINDEX_UTF_16:
```
```
   case ENCINDEX_UTF_32:
```

       str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);

}
return str;
}

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#3

Status changed from Open to Closed
% Done changed from 0 to 100

This issue was solved with changeset r42078.
Heesob, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

string.c: wchar succ

string.c (enc_succ_char, enc_pred_char): consider wchar case.
[ruby-core:56071] [Bug #8653]
string.c (rb_str_succ): do not replace with invalid char.

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Tags

Custom queries

Bug #8653

Unexpected result of String#succ with utf-16 and utf-32 string.

Updated by akr (Akira Tanaka) over 12 years ago Actions
Copy link
#1 [ruby-core:56073]

I agree that is not intuitive.
But it is very difficult to define String#succ in encoding neutral way.¶

Updated by phasis68 (Heesob Park) over 12 years ago Actions
Copy link
#2 [ruby-core:56075]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#3

Project

General

Profile

Ruby

Tags

Custom queries

Bug #8653

Unexpected result of String#succ with utf-16 and utf-32 string.

Updated by akr (Akira Tanaka) over 12 years ago ActionsCopy link #1 [ruby-core:56073]

I agree that is not intuitive. But it is very difficult to define String#succ in encoding neutral way.¶

Updated by phasis68 (Heesob Park) over 12 years ago ActionsCopy link #2 [ruby-core:56075]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago ActionsCopy link #3

Updated by akr (Akira Tanaka) over 12 years ago Actions
Copy link
#1 [ruby-core:56073]

I agree that is not intuitive.
But it is very difficult to define String#succ in encoding neutral way.¶

Updated by phasis68 (Heesob Park) over 12 years ago Actions
Copy link
#2 [ruby-core:56075]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#3