Bug #8653
closedUnexpected result of String#succ with utf-16 and utf-32 string.
Description
I found the result of String#succ of UTF-16LE encoded string is incorrect.
As a result, Range of UTF-16LE encoded string show some unexpected behavior.
C:\work>irb
irb(main):001:0> a = 'A'.encode('UTF-16LE')
=> "A"
irb(main):002:0> b = 'B'.encode('UTF-16LE')
=> "B"
irb(main):003:0> a.succ
=> "\u0141"
irb(main):004:0> r = a..b
=> "A".."B"
irb(main):005:0> r.to_s
=> "A\u2E2EB"
irb(main):006:0> r.count
=> 3
irb(main):007:0> r.to_a
=> ["A", "\u0141", "\u0241"]
irb(main):008:0> r.include?(b)
=> false
irb(main):009:0> a = 'A'.encode('UTF-32LE')
=> "A"
irb(main):010:0> b = 'B'.encode('UTF-32LE')
=> "B"
irb(main):011:0> a.succ
=> "\u{1000041}"
irb(main):012:0> r = a..b
=> "A".."B"
irb(main):013:0> r.to_s
=> "A\u{422E2E}\x00\x00"
irb(main):014:0> r.count
=> 16777217
irb(main):015:0> r.to_a
[FATAL] failed to allocate memory
C:\work>
Updated by akr (Akira Tanaka) over 11 years ago
2013/7/18 phasis68 (Heesob Park) phasis@gmail.com:
Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string.
https://bugs.ruby-lang.org/issues/8653
I found the result of String#succ of UTF-16LE encoded string is incorrect.
As a result, Range of UTF-16LE encoded string show some unexpected behavior.
I don't say the bahavior is incorrect.
% ruby -e 'p "A".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"A\x00"
% ruby -e 'p "A".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"A\x01"
% ruby -e 'p "B".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"B\x00"
% ruby -e 'p "B".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"B\x01"
String#succ generates the bytewise lexicographicaly next characters
successfully.
I agree that is not intuitive.
But it is very difficult to define String#succ in encoding neutral way.
Tanaka Akira
Updated by phasis68 (Heesob Park) over 11 years ago
I understand String#succ is not easy for UTF-16LE encoded string.
In case of UTF-16 or UTF-32 string, it is possible to convert it to UTF-8 string and get succ value and revert it to the original encoding.
Here is a draft patch for rb_str_succ
diff --git a/string.c b/string.c.new
index f7a12e0..f933052 100644
--- a/string.c
+++ b/string.c.new
@@ -3032,6 +3032,7 @@ enc_succ_alnum_char(char *p, long len, rb_encoding *enc, char *carry)
VALUE
rb_str_succ(VALUE orig)
{
- int idx;
rb_encoding *enc;
VALUE str;
char *sbeg, *s, *e, *last_alnum = 0;
@@ -3041,12 +3042,26 @@ rb_str_succ(VALUE orig)
long carry_pos = 0, carry_len = 1;
enum neighbor_char neighbor = NEIGHBOR_FOUND;
- str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
- rb_enc_cr_str_copy_for_substr(str, orig);
- idx = ENCODING_GET(orig);
- switch(idx) {
-
case ENCINDEX_UTF_16BE:
-
case ENCINDEX_UTF_16LE:
-
case ENCINDEX_UTF_32BE:
-
case ENCINDEX_UTF_32LE:
-
case ENCINDEX_UTF_16:
-
case ENCINDEX_UTF_32:
-
str = rb_str_encode(orig, rb_enc_from_encoding(rb_utf8_encoding()), 0, Qnil);
-
break;
-
default:
-
str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
-
rb_enc_cr_str_copy_for_substr(str, orig);
- }
- OBJ_INFECT(str, orig);
if (RSTRING_LEN(str) == 0) return str;
- enc = STR_ENC_GET(orig);
- enc = STR_ENC_GET(str);
sbeg = RSTRING_PTR(str);
s = e = sbeg + RSTRING_LEN(str);
@@ -3066,6 +3081,15 @@ rb_str_succ(VALUE orig)
case NEIGHBOR_NOT_CHAR:
continue;
case NEIGHBOR_FOUND:
-
switch(idx) {
-
case ENCINDEX_UTF_16BE:
-
case ENCINDEX_UTF_16LE:
-
case ENCINDEX_UTF_32BE:
-
case ENCINDEX_UTF_32LE:
-
case ENCINDEX_UTF_16:
-
case ENCINDEX_UTF_32:
-
str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
-
} return str; case NEIGHBOR_WRAPPED: last_alnum = s;
@@ -3103,6 +3127,17 @@ rb_str_succ(VALUE orig)
STR_SET_LEN(str, RSTRING_LEN(str) + carry_len);
RSTRING_PTR(str)[RSTRING_LEN(str)] = '\0';
rb_enc_str_coderange(str);
+
- switch(idx) {
-
case ENCINDEX_UTF_16BE:
-
case ENCINDEX_UTF_16LE:
-
case ENCINDEX_UTF_32BE:
-
case ENCINDEX_UTF_32LE:
-
case ENCINDEX_UTF_16:
-
case ENCINDEX_UTF_32:
-
str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
- }
- return str;
}
Updated by nobu (Nobuyoshi Nakada) over 11 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r42078.
Heesob, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
string.c: wchar succ
- string.c (enc_succ_char, enc_pred_char): consider wchar case.
[ruby-core:56071] [Bug #8653] - string.c (rb_str_succ): do not replace with invalid char.