Bug #564
closed
Regexp fails on UTF-16 & UTF-32 character encodings
Description
=begin
UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings) don't seem to work as Regexp patterns.
Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings: US-ASCII and UTF-16BE
=end
Updated by matz (Yukihiro Matsumoto) over 16 years ago
- Status changed from Open to Rejected
=begin
=end
Updated by naruse (Yui NARUSE) over 16 years ago
=begin
Hi,
James Gray wrote:
> On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:
>> On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira <akr@fsij.org> wrote:
>>> In article <48cddb5533ad_8725cd9524342@redmine.ruby-lang.org>,
>>> Michael Selig <redmine@ruby-lang.org> writes:
>>>> UTF-16 & UTF-32 (and maybe other non-ascii compatible encodings)
>>>> don't seem to be work as Regexp patterns.
>>>>
>>>> Regexp.new("abc".encode("UTF-16BE"))
>>>> ==> EncodingCompatibilityError: incompatible character encodings:
>>>> US-ASCII and UTF-16BE
>>>
>>> % ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~
>>> "abc".encode("UTF-16BE")'
>>> ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
>>> 0
>>
>> I see, I have diagnosed the problem wrongly. I was using irb.
>>
>> ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
>> ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
>> -e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT (EncodingCompatibilityError) from -e:1:in `<main>'
>>
>> This is the error I was getting in irb, and I mistakenly assumed it
>> was from the Regexp::new.
>> It is a different problem - not as bad as I thought!
>
> So it's inspect() that has the issues, right?

Yes, the cause of this problem is Regexp#inspect.
A patch follows.

--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
 {
     VALUE str = rb_str_buf_new2("/");
-    rb_enc_copy(str, re);
+    rb_enc_associate(str, rb_usascii_encoding());
     rb_reg_expr_str(str, s, len);
     rb_str_buf_cat2(str, "/");
     if (re) {

The result of Regexp#inspect is only for viewing the content of a regexp when debugging,
so there may be no reason to keep the original encoding.
Of course, Regexp#source must keep it.
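
A minimal Ruby sketch of the distinction drawn above, assuming an interpreter that contains a fix along the lines of this patch. The helper name `utf16_regexp_report` is hypothetical and exists only for illustration. It checks that matching works when both sides are UTF-16BE, that Regexp#source keeps the pattern's encoding, and that Regexp#inspect returns a string in an ASCII-compatible encoding. The exact text returned by inspect is deliberately not shown, since it may differ between Ruby versions.

```ruby
# Hypothetical helper: reports how a UTF-16BE pattern behaves as a Regexp.
def utf16_regexp_report
  pattern = "abc".encode("UTF-16BE")
  re = Regexp.new(pattern)
  {
    # Pattern and subject share the same encoding, so matching is allowed.
    match_index: re =~ "abc".encode("UTF-16BE"),
    # Regexp#source is expected to keep the original encoding.
    source_encoding: re.source.encoding,
    # Assumption: inspect yields an ASCII-compatible string after the fix.
    inspect_ascii_compatible: re.inspect.encoding.ascii_compatible?
  }
end

utf16_regexp_report.each { |key, value| puts "#{key}: #{value}" }
```
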
Anyway, Regexp#to_s is an alias of Regexp#source now.
But Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?

  r1 = /ab+c/ix           #=> /ab+c/ix
  s1 = r1.to_s            #=> "(?ix-m:ab+c)"
  r2 = Regexp.new(s1)     #=> /(?ix-m:ab+c)/
  r1 == r2                #=> false
  r1.source               #=> "ab+c"
  r2.source               #=> "(?ix-m:ab+c)"
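
As a companion to the listing above, the following sketch puts the three string forms side by side and feeds two of them back into Regexp.new. The helper name `regexp_string_forms` is illustrative, not part of Ruby. It is only meant to show what each form carries: to_s embeds the option flags in the pattern, while the inspect form includes the surrounding slashes, which Regexp.new treats as literal characters.

```ruby
# Illustrative helper: collects the three string forms of a regexp.
def regexp_string_forms(re)
  { source: re.source, to_s: re.to_s, inspect: re.inspect }
end

forms = regexp_string_forms(/ab+c/ix)
p forms

# Rebuilding from to_s keeps the case-insensitive flag inside the pattern.
from_to_s = Regexp.new(forms[:to_s])
p from_to_s =~ "ABBC"        # matches at index 0

# Rebuilding from inspect turns the slashes and flags into literal text.
from_inspect = Regexp.new(forms[:inspect])
p from_inspect =~ "ABBC"     # no match (nil)
p from_inspect =~ "/abbc/ix" # matches the literal text at index 0
```
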
--
NARUSE, Yui <naruse@airemix.jp>
=end
Updated by matz (Yukihiro Matsumoto) over 16 years ago
=begin
Hi,
In message "Re: [ruby-core:18610] Re: [Bug #564] Regexp fails on UTF-16 & UTF-32 character encodings"
on Tue, 16 Sep 2008 04:53:18 +0900, "NARUSE, Yui" <naruse@airemix.jp> writes:
|> So it's inspect() that has the issues, right?
|
|Yes, the cause of this problem is Regexp#inspect.
|A patch follows.
Can you commit?
matz.
=end