Bug #564

closed

Regexp fails on UTF-16 & UTF-32 character encodings

Added by mike (Michael Selig) almost 17 years ago. Updated over 14 years ago.

Status: Rejected

[ruby-core:18594]

Description

=begin
UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings) don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings: US-ASCII and UTF-16BE
=end

Updated by matz (Yukihiro Matsumoto) almost 17 years ago

Status changed from Open to Rejected

Updated by naruse (Yui NARUSE) almost 17 years ago

=begin
Hi,

James Gray wrote:

On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:

On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira akr@fsij.org wrote:

In article 48cddb5533ad_8725cd9524342@redmine.ruby-lang.org,
Michael Selig redmine@ruby-lang.org writes:

UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings)
don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings:
US-ASCII and UTF-16BE

% ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~
"abc".encode("UTF-16BE")'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
0

I see, I have diagnosed the problem wrongly. I was using irb.

ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
-e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT (EncodingCompatibilityError)
	from -e:1:in `<main>'

This is the error I was getting in irb, and I mistakenly assumed it
was from the Regexp::new.
It is a different problem - not as bad as I thought!
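The diagnosis above comes down to running each call on its own, so that an exception can be attributed to the right step. A minimal sketch of that approach follows; `probe_regexp` is a hypothetical helper name, not something from this thread, and which steps succeed depends on the Ruby version in use.

```ruby
# Hypothetical helper: run construction, matching, and inspection separately
# and record which step (if any) raises.
def probe_regexp(pattern, subject)
  report = {}
  re = nil
  steps = {
    construct: -> { re = Regexp.new(pattern) },
    match:     -> { re =~ subject },
    inspect:   -> { re.inspect },
  }
  steps.each do |name, step|
    begin
      report[name] = [:ok, step.call]
    rescue StandardError => e
      report[name] = [:error, e.class]
    end
  end
  report
end

utf16 = "abc".encode("UTF-16BE")
report = probe_regexp(utf16, utf16)
report.each { |name, (status, _value)| puts "#{name}: #{status}" }
```

With the revision reported above, the failure would be expected to show up only in the `inspect` step, since `p` calls `inspect` on its argument.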

So it's inspect() that has the issues, right?

Yes, the cause of this problem is Regexp#inspect.
A patch follows.

--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
 {
     VALUE str = rb_str_buf_new2("/");
 
-    rb_enc_copy(str, re);
+    rb_enc_associate(str, rb_usascii_encoding());
     rb_reg_expr_str(str, s, len);
     rb_str_buf_cat2(str, "/");
     if (re) {

The result of Regexp#inspect is only for viewing the content of a regexp when debugging,
so there may be no reason to keep the original encoding.

Of course Regexp#source must keep it.
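As a rough illustration of that requirement (a sketch assuming a current Ruby, not output taken from this thread), the string returned by Regexp#source can be compared with the pattern it was built from. The encoding of the inspect string is printed without a stated expectation, because it depends on how a given Ruby version implements Regexp#inspect.

```ruby
pattern = "abc".encode("UTF-16BE")
re = Regexp.new(pattern)

# source should round-trip the original pattern, encoding included.
source = re.source
puts source.encoding
puts source == pattern

# inspect is a debugging aid; its encoding need not match the pattern's.
puts re.inspect.encoding
```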

Anyway, Regexp#to_s is an alias of Regexp#source now,
but Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?

```
 r1 = /ab+c/ix           #=> /ab+c/ix
 s1 = r1.to_s            #=> "(?ix-m:ab+c)"
 r2 = Regexp.new(s1)     #=> /(?ix-m:ab+c)/
 r1 == r2                #=> false
 r1.source               #=> "ab+c"
 r2.source               #=> "(?ix-m:ab+c)"
```
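One practical difference between the two forms, sketched here under the assumption of a current Ruby: the string from Regexp#to_s can be embedded in a larger pattern with its options intact, whereas the string from Regexp#inspect is delimiter syntax, so Regexp.new treats its slashes and flag letters as literal characters. The variable names below are illustrative only.

```ruby
r1 = /ab+c/ix

# Interpolation uses to_s, so the i and x options travel with the fragment.
embedded = Regexp.new("x#{r1}y")

# Feeding inspect output back in yields a pattern for the text "/ab+c/ix".
from_inspect = Regexp.new(r1.inspect)

puts embedded.match?("xABBCy")
puts from_inspect.match?("ABBC")
puts from_inspect.match?("/abbc/ix")
```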

--
NARUSE, Yui naruse@airemix.jp

=end

Updated by matz (Yukihiro Matsumoto) almost 17 years ago

=begin
Hi,

In message "Re: [ruby-core:18610] Re: [Bug #564] Regexp fails on UTF-16 & UTF-32 character encodings"
on Tue, 16 Sep 2008 04:53:18 +0900, "NARUSE, Yui" naruse@airemix.jp writes:

|> So it's inspect() that has the issues, right?
|
|Yes, the cause of this problem is Regexp#inspect.
|A patch follows.

Can you commit?

						matz.

=end
