=begin
Hi,
James Gray wrote:
On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:
On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira akr@fsij.org wrote:
In article 48cddb5533ad_8725cd9524342@redmine.ruby-lang.org,
Michael Selig redmine@ruby-lang.org writes:
UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings)
don't seem to work as Regexp patterns.
Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings:
US-ASCII and UTF-16BE
% ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~
"abc".encode("UTF-16BE")'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
0
I see, I have diagnosed the problem wrongly. I was using irb.
ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
-e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT (EncodingCompatibilityError)
	from -e:1:in `<main>'
This is the error I was getting in irb, and I mistakenly assumed it
was from the Regexp::new.
It is a different problem - not as bad as I thought!
So it's inspect() that has the issues, right?
Yes, the cause of this problem is Regexp#inspect.
A patch follows.
--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
{
VALUE str = rb_str_buf_new2("/");
- rb_enc_associate(str, rb_usascii_encoding());
rb_reg_expr_str(str, s, len);
rb_str_buf_cat2(str, "/");
if (re) {
The result of Regexp#inspect is only used to view the content of a regexp for debugging,
so there may be no reason to keep the original encoding.
Of course, Regexp#source must keep it.
Anyway, Regexp#to_s is an alias of Regexp#source now.
But Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?
  r1 = /ab+c/ix         #=> /ab+c/ix
  s1 = r1.to_s          #=> "(?ix-m:ab+c)"
  r2 = Regexp.new(s1)   #=> /(?ix-m:ab+c)/
  r1 == r2              #=> false
  r1.source             #=> "ab+c"
  r2.source             #=> "(?ix-m:ab+c)"
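For comparison, the three string forms discussed here can be put side by side. This is only a minimal sketch: the variable names (`original`, `forms`, `sample`, and so on) are illustrative and do not come from the thread, and it assumes the Regexp behaviour shown in the example above.

```ruby
# Illustrative names only; nothing here is taken from re.c or the patch.
original = /ab+c/ix

# The three string forms of the same Regexp.
forms = {
  inspect: original.inspect, # literal-style form, meant for reading/debugging
  source:  original.source,  # pattern text only, options are dropped
  to_s:    original.to_s     # options embedded in a (?ix-m:...) group
}
forms.each { |name, text| puts "#{name}: #{text}" }

# Rebuilding from to_s keeps the options; rebuilding from source loses them.
rebuilt_from_to_s   = Regexp.new(original.to_s)
rebuilt_from_source = Regexp.new(original.source)

sample = "xABBC"
p rebuilt_from_to_s   =~ sample  # matches case-insensitively
p rebuilt_from_source =~ sample  # no match: the /i option is gone
```

The sketch suggests why the choice of alias matters: the output of to_s can be fed back into Regexp.new without losing the options, whereas the output of inspect is a literal-style form intended for a human reader.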
--
NARUSE, Yui naruse@airemix.jp
=end