Bug #4014

closed

Case-Sensitivity of Property Names Depends on Regexp Encoding

Added by runpaint (Run Paint Run Run) over 13 years ago. Updated almost 13 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.9.3dev (2010-10-28 trunk 29616) [x86_64-linux]
Backport:
[ruby-core:33000]

Description

=begin
A ticket filed against Read Ruby reminded me of the following inconsistency: in Unicode regexps, property names are case-insensitive; in all other encodings, property names are case-sensitive. This was exacerbated by the reporter's IRB using UTF-8 for regexps, while external scripts used US-ASCII: a seemingly-identical pattern was succeeding in the former case, but failing in the latter.

 run@paint:~$ ruby -e 'p /\p{ascii}/u'
 /\p{ascii}/
 run@paint:~$ ruby -e 'p /\p{ascii}/n'
 -e:1: invalid character property name {ascii}: /\p{ascii}/
 run@paint:~$ ruby -e 'p /\p{ASCII}/n'
 /\p{ASCII}/n
 run@paint:~$ ruby -e 'p /\p{ASCII}/u'
 /\p{ASCII}/

All regexps, regardless of their encoding, support the POSIX bracket names, e.g. xdigit, as properties with the \p{} and \P{} escapes. Unicode regexps normalise the property name by converting to lowercase and ignoring ' ' and '_'. Accordingly, a \p{posix} escape, where posix is a name defined in http://www.opengroup.org/onlinepubs/007908799/xbd/re.html , is case-sensitive in all non-Unicode encodings. Note that this also affects encodings that have other property names in common with Unicode. For example, both Shift_JIS and Unicode define Katakana and Hiragana, yet only Unicode ignores case.

I would prefer if \p{} and \P{} always ignored the case of their arguments. Unicode regexps would override this behaviour so as to ignore ' ' and '_', too.
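
To check how a given Ruby build treats a particular spelling, one option is a small probe along the following lines. This is only a sketch: property_accepted? is a hypothetical helper, not part of Ruby, and Regexp::NOENCODING is used here as the programmatic counterpart of the /n flag in the transcript above. The results depend on the Ruby version, so no output is shown.

```ruby
# Hypothetical probe (not part of Ruby's API): does \p{name} compile
# under the given Regexp options?
def property_accepted?(name, options = 0)
  Regexp.new("\\p{#{name}}", options)
  true
rescue RegexpError
  # Raised for an "invalid character property name".
  false
end

# Compare the default (source) encoding with /n for two spellings.
%w[ascii ASCII].each do |name|
  default = property_accepted?(name)
  binary  = property_accepted?(name, Regexp::NOENCODING)
  puts "#{name}: default=#{default} n=#{binary}"
end
```

Building the pattern with Regexp.new turns what would be a parse-time error for a literal into a RegexpError that can be rescued.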
=end

Actions #1

Updated by naruse (Yui NARUSE) over 13 years ago

  • Status changed from Open to Assigned
  • Assignee set to naruse (Yui NARUSE)
  • Priority changed from 3 to Normal

Actions #2

Updated by duerst (Martin Dürst) over 13 years ago

=begin
I'd personally have preferred that Unicode regexps would stay with past
practice of keeping these things case-sensitive, and otherwise defined
exactly down to the last character. Regexps are for programmers, not for
end users, and programmers know that they have to distinguish upper- and
lower-case. Alas, that may be too late now.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp
=end

Actions #4

Updated by runpaint (Run Paint Run Run) over 13 years ago

=begin
On Tue, Nov 2, 2010 at 5:07 AM, "Martin J. Dürst" wrote:

> I'd personally have preferred that Unicode regexps would stay with past
> practice of keeping these things case-sensitive, and otherwise defined
> exactly down to the last character. Regexps are for programmers, not for end
> users, and programmers know that they have to distinguish upper- and
> lower-case. Alas, that may be too late now.

Unicode TR#18 http://unicode.org/reports/tr18/#Categories:

"The recommended names for UCD properties and property values are in
PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
There are both abbreviated names and longer, more descriptive names.
It is strongly recommended that both names be recognized, and that
loose matching of property names be used, whereby the case
distinctions, whitespace, hyphens, and underbar are ignored."
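
As a rough illustration of the loose matching described in that passage, here is a minimal sketch. Both helper names are hypothetical, and this is neither Ruby's nor Oniguruma's implementation; it covers only the four things the quotation lists (case, whitespace, hyphens, underbar).

```ruby
# Hypothetical helper: reduce a property name to a loose-matching key by
# lowercasing it and dropping whitespace, hyphens and underscores.
def loose_property_key(name)
  name.downcase.gsub(/[\s_-]/, "")
end

# Two spellings name the same property if their keys are equal.
def same_property?(a, b)
  loose_property_key(a) == loose_property_key(b)
end

puts same_property?("Script_Extensions", "script extensions")
puts same_property?("Hiragana", "Katakana")
```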

=end

Actions #5

Updated by naruse (Yui NARUSE) over 13 years ago

=begin
Hmm, it's a difficult problem...

run@paint:~$ ruby -e 'p /\p{ascii}/u'
/\p{ascii}/
run@paint:~$ ruby -e 'p /\p{ascii}/n'
-e:1: invalid character property name {ascii}: /\p{ascii}/
run@paint:~$ ruby -e 'p /\p{ASCII}/n'
/\p{ASCII}/n
run@paint:~$ ruby -e 'p /\p{ASCII}/u'
/\p{ASCII}/

One possible spec would reject \p/\P for non-Unicode regexps, but that would break compatibility for these names:
Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
Print, Punct, Space, Upper, XDigit, ASCII, Word
They are case-sensitive and accepted only by \p/\P, not by [[:alnum:]]-style brackets.
It would have the good side effect that we could assume /\p{Alpha}/ must be a UTF-8 regexp.

Another spec may only allow lower case for non Unicode, but it seems late.
Martin says Unicode's guideline is wrong, but the compatibility for both ruby and other languages
following guideline seems correct.

RunPaint's suggestion is reasonable one, the patch is following:

diff --git a/regenc.c b/regenc.c
index b9b03b0..f0ddd2c 100644
--- a/regenc.c
+++ b/regenc.c
@@ -789,20 +789,20 @@ extern int
 onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
 {
   static const PosixBracketEntryType PBS[] = {
-    PosixBracketEntryInit("Alnum", ONIGENC_CTYPE_ALNUM),
-    PosixBracketEntryInit("Alpha", ONIGENC_CTYPE_ALPHA),
-    PosixBracketEntryInit("Blank", ONIGENC_CTYPE_BLANK),
-    PosixBracketEntryInit("Cntrl", ONIGENC_CTYPE_CNTRL),
-    PosixBracketEntryInit("Digit", ONIGENC_CTYPE_DIGIT),
-    PosixBracketEntryInit("Graph", ONIGENC_CTYPE_GRAPH),
-    PosixBracketEntryInit("Lower", ONIGENC_CTYPE_LOWER),
-    PosixBracketEntryInit("Print", ONIGENC_CTYPE_PRINT),
-    PosixBracketEntryInit("Punct", ONIGENC_CTYPE_PUNCT),
-    PosixBracketEntryInit("Space", ONIGENC_CTYPE_SPACE),
-    PosixBracketEntryInit("Upper", ONIGENC_CTYPE_UPPER),
-    PosixBracketEntryInit("XDigit", ONIGENC_CTYPE_XDIGIT),
-    PosixBracketEntryInit("ASCII", ONIGENC_CTYPE_ASCII),
-    PosixBracketEntryInit("Word", ONIGENC_CTYPE_WORD),
+    PosixBracketEntryInit("alnum", ONIGENC_CTYPE_ALNUM),
+    PosixBracketEntryInit("alpha", ONIGENC_CTYPE_ALPHA),
+    PosixBracketEntryInit("blank", ONIGENC_CTYPE_BLANK),
+    PosixBracketEntryInit("cntrl", ONIGENC_CTYPE_CNTRL),
+    PosixBracketEntryInit("digit", ONIGENC_CTYPE_DIGIT),
+    PosixBracketEntryInit("graph", ONIGENC_CTYPE_GRAPH),
+    PosixBracketEntryInit("lower", ONIGENC_CTYPE_LOWER),
+    PosixBracketEntryInit("print", ONIGENC_CTYPE_PRINT),
+    PosixBracketEntryInit("punct", ONIGENC_CTYPE_PUNCT),
+    PosixBracketEntryInit("space", ONIGENC_CTYPE_SPACE),
+    PosixBracketEntryInit("upper", ONIGENC_CTYPE_UPPER),
+    PosixBracketEntryInit("xdigit", ONIGENC_CTYPE_XDIGIT),
+    PosixBracketEntryInit("ascii", ONIGENC_CTYPE_ASCII),
+    PosixBracketEntryInit("word", ONIGENC_CTYPE_WORD),
   };
   const PosixBracketEntryType *pb, *pbe;
 
@@ -811,7 +811,7 @@ onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
   len = onigenc_strlen(enc, p, end);
   for (pbe = (pb = PBS) + sizeof(PBS)/sizeof(PBS[0]); pb < pbe; ++pb) {
     if (len == pb->len &&
-        onigenc_with_ascii_strncmp(enc, p, end, pb->name, pb->len) == 0)
+        STRNCASECMP(p, pb->name, pb->len) == 0)
       return pb->ctype;
   }
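
For readers who do not want to trace the C, the idea of the patch can be modelled in Ruby roughly as follows. This is a sketch only: the constant and method names are hypothetical, symbols stand in for the ONIGENC_CTYPE_* values, and String#casecmp stands in for STRNCASECMP (both ignore ASCII case).

```ruby
# Hypothetical model of the patched table: names are stored in lower case.
POSIX_BRACKET_CTYPES = {
  "alnum" => :alnum, "alpha" => :alpha, "blank" => :blank, "cntrl" => :cntrl,
  "digit" => :digit, "graph" => :graph, "lower" => :lower, "print" => :print,
  "punct" => :punct, "space" => :space, "upper" => :upper, "xdigit" => :xdigit,
  "ascii" => :ascii, "word" => :word
}.freeze

# Return the ctype for a POSIX bracket name, ignoring ASCII case,
# or nil when the name is not in the table.
def posix_property_ctype(name)
  POSIX_BRACKET_CTYPES.each do |entry, ctype|
    # Same length check plus case-insensitive comparison, as in the patch.
    return ctype if name.length == entry.length && name.casecmp(entry) == 0
  end
  nil
end

p posix_property_ctype("ALnum")
p posix_property_ctype("Katakana")
```

Because the comparison ignores case rather than listing both spellings, mixed-case forms need no extra table entries.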

=end

Actions #6

Updated by runpaint (Run Paint Run Run) over 13 years ago

=begin

> RunPaint's suggestion is reasonable one, the patch is following:

Do we want all valid property names to be case insensitive or just ASCII/Unicode properties? The argument can be made either way, but the former situation leads to a simpler rule. For example, should the first pattern below be valid?

run@paint:~/mir/ruby$ ./ruby -e 'p /\p{katakana}/s'
-e:1: invalid character property name {katakana}: /\p{katakana}/
run@paint:~/mir/ruby$ ./ruby -e 'p /\p{Katakana}/u'
/\p{Katakana}/
run@paint:~/mir/ruby$ ./ruby -e 'p /\p{Katakana}/s'
/\p{Katakana}/

=end

Actions #7

Updated by naruse (Yui NARUSE) over 13 years ago

=begin
2010/11/3 Run Paint Run Run:

> Do we want all valid property names to be case insensitive or just ASCII/Unicode properties?

Properties other than the Unicode ones are derived from the POSIX ctype names.
As you know, those are lower case, so the current implementation is strange.

> The argument can be made either way, but the former situation leads to a simpler rule.
> For example, should the first pattern below be valid?

No, that behavior is not intended.
It seems accidental.

--
NARUSE, Yui

=end

Actions #8

Updated by duerst (Martin Dürst) over 13 years ago

=begin

On 2010/11/03 3:41, Yui NARUSE wrote:

> Issue #4014 has been updated by Yui NARUSE.
>
> Hmm, it's a difficult problem...
>
> Another spec may only allow lower case for non Unicode, but it seems late.
> Martin says Unicode's guideline is wrong,

Well, "was a mistake" is the better way to say it. It's the way it is
now, so we have to live with it.

> but the compatibility for both ruby and other languages
> following guideline seems correct.

Well, there are essentially three choices:

  • Only lowercase for everything. Explicitly diverge from Unicode TR#18
    for the sake of Ruby-internal consistency. But we already allow
    upper-case, so this would create a compatibility problem for Ruby.

  • Allow variants for Unicode, only lowercase for non-Unicode. Each
    follows tradition/specs, but the difference may be annoying, and there
    at least should be some clear documentation.

  • Allow variants for all encodings. Ruby will be more consistent
    internally, but may not be consistent anymore with other non-Unicode
    implementations.

As it is easier to move from lowercase only to allowing variants than
the other way round, I think we should make sure a few more people think
carefully about this before we apply the patch below. Actually, looking
at it, it doesn't accept things such as:
ALNUM, ALnum, aLNUM, aLnUm,...
As far as I understand Unicode TR#, these are all included in "whereby
the case distinctions, ... are ignored.".

Regards, Martin.

> RunPaint's suggestion is reasonable one, the patch is following:
>
> diff --git a/regenc.c b/regenc.c
> index b9b03b0..f0ddd2c 100644
> --- a/regenc.c
> +++ b/regenc.c
> @@ -789,20 +789,20 @@ extern int
>  onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
>  {
>    static const PosixBracketEntryType PBS[] = {
> -    PosixBracketEntryInit("Alnum", ONIGENC_CTYPE_ALNUM),
> -    PosixBracketEntryInit("Alpha", ONIGENC_CTYPE_ALPHA),
> -    PosixBracketEntryInit("Blank", ONIGENC_CTYPE_BLANK),
> -    PosixBracketEntryInit("Cntrl", ONIGENC_CTYPE_CNTRL),
> -    PosixBracketEntryInit("Digit", ONIGENC_CTYPE_DIGIT),
> -    PosixBracketEntryInit("Graph", ONIGENC_CTYPE_GRAPH),
> -    PosixBracketEntryInit("Lower", ONIGENC_CTYPE_LOWER),
> -    PosixBracketEntryInit("Print", ONIGENC_CTYPE_PRINT),
> -    PosixBracketEntryInit("Punct", ONIGENC_CTYPE_PUNCT),
> -    PosixBracketEntryInit("Space", ONIGENC_CTYPE_SPACE),
> -    PosixBracketEntryInit("Upper", ONIGENC_CTYPE_UPPER),
> -    PosixBracketEntryInit("XDigit", ONIGENC_CTYPE_XDIGIT),
> -    PosixBracketEntryInit("ASCII", ONIGENC_CTYPE_ASCII),
> -    PosixBracketEntryInit("Word", ONIGENC_CTYPE_WORD),
> +    PosixBracketEntryInit("alnum", ONIGENC_CTYPE_ALNUM),
> +    PosixBracketEntryInit("alpha", ONIGENC_CTYPE_ALPHA),
> +    PosixBracketEntryInit("blank", ONIGENC_CTYPE_BLANK),
> +    PosixBracketEntryInit("cntrl", ONIGENC_CTYPE_CNTRL),
> +    PosixBracketEntryInit("digit", ONIGENC_CTYPE_DIGIT),
> +    PosixBracketEntryInit("graph", ONIGENC_CTYPE_GRAPH),
> +    PosixBracketEntryInit("lower", ONIGENC_CTYPE_LOWER),
> +    PosixBracketEntryInit("print", ONIGENC_CTYPE_PRINT),
> +    PosixBracketEntryInit("punct", ONIGENC_CTYPE_PUNCT),
> +    PosixBracketEntryInit("space", ONIGENC_CTYPE_SPACE),
> +    PosixBracketEntryInit("upper", ONIGENC_CTYPE_UPPER),
> +    PosixBracketEntryInit("xdigit", ONIGENC_CTYPE_XDIGIT),
> +    PosixBracketEntryInit("ascii", ONIGENC_CTYPE_ASCII),
> +    PosixBracketEntryInit("word", ONIGENC_CTYPE_WORD),
>    };
>    const PosixBracketEntryType *pb, *pbe;
>
> @@ -811,7 +811,7 @@ onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
>    len = onigenc_strlen(enc, p, end);
>    for (pbe = (pb = PBS) + sizeof(PBS)/sizeof(PBS[0]); pb < pbe; ++pb) {
>      if (len == pb->len &&
> -        onigenc_with_ascii_strncmp(enc, p, end, pb->name, pb->len) == 0)
> +        STRNCASECMP(p, pb->name, pb->len) == 0)
>        return pb->ctype;
>    }

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp
=end

Actions #10

Updated by runpaint (Run Paint Run Run) over 13 years ago

=begin

> • Only lowercase for everything. Explicitly diverge from Unicode TR#18
>   for the sake of Ruby-internal consistency. But we already allow
>   upper-case, so this would create a compatibility problem for Ruby.

Breaking backward-compatibility is unreasonable and unnecessary.

> • Allow variants for Unicode, only lowercase for non-Unicode. Each
>   follows tradition/specs, but the difference may be annoying, and there
>   at least should be some clear documentation.

This would break backward-compatibility as well, as the /\p{Katakana}/s example showed.

> • Allow variants for all encodings. Ruby will be more consistent
>   internally, but may not be consistent anymore with other non-Unicode
>   implementations.

By "non-Unicode implementations" are you referring to other regexp engines that don't support Unicode? If such beasts exist, is it our desire to be compatible with them?

> As it is easier to move from lowercase only to allowing variants than
> the other way round, I think we should make sure a few more people think
> carefully about this before we apply the patch below.

It is confusing to raise a SyntaxError when the encoding of a regexp changes implicitly. This needs fixing either by ignoring property case for Unicode properties regardless of the regexp's encoding, or making all property names case-insensitive.

> Actually, looking at it, it doesn't accept things such as:
> ALNUM, ALnum, aLNUM, aLnUm,...
> As far as I understand Unicode TR#, these are all included in "whereby
> the case distinctions, ... are ignored.".

My understanding is the same, and the patch works:

run@paint-desk:~/mir/ruby$ patch <reg.patch
patching file regenc.c
...
run@paint-desk:~/mir/ruby$ irb

/\p{AlNUm}/ #=> /\p{AlNUm}/
/\p{AlNUm}/n #=> /\p{AlNUm}/n
/\p{AlNUM}/n #=> /\p{AlNUM}/n
=end

Actions #11

Updated by naruse (Yui NARUSE) over 13 years ago

=begin
Here is the new patch:
diff --git a/regenc.c b/regenc.c
index b9b03b0..4067079 100644
--- a/regenc.c
+++ b/regenc.c
@@ -789,20 +789,20 @@ extern int
 onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
 {
   static const PosixBracketEntryType PBS[] = {
-    PosixBracketEntryInit("Alnum", ONIGENC_CTYPE_ALNUM),
-    PosixBracketEntryInit("Alpha", ONIGENC_CTYPE_ALPHA),
-    PosixBracketEntryInit("Blank", ONIGENC_CTYPE_BLANK),
-    PosixBracketEntryInit("Cntrl", ONIGENC_CTYPE_CNTRL),
-    PosixBracketEntryInit("Digit", ONIGENC_CTYPE_DIGIT),
-    PosixBracketEntryInit("Graph", ONIGENC_CTYPE_GRAPH),
-    PosixBracketEntryInit("Lower", ONIGENC_CTYPE_LOWER),
-    PosixBracketEntryInit("Print", ONIGENC_CTYPE_PRINT),
-    PosixBracketEntryInit("Punct", ONIGENC_CTYPE_PUNCT),
-    PosixBracketEntryInit("Space", ONIGENC_CTYPE_SPACE),
-    PosixBracketEntryInit("Upper", ONIGENC_CTYPE_UPPER),
-    PosixBracketEntryInit("XDigit", ONIGENC_CTYPE_XDIGIT),
-    PosixBracketEntryInit("ASCII", ONIGENC_CTYPE_ASCII),
-    PosixBracketEntryInit("Word", ONIGENC_CTYPE_WORD),
+    PosixBracketEntryInit("alnum", ONIGENC_CTYPE_ALNUM),
+    PosixBracketEntryInit("alpha", ONIGENC_CTYPE_ALPHA),
+    PosixBracketEntryInit("blank", ONIGENC_CTYPE_BLANK),
+    PosixBracketEntryInit("cntrl", ONIGENC_CTYPE_CNTRL),
+    PosixBracketEntryInit("digit", ONIGENC_CTYPE_DIGIT),
+    PosixBracketEntryInit("graph", ONIGENC_CTYPE_GRAPH),
+    PosixBracketEntryInit("lower", ONIGENC_CTYPE_LOWER),
+    PosixBracketEntryInit("print", ONIGENC_CTYPE_PRINT),
+    PosixBracketEntryInit("punct", ONIGENC_CTYPE_PUNCT),
+    PosixBracketEntryInit("space", ONIGENC_CTYPE_SPACE),
+    PosixBracketEntryInit("upper", ONIGENC_CTYPE_UPPER),
+    PosixBracketEntryInit("xdigit", ONIGENC_CTYPE_XDIGIT),
+    PosixBracketEntryInit("ascii", ONIGENC_CTYPE_ASCII),
+    PosixBracketEntryInit("word", ONIGENC_CTYPE_WORD),
   };
   const PosixBracketEntryType *pb, *pbe;
 
@@ -811,7 +811,7 @@ onigenc_minimum_property_name_to_ctype(OnigEncoding enc, UChar* p, UChar* end)
   len = onigenc_strlen(enc, p, end);
   for (pbe = (pb = PBS) + sizeof(PBS)/sizeof(PBS[0]); pb < pbe; ++pb) {
     if (len == pb->len &&
-        onigenc_with_ascii_strncmp(enc, p, end, pb->name, pb->len) == 0)
+        STRNCASECMP(p, pb->name, pb->len) == 0)
       return pb->ctype;
   }
 
@@ -897,7 +897,9 @@ onigenc_property_list_add_property(UChar* name, const OnigCodePoint* prop,
 {
 #define PROP_INIT_SIZE 16
 
-  int r;
+  int i, r;
+  size_t len;
+  char *propname;
 
   if (*psize <= *pnum) {
     int new_size = (*psize == 0 ? PROP_INIT_SIZE : *psize * 2);
@@ -912,8 +914,14 @@ onigenc_property_list_add_property(UChar* name, const OnigCodePoint* prop,
     if (ONIG_IS_NULL(*table)) return ONIGERR_MEMORY;
   }
 
+  len = strlen((char *)name);
+  propname = ALLOC_N(char, len+1);
+  for (i = 0; i < (int)len; i++) {
+    propname[i] = ONIGENC_ASCII_CODE_TO_LOWER_CASE(name[i]);
+  }
   *pnum = *pnum + 1;
-  onig_st_insert_strend(*table, name, name + strlen((char* )name),
+  onig_st_insert_strend(*table, propname, propname + len,
                         (hash_data_type )(*pnum + ONIGENC_MAX_STD_CTYPE));
   return 0;
 }
=end

Actions #12

Updated by runpaint (Run Paint Run Run) over 13 years ago

=begin

> Here is the new patch: [deletia]

Looks good to me. The policy, then, is that non-Unicode regexps are only case-sensitive for the non-POSIX-bracket properties.
=end

Actions #13

Updated by naruse (Yui NARUSE) over 13 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

=begin
This issue was solved with changeset r29732.
Run Paint, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

=end
