Project

General

Profile

Actions

Bug #13671

open

Regexp with lookbehind and case-insensitivity raises RegexpError only on strings with certain characters

Added by dschweisguth (Dave Schweisguth) over 7 years ago. Updated almost 2 years ago.

Status:
Assigned
Target version:
-
[ruby-core:81742]

Description

Here is a test program:

def test(description)
  begin
    yield
    puts "#{description} is OK"
  rescue RegexpError
    puts "#{description} raises RegexpError"
  end
end

test("ass, case-insensitive, special") { /(?<!ass)/i =~ '✨' }
test("bss, case-insensitive, special") { /(?<!bss)/i =~ '✨' }
test("as,  case-insensitive, special") { /(?<!as)/i  =~ '✨' }
test("ss,  case-insensitive, special") { /(?<!ss)/i  =~ '✨' }
test("ass, case-sensitive,   special") { /(?<!ass)/  =~ '✨' }
test("ass, case-insensitive, regular") { /(?<!ass)/i =~ 'x' }

Running the test program with Ruby 2.4.1 (macOS) gives

ass, case-insensitive, special raises RegexpError
bss, case-insensitive, special raises RegexpError
as,  case-insensitive, special is OK
ss,  case-insensitive, special is OK
ass, case-sensitive,   special is OK
ass, case-insensitive, regular is OK

The RegexpError is "invalid pattern in look-behind: /(?<!ass)/i (RegexpError)"

Side note: in the real code in which I found this error I was able to work around the error by using (?i) after the lookbehind instead of //i.

Running the test program with Ruby 2.3.4 does not report any RegexpErrors.

I think this is a regression, although I might be wrong and it might be saving me from an incorrect result with certain strings.


Files

test.rb (531 Bytes) test.rb Same test program as in the description dschweisguth (Dave Schweisguth), 06/22/2017 11:18 PM

Related issues 1 (1 open0 closed)

Related to Ruby master - Bug #14838: RegexpError with double "s" in look-behind assertion in case-insensitive unicode regexpOpenActions

Updated by Hanmac (Hans Mackowiak) over 7 years ago

did some checks on my windows system to check how deep the problem is.
i used "ä" as variable.

the same problem happens when you try to use match function too:
/(?<!ass)/i.match('ä')
also happen for
Regexp.union(/(?<!ass)/i, /ä/)

but i still don't understand why it does crash with ass, while ss works.
might have something todo how regexp are stored internal

Updated by gotoken (Kentaro Goto) over 6 years ago

I encountered a non ss case. Is this a same problem?

% ruby -ve '"".match(/(?<=ast)/ui)'
ruby 2.6.0dev (2018-08-27 trunk 64549) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<=ast)/i

It was reproduced in version 2.4 and 2.5.
#14838 seems to be duplicate.

Updated by znz (Kazuhiro NISHIYAMA) over 6 years ago

You can use (?:s) instead of s for workaround.

$ ruby -ve '/(?<=ast)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: invalid pattern in look-behind: /(?<=ast)/i
-e:1: warning: possibly useless use of a literal in void context
$ ruby -ve '/(?<=a(?:s)t)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: warning: possibly useless use of a literal in void context
Actions #5

Updated by znz (Kazuhiro NISHIYAMA) over 6 years ago

  • Related to Bug #14838: RegexpError with double "s" in look-behind assertion in case-insensitive unicode regexp added

Updated by gotoken (Kentaro Goto) over 6 years ago

Thanks znz. The workaround is helpful. And I understood what was happened.
https://github.com/k-takata/Onigmo/issues/92#issuecomment-373981492 shows how some combinations of letters are variable length.

For example, "ss" and "st" are mapped "ß" ("\u00DF") and "st" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

By the way, this expansion by //i option looks over kill for me.
I wish case sensitivity and SpecialCasing mapping were separated...

Updated by shyouhei (Shyouhei Urabe) over 6 years ago

gotoken (Kentaro Goto) wrote:

By the way, this expansion by //i option looks over kill for me.
I wish case sensitivity and SpecialCasing mapping were separated...

I know how you feel. Too bad we are just doing what Unicode specifies to do.

See also http://unicode.org/faq/casemap_charprop.html#11

Updated by gotoken (Kentaro Goto) over 6 years ago

Thanks shyouhei for your pointing out.

I imagine another Rexexp option, say //I, which is almost the same as //i except for never-applying SpecialCasing mapping.
This change extends Unicode matching indeed but does not introduce incompatibilities, IMHO.
A difficulty is the implementation is on the upstream library and cruby is just a user.

Updated by duerst (Martin Dürst) over 6 years ago

gotoken (Kentaro Goto) wrote:

For example, "ss" and "st" are mapped "ß" ("\u00DF") and "st" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

By the way, this expansion by //i option looks over kill for me.
I wish case sensitivity and SpecialCasing mapping were separated...

I still have to verify this, but currently I strongly suspect that the problem is NOT in SpecialCasing, but in how Onigmo (/Oniguruma?) implement it.

Updated by mauromorales (Mauro Morales) over 4 years ago

FYI The issue has been addressed in Onigmo https://github.com/k-takata/Onigmo/pull/116 and has already been released in version 6.2.0. I tried it by applying the changes using Ruby 2.6.6 and it works as expected.

Updated by mauromorales (Mauro Morales) almost 4 years ago

Unfortunately, the problem persists in Ruby 2.7.2 and 3.0.0

Updated by Eregon (Benoit Daloze) over 3 years ago

It seems ruby master as of today still uses Onigmo 6.1.3, but https://github.com/k-takata/Onigmo/releases/tag/Onigmo-6.2.0 is needed to fix this bug.

Who can update Onigmo to latest?

Updated by duerst (Martin Dürst) over 3 years ago

Eregon (Benoit Daloze) wrote in #note-12:

It seems ruby master as of today still uses Onigmo 6.1.3, but https://github.com/k-takata/Onigmo/releases/tag/Onigmo-6.2.0 is needed to fix this bug.

Who can update Onigmo to latest?

If nobody else wants to urgently do this, feel free to assign this to me.

Updated by mame (Yusuke Endoh) about 3 years ago

  • Status changed from Open to Assigned
  • Assignee set to duerst (Martin Dürst)

Updated by Eregon (Benoit Daloze) almost 2 years ago

@duerst (Martin Dürst) Could you take a look at this? It's not fixed yet in 3.2.0.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0