Bug #13671
Regexp with lookbehind and case-insensitivity raises RegexpError only on strings with certain characters
Description
Here is a test program:
def test(description)
  begin
    yield
    puts "#{description} is OK"
  rescue RegexpError
    puts "#{description} raises RegexpError"
  end
end
test("ass, case-insensitive, special") { /(?<!ass)/i =~ '✨' }
test("bss, case-insensitive, special") { /(?<!bss)/i =~ '✨' }
test("as, case-insensitive, special") { /(?<!as)/i =~ '✨' }
test("ss, case-insensitive, special") { /(?<!ss)/i =~ '✨' }
test("ass, case-sensitive, special") { /(?<!ass)/ =~ '✨' }
test("ass, case-insensitive, regular") { /(?<!ass)/i =~ 'x' }
Running the test program with Ruby 2.4.1 (macOS) gives
ass, case-insensitive, special raises RegexpError
bss, case-insensitive, special raises RegexpError
as, case-insensitive, special is OK
ss, case-insensitive, special is OK
ass, case-sensitive, special is OK
ass, case-insensitive, regular is OK
The RegexpError is "invalid pattern in look-behind: /(?<!ass)/i (RegexpError)"
Side note: in the real code in which I found this error I was able to work around the error by using (?i) after the lookbehind instead of //i.
Running the test program with Ruby 2.3.4 does not report any RegexpErrors.
I think this is a regression, although I might be wrong and it might be saving me from an incorrect result with certain strings.
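The side note's workaround can be sketched as follows. This is only an illustration under assumptions: the trailing "foo" and the helper name try_match are made up for the example and are not from the real code. Note that moving (?i) after the lookbehind leaves the lookbehind itself case-sensitive, so the two patterns are not strictly equivalent.

```ruby
# Hypothetical patterns; "foo" stands in for whatever follows the lookbehind.
FLAG_ON_WHOLE = "(?<!ass)foo"      # intended to be compiled with //i
INLINE_AFTER  = "(?<!ass)(?i)foo"  # no //i; (?i) only applies to "foo"

# Returns the match position, nil, or an error description.
def try_match(source, options, text)
  Regexp.new(source, options) =~ text
rescue RegexpError => e
  "RegexpError: #{e.message}"
end

text = "\u2728 FOO" # "\u2728" is the sparkles character from the test program

# On affected Ruby versions this reports a RegexpError; elsewhere a position.
puts try_match(FLAG_ON_WHOLE, Regexp::IGNORECASE, text).inspect

# The inline form avoids the case-insensitive lookbehind entirely.
puts try_match(INLINE_AFTER, 0, text).inspect

# Caveat: the lookbehind is now case-sensitive, so "ASS" no longer blocks a match.
puts try_match(INLINE_AFTER, 0, "assfoo").inspect
puts try_match(INLINE_AFTER, 0, "ASSfoo").inspect
```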
Updated by Hanmac (Hans Mackowiak) over 7 years ago
I did some checks on my Windows system to see how deep the problem goes.
I used "ä" as the test string.
The same problem happens when you use the match method:
/(?<!ass)/i.match('ä')
It also happens for
Regexp.union(/(?<!ass)/i, /ä/)
But I still don't understand why it crashes with ass, while ss works.
It might have something to do with how regexps are stored internally.
Updated by naruse (Yui NARUSE) over 7 years ago
I created a ticket in upstream: https://github.com/k-takata/Onigmo/issues/92
Updated by gotoken (Kentaro Goto) over 6 years ago
I encountered a non-ss case. Is this the same problem?
% ruby -ve '"".match(/(?<=ast)/ui)'
ruby 2.6.0dev (2018-08-27 trunk 64549) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<=ast)/i
It was reproduced in version 2.4 and 2.5.
#14838 seems to be a duplicate.
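To see which letter pairs are affected on a given Ruby build, a small probe along these lines may help. It is a sketch, not a definitive test: the helper name failing_pairs, the fixed prefix "a", and the choice of "\u2728" as the non-ASCII subject string are assumptions made for illustration. The result depends on the Ruby version and may be empty on a fixed build.

```ruby
LETTERS = ("a".."z").to_a

# Returns every two-letter combination that raises RegexpError when used
# after `prefix` in a case-insensitive negative lookbehind.
def failing_pairs(prefix: "a", text: "\u2728")
  LETTERS.product(LETTERS).map(&:join).select do |pair|
    begin
      # Matching against a non-ASCII string is what triggered the error
      # in the original report, so the match is part of the probe.
      Regexp.new("(?<!#{prefix}#{pair})", Regexp::IGNORECASE) =~ text
      false
    rescue RegexpError
      true
    end
  end
end

p failing_pairs
```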
Updated by znz (Kazuhiro NISHIYAMA) over 6 years ago
You can use (?:s) instead of s as a workaround.
$ ruby -ve '/(?<=ast)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: invalid pattern in look-behind: /(?<=ast)/i
-e:1: warning: possibly useless use of a literal in void context
$ ruby -ve '/(?<=a(?:s)t)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: warning: possibly useless use of a literal in void context
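The same comparison can be run at match time instead of parse time. This is a minimal sketch: the helper name lookbehind_status, the trailing "x", and the sample string are assumptions added for illustration. The unmodified pattern is expected to raise only on affected Ruby versions.

```ruby
# Returns the match position, nil, or :regexp_error if the pattern is rejected.
def lookbehind_status(source, text)
  Regexp.new(source, Regexp::IGNORECASE) =~ text
rescue RegexpError
  :regexp_error
end

# A non-ASCII character in the subject string, as in the original report.
sample = "ASTx \u00E4"

# Original form: raises on affected versions.
p lookbehind_status("(?<=ast)x", sample)

# Workaround: wrapping s in a non-capturing group keeps "s" and "t" apart.
p lookbehind_status("(?<=a(?:s)t)x", sample)
```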
Updated by znz (Kazuhiro NISHIYAMA) over 6 years ago
- Related to Bug #14838: RegexpError with double "s" in look-behind assertion in case-insensitive unicode regexp added
Updated by gotoken (Kentaro Goto) over 6 years ago
Thanks znz. The workaround is helpful, and now I understand what happened.
https://github.com/k-takata/Onigmo/issues/92#issuecomment-373981492 shows how some combinations of letters are variable length.
For example, "ss" and "st" are mapped to "ß" ("\u00DF") and "ﬆ" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
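The length mismatch behind the "ss" and "st" cases can be observed with Ruby's own String case methods (Ruby 2.4 or later). This sketch only shows that one character corresponds to two under case folding. It does not show how Onigmo builds its tables, and the names SPECIAL_PAIRS and fold_lengths are made up for the example.

```ruby
# Two-letter sequences and the single characters they correspond to.
SPECIAL_PAIRS = { "ss" => "\u00DF", "st" => "\uFB06" }

# Returns [length of the single character, length of its case-folded form].
def fold_lengths(single)
  folded = single.downcase(:fold)
  [single.length, folded.length]
end

SPECIAL_PAIRS.each do |letters, single|
  before, after = fold_lengths(single)
  puts "#{single.inspect} (#{before} char) folds to " \
       "#{single.downcase(:fold).inspect} (#{after} chars); pair: #{letters.inspect}"
end
```

Since a case-insensitive "ss" could stand for either two characters or one, a lookbehind containing it no longer has a single fixed width.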
Updated by shyouhei (Shyouhei Urabe) over 6 years ago
gotoken (Kentaro Goto) wrote:
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
I know how you feel. Too bad we are just doing what Unicode specifies to do.
Updated by gotoken (Kentaro Goto) over 6 years ago
Thanks, shyouhei, for pointing that out.
I imagine another Regexp option, say //I, which is almost the same as //i except that it never applies the SpecialCasing mapping.
This change does extend Unicode matching but does not introduce incompatibilities, IMHO.
One difficulty is that the implementation lives in the upstream library and CRuby is just a user.
Updated by duerst (Martin Dürst) over 6 years ago
gotoken (Kentaro Goto) wrote:
For example, "ss" and "st" are mapped to "ß" ("\u00DF") and "ﬆ" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
I still have to verify this, but currently I strongly suspect that the problem is NOT in SpecialCasing, but in how Onigmo (/Oniguruma?) implements it.
Updated by mauromorales (Mauro Morales) over 4 years ago
FYI: the issue has been addressed in Onigmo https://github.com/k-takata/Onigmo/pull/116 and the fix has already been released in version 6.2.0. I tried it by applying the changes to Ruby 2.6.6, and it works as expected.
Updated by mauromorales (Mauro Morales) almost 4 years ago
Unfortunately, the problem persists in Ruby 2.7.2 and 3.0.0
Updated by Eregon (Benoit Daloze) over 3 years ago
It seems ruby master as of today still uses Onigmo 6.1.3, but https://github.com/k-takata/Onigmo/releases/tag/Onigmo-6.2.0 is needed to fix this bug.
Who can update Onigmo to the latest version?
Updated by duerst (Martin Dürst) over 3 years ago
Eregon (Benoit Daloze) wrote in #note-12:
It seems ruby master as of today still uses Onigmo 6.1.3, but https://github.com/k-takata/Onigmo/releases/tag/Onigmo-6.2.0 is needed to fix this bug.
Who can update Onigmo to the latest version?
If nobody else wants to urgently do this, feel free to assign this to me.
Updated by mame (Yusuke Endoh) about 3 years ago
- Status changed from Open to Assigned
- Assignee set to duerst (Martin Dürst)
Updated by Eregon (Benoit Daloze) almost 2 years ago
@duerst (Martin Dürst) Could you take a look at this? It's not fixed yet in 3.2.0.