Bug #20315
openQuantifier expansion leads to different results in Regexp.
Description
Consider the following series of regular expression matches:
irb(main):001:0> /(|a)(|a)(|a)(|a)(|a)b/.match("aaaab")
=> #<MatchData "aaaab" 1:"" 2:"a" 3:"a" 4:"a" 5:"a">
irb(main):002:0> /(|a)(|a)(|a)(|a)b/.match("aaab")
=> #<MatchData "aaab" 1:"" 2:"a" 3:"a" 4:"a">
irb(main):003:0> /(|a)(|a)(|a)b/.match("aab")
=> #<MatchData "aab" 1:"" 2:"a" 3:"a">
irb(main):004:0> /(|a)(|a)b/.match("ab")
=> #<MatchData "ab" 1:"" 2:"a">
Let X^{N}
mean N
concatenated repetitions of X
. When matching the pattern /(|a)^{N}b/
against a^{N-1}b
, the first group will match the empty string and the last N-1
groups will match a
.
Now, let's look at this series of similar expressions, in which (|a)^{N}
is replaced with (|a){N}
, i.e. a counted quantifier.
irb(main):001:0> /(|a){5}b/.match("aaaab")
=> #<MatchData "aaaab" 1:"">
irb(main):002:0> /(|a){4}b/.match("aaab")
=> #<MatchData "aaab" 1:"">
irb(main):003:0> /(|a){3}b/.match("aab")
=> #<MatchData "aab" 1:"">
irb(main):004:0> /(|a){2}b/.match("ab")
=> #<MatchData "ab" 1:"a">
When matching the pattern /(|a){N}b/
against a^{N-1}b
, the first N-1
iterations will match a
and the N
-th iteration will match the empty string (compare this with the behavior of the first series of expressions). However, something strange happens when N
is 2. We end up getting a result which is not consistent with this series, but looks like a result that belongs to the first series discussed above.
This is due to quantifier expansion done by the regexp compiler (see usages of QUANTIFIER_EXPAND_LIMIT_SIZE
in regcomp.c
). This is an optimization that tries to remove the overhead of managing a counted repetition during regex execution at the cost of increasing the size of the compiled regex bytecode. The source of the inconsistency is caused by the fact that this optimization can actually change the semantics of the regular expression, because as we have seen above, X^{N}
(X
repeated N
times) can have different semantics from X{N}
(a single X
with an {N}
quantifier).
Compilation and execution of /(|a){3}b/.match("aab")
with ONIG_DEBUG_PARSE_TREE
, ONIG_DEBUG_COMPILE
and ONIG_DEBUG_MATCH
:
PATTERN: /(|a){3}b/ (US-ASCII)
<list:556161bb11f0>
<quantifier:556161bb1230>{3,3}
<enclose:556161bb1330> memory:1
<alt:556161bb11b0>
<string:556161bb12f0>
<string:556161bb12b0>a
<string:556161bb13b0>b
optimize: EXACT
anchor: []
sub anchor: []
exact: [b]: length: 1
code length: 37
0:[repeat:0:27] 7:[null-check-start:0] 10:[mem-start:1] 13:[push:(+5)] 18:[jump:(+2)]
23:[exact1:a] 25:[mem-end:1] 28:[null-check-end-memst:0] 31:[repeat-inc:0] 34:[exact1:b]
36:[end]
match_at: str: 140607416406704 (0x7fe1b71b92b0), end: 140607416406706 (0x7fe1b71b92b2), start: 140607416406704 (0x7fe1b71b92b0), sprev: 0 ((nil))
size: 2, start offset: 0
ofs> str stk:type addr:opcode
0> "ab" 0:Alt 0:[repeat:0:27]
0> "ab" 1:Rep 7:[null-check-start:0]
0> "ab" 2:NulChS 10:[mem-start:1]
0> "ab" 2:NulChS 13:[push:(+5)]
0> "ab" 3:Alt 18:[jump:(+2)]
0> "ab" 3:Alt 25:[mem-end:1]
0> "ab" 3:Alt 28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip id:0, s:140607416406704 (0x7fe1b71b92b0)
0> "ab" 3:Alt 34:[exact1:b]
0> "ab" 2:NulChS 23:[exact1:a]
1> "b" 2:NulChS 25:[mem-end:1]
1> "b" 2:NulChS 28:[null-check-end-memst:0]
1> "b" 2:NulChS 31:[repeat-inc:0]
1> "b" 3:RepInc 7:[null-check-start:0]
1> "b" 4:NulChS 10:[mem-start:1]
1> "b" 4:NulChS 13:[push:(+5)]
1> "b" 5:Alt 18:[jump:(+2)]
1> "b" 5:Alt 25:[mem-end:1]
1> "b" 5:Alt 28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip id:0, s:140607416406705 (0x7fe1b71b92b1)
1> "b" 5:Alt 34:[exact1:b]
2> "" 5:Alt 36:[end]
Compilation and execution of /(|a){2}b/.match("ab")
with ONIG_DEBUG_PARSE_TREE
, ONIG_DEBUG_COMPILE
and ONIG_DEBUG_MATCH
:
PATTERN: /(|a){2}b/ (US-ASCII)
<list:55bbc826c1f0>
<quantifier:55bbc826c230>{2,2}
<enclose:55bbc826c330> memory:1
<alt:55bbc826c1b0>
<string:55bbc826c2f0>
<string:55bbc826c2b0>a
<string:55bbc826c3b0>b
optimize: EXACT
anchor: []
sub anchor: []
exact: [b]: length: 1
code length: 39
0:[mem-start:1] 3:[push:(+5)] 8:[jump:(+2)] 13:[exact1:a] 15:[mem-end:1]
18:[mem-start:1] 21:[push:(+5)] 26:[jump:(+2)] 31:[exact1:a] 33:[mem-end:1]
36:[exact1:b] 38:[end]
match_at: str: 140139875963504 (0x7f74db869270), end: 140139875963506 (0x7f74db869272), start: 140139875963504 (0x7f74db869270), sprev: 0 ((nil))
size: 2, start offset: 0
ofs> str stk:type addr:opcode
0> "ab" 0:Alt 0:[mem-start:1]
0> "ab" 0:Alt 3:[push:(+5)]
0> "ab" 1:Alt 8:[jump:(+2)]
0> "ab" 1:Alt 15:[mem-end:1]
0> "ab" 1:Alt 18:[mem-start:1]
0> "ab" 1:Alt 21:[push:(+5)]
0> "ab" 2:Alt 26:[jump:(+2)]
0> "ab" 2:Alt 33:[mem-end:1]
0> "ab" 2:Alt 36:[exact1:b]
0> "ab" 1:Alt 31:[exact1:a]
1> "b" 1:Alt 33:[mem-end:1]
1> "b" 1:Alt 36:[exact1:b]
2> "" 1:Alt 38:[end]
No data to display