Bug #20316
openRegexp quantifier behavior changes in different syntactic context.
Description
In the example below, adding a vertical bar to the end of a regular expression changes what is being matched by the preceding regular expression.
irb(main):001:0> /(|a){3}b/.match("aab")
=> #<MatchData "aab" 1:"">
irb(main):002:0> /(|a){3}b|/.match("aab")
=> #<MatchData "aab" 1:"a">
This is because the {3}
quantifier is compiled into a repeat
loop which uses the OP_NULL_CHECK_END_MEMST
operation to perform a capture-group sensitive null-check after every loop iteration. The logic behind the OP_NULL_CHECK_END_MEMST
operation is implemented using the STACK_NULL_CHECK_MEMST
macro in regexec.c
. The STACK_NULL_CHECK_MEMST
macro checks whether any capture groups have been updated inside the last loop iteration and it does so by searching the stack for STK_MEM_START
entries. However, such entries are not used for all capture groups. They are only used by capture groups which are listed in bt_mem_start
. A capture group is marked as needing such bookkeeping only when it is either referenced by a back-reference or it appears in certain syntactic contexts (see e.g. around line 4096 of regcomp.c
). This looks like an optimization that tries to avoid polluting the stack with STK_MEM_START
entries in cases in which they are not needed. However, in this case, not putting STK_MEM_START
entries on the stack leads to a different match result.
In the example above, by adding a vertical bar to the end of the regexp, we have placed the group (|a)
inside an alternation. This means that a different operation for tracking the state of the capture group is emitted in the compiled bytecode and this leads to a different result. Incidentally, this result should be the correct result, as the null-check ends up respecting the state of capture groups, as it tries to do in Ruby.
This is the compilation and execution of /(|a){3}b/.match("aab")
with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that mem-start:1
is used for tracking the state of capture group 1.
PATTERN: /(|a){3}b/ (US-ASCII)
<list:55db929991f0>
<quantifier:55db92999230>{3,3}
<enclose:55db92999330> memory:1
<alt:55db929991b0>
<string:55db929992f0>
<string:55db929992b0>a
<string:55db929993b0>b
optimize: EXACT
anchor: []
sub anchor: []
exact: [b]: length: 1
code length: 37
0:[repeat:0:27] 7:[null-check-start:0] 10:[mem-start:1] 13:[push:(+5)] 18:[jump:(+2)]
23:[exact1:a] 25:[mem-end:1] 28:[null-check-end-memst:0] 31:[repeat-inc:0] 34:[exact1:b]
36:[end]
match_at: str: 139809364021904 (0x7f27e77a9290), end: 139809364021906 (0x7f27e77a9292), start: 139809364021904 (0x7f27e77a9290), sprev: 0 ((nil))
size: 2, start offset: 0
ofs> str stk:type addr:opcode
0> "ab" 0:Alt 0:[repeat:0:27]
0> "ab" 1:Rep 7:[null-check-start:0]
0> "ab" 2:NulChS 10:[mem-start:1]
0> "ab" 2:NulChS 13:[push:(+5)]
0> "ab" 3:Alt 18:[jump:(+2)]
0> "ab" 3:Alt 25:[mem-end:1]
0> "ab" 3:Alt 28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip id:0, s:139809364021904 (0x7f27e77a9290)
0> "ab" 3:Alt 34:[exact1:b]
0> "ab" 2:NulChS 23:[exact1:a]
1> "b" 2:NulChS 25:[mem-end:1]
1> "b" 2:NulChS 28:[null-check-end-memst:0]
1> "b" 2:NulChS 31:[repeat-inc:0]
1> "b" 3:RepInc 7:[null-check-start:0]
1> "b" 4:NulChS 10:[mem-start:1]
1> "b" 4:NulChS 13:[push:(+5)]
1> "b" 5:Alt 18:[jump:(+2)]
1> "b" 5:Alt 25:[mem-end:1]
1> "b" 5:Alt 28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip id:0, s:139809364021905 (0x7f27e77a9291)
1> "b" 5:Alt 34:[exact1:b]
2> "" 5:Alt 36:[end]
This is the compilation and execution of /(|a){3}b|/.match("aab")
(where the group (|a)
appears inside an alternation) with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that mem-start-push:1
is used for tracking the state of capture group 1, not mem-start:1
.
PATTERN: /(|a){3}b|/ (US-ASCII)
<alt:55d9a91cc3c0>
<list:55d9a91cc200>
<quantifier:55d9a91cc240>{3,3}
<enclose:55d9a91cc340> memory:1
<alt:55d9a91cc1c0>
<string:55d9a91cc300>
<string:55d9a91cc2c0>a
<string:55d9a91cc480>b
<string:55d9a91cc400>
optimize: NONE
anchor: []
code length: 47
0:[push:(+41)] 5:[repeat:0:27] 12:[null-check-start:0] 15:[mem-start-push:1] 18:[push:(+5)]
23:[jump:(+2)] 28:[exact1:a] 30:[mem-end:1] 33:[null-check-end-memst:0] 36:[repeat-inc:0]
39:[exact1:b] 41:[jump:(+0)] 46:[end]
match_at: str: 140072854131304 (0x7f6540b69268), end: 140072854131306 (0x7f6540b6926a), start: 140072854131304 (0x7f6540b69268), sprev: 0 ((nil))
size: 2, start offset: 0
ofs> str stk:type addr:opcode
0> "ab" 0:Alt 0:[push:(+41)]
0> "ab" 1:Alt 5:[repeat:0:27]
0> "ab" 2:Rep 12:[null-check-start:0]
0> "ab" 3:NulChS 15:[mem-start-push:1]
0> "ab" 4:MemS 18:[push:(+5)]
0> "ab" 5:Alt 23:[jump:(+2)]
0> "ab" 5:Alt 30:[mem-end:1]
0> "ab" 5:Alt 33:[null-check-end-memst:0]
0> "ab" 5:Alt 36:[repeat-inc:0]
0> "ab" 6:RepInc 12:[null-check-start:0]
0> "ab" 7:NulChS 15:[mem-start-push:1]
0> "ab" 8:MemS 18:[push:(+5)]
0> "ab" 9:Alt 23:[jump:(+2)]
0> "ab" 9:Alt 30:[mem-end:1]
0> "ab" 9:Alt 33:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip id:0, s:140072854131304 (0x7f6540b69268)
0> "ab" 9:Alt 39:[exact1:b]
0> "ab" 8:MemS 28:[exact1:a]
1> "b" 8:MemS 30:[mem-end:1]
1> "b" 8:MemS 33:[null-check-end-memst:0]
1> "b" 8:MemS 36:[repeat-inc:0]
1> "b" 9:RepInc 12:[null-check-start:0]
1> "b" 10:NulChS 15:[mem-start-push:1]
1> "b" 11:MemS 18:[push:(+5)]
1> "b" 12:Alt 23:[jump:(+2)]
1> "b" 12:Alt 30:[mem-end:1]
1> "b" 12:Alt 33:[null-check-end-memst:0]
1> "b" 12:Alt 36:[repeat-inc:0]
1> "b" 13:RepInc 39:[exact1:b]
2> "" 13:RepInc 41:[jump:(+0)]
2> "" 13:RepInc 46:[end]
No data to display