As I understand it, the idea behind the null check is for the regex matcher to be able to identify unproductive branches in the regex execution, branches which are guaranteed to never terminate. When executing the expression X*
, where X
is some subexpression, the greedy *
quantifier will always prefer to enter and execute X
and will only stop when X
can no longer be matched. The null check lets the regex matcher know that no part of the execution state has changed since the last iteration of the loop. At that point, the regex matcher knows that continuing the loop will never produce a match and so it can afford to terminate the loop. This is because the state of the matcher is now at a state (matcher position, matched capture groups...) such that no other alternative with higher priority could produce a different state. In other words, this is the highest-priority state which could still yield a match.
We can look at a simplified example from the spec:
/(a|\2b|())*/.match("ab").to_a.should == ["ab", "", ""]
and we can look at traces of 3 possible executions of this regular expression:
A: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 2 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ...
B: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 2 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> EXIT *
C: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> EXIT *
Trace A
corresponds to the endless "naive" traversal with no null check, where we always choose the highest-priority option (alternatives are tried from left to right, greedy quantifiers always try to match). In trace B
, we identify the endless loop of ENTER * LOOP -> ALTERNATIVE 3 -> ...
and use the null check that's implemented currently in Ruby to terminate the loop after alternative 3 is taken 2 times in a row (the first time it updates capture group 2, the second time it doesn't update any state). Note that trace B
has lower priority than trace A
because at one point, it chooses EXIT *
instead of ENTER *
. However, trace A
never finishes the match and even though there exists an infinity of traces that have higher priority than B
(those that would decide to terminate the loop in some later iteration than B
), because of the null check, we know that they are all observably equivalent to B
and so the regex matcher can proceed with B
as the highest-priority match. In contrast, C
represents the trace that would be produced by a regex implementation whose null check would ignore updates to the state of capture groups and would terminate the loop after the first iteration that matches the empty string. However, the result of C
is not the highest priority match, because there exists B
, which has higher priority, since it chose to execute the body of a greedy quantifier instead of skipping it. I would argue that since the semantics of regular expression search is to return the highest priority match, where priority is determined by the ordering of alternatives and the greediness of quantifiers, the correct answer is the result of executing B
.
Now for the example from your issue, this is clearly a bug and not intended behavior of the null check. The null check should never cause an expression not to match, since it should only prune branches which are known to not terminate.
This expression should evaluate to 0
:
/(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC"
and this expression should terminate:
/((?=(a)))*/ =~ "a"
The fact that neither of these is true means that there is a bug (or two bugs) in Ruby's implementation of regular expressions, quite possibly in the null check. However, it shouldn't be the reason to strengthen the null check so that it starts killing branches that could produce a valid match.
From my experience working on regular expression implementations, Ruby's null check behavior is not uncommon. Python regular expression's null checks take into account capture groups too, as do the null checks in Oracle Database regular expressions.
Correction: Python and Oracle Database regular expressions don't check for capture group updates in null checks. Unlike Ruby, they don't allow for forward back-references and, like Ruby but unlike JavaScript, they keep the state updates from the last iteration of a loop that didn't pass the null check, so it was hard to craft regular expressions that would show behavior different to that of Ruby.