Bug #18013
Open
Unexpected results when mixing negated character classes and case-folding
Description
irb(main):001:0> /[^a-c]/i.match("A")
=> nil
irb(main):002:0> /[[^a-c]]/i.match("A")
=> #<MatchData "A">
The two regular expressions above match different strings because their character classes denote different sets of characters. To make /[^a-c]/i produce correct results, Oniguruma provided a fix that is still easy to find in the code, where it sits behind an always-on preprocessor flag (CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS, https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5528). The idea of the fix is to case-fold a character class first and only then apply the negation (essentially moving the case-fold operator inside the negation).
In the case of our first regular expression, [a-c] is case-folded into [a-cA-C], and that is then inverted into [^a-cA-C], which is the expected result. However, this case-folding logic is currently applied only to the top-most character class, so if we use a nested negated character class, the order of the operations is switched.
With our second regular expression, [a-c] is first negated to yield [^a-c], which is then case-folded into ., the set of all characters (since [^a-c] contains A-C, which case-fold into a-c).
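The two orders of operations can be modelled with plain sets, independently of the regex engine. This is only a sketch: the helper names case_fold and negate are hypothetical, the universe is restricted to printable ASCII, and case-folding is approximated with String#downcase and String#upcase rather than Oniguruma's folding tables.

```ruby
require "set"

# Assumption: a small universe (printable ASCII) stands in for "all characters".
UNIVERSE = (0x20..0x7E).map(&:chr).to_set

# Hypothetical helper: close a set under simple ASCII case-folding.
def case_fold(set)
  set.flat_map { |c| [c, c.downcase, c.upcase] }.to_set & UNIVERSE
end

# Hypothetical helper: complement of a set within the universe.
def negate(set)
  UNIVERSE - set
end

base = ("a".."c").to_set

# Order described for the top-level class /[^a-c]/i: fold, then negate.
fold_then_negate = negate(case_fold(base))

# Order described for the nested class /[[^a-c]]/i: negate, then fold.
negate_then_fold = case_fold(negate(base))

puts fold_then_negate.include?("A")   # "A" is excluded
puts negate_then_fold.include?("A")   # "A" is included
puts negate_then_fold == UNIVERSE     # folding the complement covers everything
```

In this model, folding first excludes both cases of a-c, while negating first leaves A-C in the set, and folding then pulls a-c back in.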
A way to fix this would be to apply case-folding for nested character classes as well, so that the nested character classes behave the same as the top-most character class. Then, we would get the same semantics for both expressions.
Updated by jirkamarsik (Jirka Marsik) over 3 years ago
This is a simpler reproducer.
irb(main):003:0> /[^a]/i.match("a")
=> nil
irb(main):004:0> /[[^a]]/i.match("a")
=> #<MatchData "a">
Updated by duerst (Martin Dürst) over 3 years ago
Just a question: What's the purpose of nested character classes?
I didn't even know that there was such a thing as nested character classes.
Depending on the purpose of nested character classes, the right way to handle things may differ. This is just a wild guess, but if there's no difference between usual character classes and nested character classes, then there isn't really a purpose for nested character classes.
Updated by jirkamarsik (Jirka Marsik) over 3 years ago
duerst (Martin Dürst) wrote in #note-2:
Just a question: What's the purpose of nested character classes?
They are useful in combination with the set intersection operator &&. They let you, e.g., exclude characters from some character set, as in the example below, which matches all lowercase letters except for the English vowels aeiou.
irb(main):001:0> /[\p{Ll}&&[^aeiou]]/u.match("a")
=> nil
irb(main):002:0> /[\p{Ll}&&[^aeiou]]/u.match("b")
=> #<MatchData "b">
irb(main):003:0> /[\p{Ll}&&[^aeiou]]/u.match(".")
=> nil
irb(main):004:0> /[\p{Ll}&&[^aeiou]]/u.match("α")
=> #<MatchData "α">
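The same pattern can be wrapped in a small helper to pull every matching character out of a string. This is a minimal sketch; the constant LOWER_NON_VOWEL and the method lowercase_non_vowels are hypothetical names chosen for illustration. No case-folding flag is used here, so the behaviour discussed in this report does not come into play.

```ruby
# Lowercase letters intersected with "anything except aeiou",
# using the nested negated class from the examples above.
LOWER_NON_VOWEL = /[\p{Ll}&&[^aeiou]]/u

# Hypothetical helper: return all lowercase non-vowel letters in text.
def lowercase_non_vowels(text)
  text.scan(LOWER_NON_VOWEL)
end

p lowercase_non_vowels("Hello, world")
```

The uppercase H, the vowels, and the punctuation are skipped because they fall outside one of the two intersected sets.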