Bug #18012
Case-insensitive character classes can only match multiple code points when top-level character class is not negated
Status: Open
Description
Some Unicode characters case-fold to strings of multiple code points. For example, the ligature \ufb00 can match the string ff.
irb(main):001:0> /\A[\ufb00]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):002:0> /\A[\ufb00]\z/i.match("ff")
=> #<MatchData "ff">
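The fold behind these two matches can be inspected directly. The following is a minimal sketch, assuming a Ruby version that supports `String#downcase(:fold)`; the helper name `fold_codepoints` is made up for illustration and is not part of the report.

```ruby
# Hypothetical helper: returns the code points of a string after
# Unicode case folding.
def fold_codepoints(str)
  str.downcase(:fold).codepoints
end

ligature = "\ufb00"

# One code point before folding, two after.
puts ligature.codepoints.length
puts fold_codepoints(ligature).map { |cp| format("U+%04X", cp) }.join(" ")

# The ligature and the two-letter string share the same folded form.
puts fold_codepoints(ligature) == fold_codepoints("ff")
```

Because both inputs fold to the same two code points, a case-insensitive character class containing the ligature has to accept a two-character string, which is the behavior the rest of this report examines.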
As expected, when we negate this character class, we can no longer match either the ligature character \ufb00 or the string ff.
irb(main):003:0> /\A[^\ufb00]\z/i.match("\ufb00")
=> nil
irb(main):004:0> /\A[^\ufb00]\z/i.match("ff")
=> nil
Then, when we add a second negation, the \ufb00 ligature reappears in the character set, but the string ff is no longer accepted.
irb(main):005:0> /\A[^[^\ufb00]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):006:0> /\A[^[^\ufb00]]\z/i.match("ff")
=> nil
This reveals that the multi-code-point matches in character classes are blocked by negation. However, this is implemented only by checking whether the topmost character class is negated. If we wrap the character class in another set of brackets, the semantics change.
irb(main):007:0> /\A[[^[^\ufb00]]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):008:0> /\A[[^[^\ufb00]]]\z/i.match("ff")
=> #<MatchData "ff">
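For convenience, the eight checks above can be collected into one script. This is only a sketch: the constant and method names (`PATTERNS`, `INPUTS`, `match_table`) are made up for illustration, and no expected output is shown because the results depend on the Ruby version being tested.

```ruby
# Hypothetical reproduction script: tabulates which inputs each
# character-class variant from this report accepts.
PATTERNS = {
  "[\\ufb00]"       => /\A[\ufb00]\z/i,
  "[^\\ufb00]"      => /\A[^\ufb00]\z/i,
  "[^[^\\ufb00]]"   => /\A[^[^\ufb00]]\z/i,
  "[[^[^\\ufb00]]]" => /\A[[^[^\ufb00]]]\z/i,
}.freeze

INPUTS = { "ligature" => "\ufb00", "ff" => "ff" }.freeze

# Returns a hash of pattern label => { input label => true/false }.
def match_table
  PATTERNS.map do |label, regexp|
    [label, INPUTS.transform_values { |input| regexp.match?(input) }]
  end.to_h
end

match_table.each do |label, results|
  puts format("%-18s ligature=%-5s ff=%s", label, results["ligature"], results["ff"])
end
```

If the last two rows differ in the ff column, the interpreter shows the discrepancy described here.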
The cause behind this discrepancy (the fact that [^[^\ufb00]] and [[^[^\ufb00]]] match different strings) is the extra IS_NCCLASS_NOT check in i_apply_case_fold (https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5568).
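To make the effect of a topmost-only check concrete, here is a toy model in Ruby. It is an illustration under stated assumptions, not a transcription of regparse.c: `ToyClass`, `multi_fold_allowed_topmost?` and `multi_fold_allowed_parity?` are hypothetical names, and the parity variant is included only as a point of comparison, not as a proposed patch.

```ruby
# Toy model only: these names are hypothetical and do not mirror regparse.c.
# A class is either negated or not, and may wrap one nested class.
ToyClass = Struct.new(:negated, :inner)

# Models the behavior described in this report: only the outermost
# class's negation flag decides whether multi-code-point folds are kept.
def multi_fold_allowed_topmost?(klass)
  !klass.negated
end

# For comparison: walk the chain of nested classes and allow
# multi-code-point folds when the number of negations is even.
def multi_fold_allowed_parity?(klass)
  negations = 0
  while klass
    negations += 1 if klass.negated
    klass = klass.inner
  end
  negations.even?
end

DOUBLE_NEGATED = ToyClass.new(true, ToyClass.new(true, nil))

TOY_CLASSES = {
  "[\\ufb00]"       => ToyClass.new(false, nil),
  "[^\\ufb00]"      => ToyClass.new(true, nil),
  "[^[^\\ufb00]]"   => DOUBLE_NEGATED,
  "[[^[^\\ufb00]]]" => ToyClass.new(false, DOUBLE_NEGATED),
}.freeze

TOY_CLASSES.each do |label, klass|
  puts "#{label}: topmost=#{multi_fold_allowed_topmost?(klass)} " \
       "parity=#{multi_fold_allowed_parity?(klass)}"
end
```

In this model the two checks disagree only for [^[^\ufb00]], which is the same pattern whose ff result differs from its bracket-wrapped form in the transcripts above.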