Bug #18012

Case-insensitive character classes can only match multiple code points when top-level character class is not negated

Added by jirkamarsik (Jirka Marsik) over 3 years ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
[ruby-core:104435]

Description

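Some Unicode characters case-fold to strings of multiple code points. For example, the ligature \ufb00 folds to the two-character string ff, so a case-insensitive character class containing \ufb00 matches both the ligature itself and the string ff.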

irb(main):001:0> /\A[\ufb00]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):002:0> /\A[\ufb00]\z/i.match("ff")
=> #<MatchData "ff">
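
The folding itself can be inspected outside of a regexp. The following is a minimal sketch that uses `String#downcase(:fold)` as a stand-in for the case folding the regexp engine applies; that the two use the same folding data is an assumption here, and the variable names are illustrative only.

```ruby
# Illustrative sketch: String#downcase(:fold) performs Unicode case folding.
# It is used here as a stand-in for the regexp engine's own folding.
ligature = "\ufb00"                  # LATIN SMALL LIGATURE FF, a single code point
folded   = ligature.downcase(:fold)  # case folding expands it to two code points

# Render a string as a list of code points, e.g. "U+0066 U+0066".
def code_points(str)
  str.codepoints.map { |c| format("U+%04X", c) }.join(" ")
end

puts code_points(ligature)
puts code_points(folded)
```

The first line printed is the single code point of the ligature, the second the two code points of the folded string.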

As expected, when we negate this character class, we can match neither the ligature character \ufb00 nor the string ff.

irb(main):003:0> /\A[^\ufb00]\z/i.match("\ufb00")
=> nil
irb(main):004:0> /\A[^\ufb00]\z/i.match("ff")
=> nil

Then, when we add a second negation, the ligature \ufb00 is back in the character set, but the string ff is still not accepted.

irb(main):005:0> /\A[^[^\ufb00]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):006:0> /\A[^[^\ufb00]]\z/i.match("ff")
=> nil

This reveals that multi-code-point matches in character classes are blocked by negation. However, this is implemented only by checking whether the topmost character class is negated, so nested negations are not taken into account. If we wrap the character class in another set of brackets, the semantics change.

irb(main):007:0> /\A[[^[^\ufb00]]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):008:0> /\A[[^[^\ufb00]]]\z/i.match("ff")
=> #<MatchData "ff">
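
To make the difference concrete, here is a toy model of the two nesting shapes. It is only a sketch: `CharClass`, `allowed_top_only?`, and `allowed_by_parity?` are hypothetical names that do not correspond to anything in regparse.c, and the parity rule is shown purely for contrast, not as a proposed fix.

```ruby
# Toy model only. CharClass and both predicates are hypothetical and do not
# mirror any data structure or function in regparse.c.
CharClass = Struct.new(:negated, :child) do
  # Count the negations from this class down to the innermost class.
  def negation_count
    (negated ? 1 : 0) + (child ? child.negation_count : 0)
  end
end

# The behaviour described in the report: only the topmost class is inspected.
def allowed_top_only?(cc)
  !cc.negated
end

# For contrast: a rule based on the net polarity of all nested negations.
def allowed_by_parity?(cc)
  cc.negation_count.even?
end

double  = CharClass.new(true, CharClass.new(true, nil))  # models [^[^\ufb00]]
wrapped = CharClass.new(false, double)                   # models [[^[^\ufb00]]]

{ "[^[^\\ufb00]]" => double, "[[^[^\\ufb00]]]" => wrapped }.each do |label, cc|
  puts "#{label}: top_only=#{allowed_top_only?(cc)} parity=#{allowed_by_parity?(cc)}"
end
```

In this model the top-only rule gives different answers for the two classes, which mirrors the irb results above, while a rule based on net polarity would treat them identically.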

The cause behind this discrepancy (the fact that [^[^\ufb00]] and [[^[^\ufb00]]] match different strings) is the extra IS_NCCLASS_NOT check in i_apply_case_fold (https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5568).
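
For convenience, the eight irb calls above can be collected into one script that prints a match table. This is a sketch for reproducing the report; `PATTERNS`, `INPUTS`, and `match_table` are names introduced here, and the output depends on the Ruby version in use, so no expected output is listed.

```ruby
# Reproduction sketch: the four character classes from this report, each
# tried against the ligature and against the string "ff".
PATTERNS = {
  '[\ufb00]'       => /\A[\ufb00]\z/i,
  '[^\ufb00]'      => /\A[^\ufb00]\z/i,
  '[^[^\ufb00]]'   => /\A[^[^\ufb00]]\z/i,
  '[[^[^\ufb00]]]' => /\A[[^[^\ufb00]]]\z/i
}.freeze

INPUTS = { "ligature" => "\ufb00", "ff" => "ff" }.freeze

# Build { pattern label => { input label => true/false } }.
def match_table(patterns, inputs)
  patterns.transform_values do |regexp|
    inputs.transform_values { |string| regexp.match?(string) }
  end
end

TABLE = match_table(PATTERNS, INPUTS)

TABLE.each do |label, row|
  puts format("%-16s ligature=%-5s ff=%s", label, row["ligature"], row["ff"])
end
```

If [^[^\ufb00]] and [[^[^\ufb00]]] print different values in the ff column, the discrepancy described here is present in that Ruby build.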
