Bug #17990
open
Inconsistent behavior of Regexp quantifiers over characters with complex case foldings
Description
With case-insensitive Regexps, the string "ff" is considered equal to the string "\ufb00", which consists of a single ligature character.
irb(main):001:0> /ff/i.match("\ufb00")
=> #<MatchData "ff">
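For reference, the equivalence comes from Unicode full case folding, under which the ligature folds to two "f" characters. A minimal sketch of that relationship, assuming String#downcase(:fold) is an acceptable stand-in for illustration (the Regexp engine uses its own folding tables, so this does not show what the matcher does internally):

```ruby
# Illustration only: String#downcase(:fold) applies Unicode case folding.
# The variable names here are not part of the original report.
ligature = "\ufb00"

folded = ligature.downcase(:fold)

puts ligature.length   # character count of the ligature string
puts folded.length     # character count after case folding
puts folded == "ff"    # whether the folded form equals two plain "f" characters
```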
This behavior also persists when the string "ff" doesn't appear literally in the Regexp source but is expressed using a fixed-length quantifier, as in the following:
irb(main):002:0> /f{2}/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):003:0> /f{2,2}/i.match("\ufb00")
=> #<MatchData "ff">
However, this doesn't hold in general. With other quantifiers, the ligature character "\ufb00" is not recognized as a sequence of two "f" characters.
irb(main):004:0> /f*/i.match("\ufb00")
=> #<MatchData "">
irb(main):005:0> /f+/i.match("\ufb00")
=> nil
irb(main):006:0> /f{1,}/i.match("\ufb00")
=> nil
irb(main):007:0> /f{1,2}/i.match("\ufb00")
=> nil
irb(main):008:0> /f{,2}/i.match("\ufb00")
=> #<MatchData "">
irb(main):009:0> /ff?/i.match("\ufb00")
=> nil
This leads to inconsistent behavior where a Regexp like /f{1,2}/i matches fewer strings than the stricter Regexp /f{2,2}/i.
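The inconsistency can be checked mechanically. The following is a rough sketch rather than a definitive test: match_report is a hypothetical helper, the pattern list is just the subset from this report that cannot match the empty string, and the printed results depend on the Ruby version in use.

```ruby
# Hypothetical helper for this report: records which patterns match a subject.
LIGATURE = "\ufb00"
PATTERNS = [/ff/i, /f{2}/i, /f{2,2}/i, /f{1,2}/i, /f+/i, /ff?/i].freeze

def match_report(subject, patterns = PATTERNS)
  # Map each pattern source to true/false depending on whether it matches.
  patterns.to_h { |re| [re.source, re.match?(subject)] }
end

report = match_report(LIGATURE)
report.each { |source, matched| puts format("%-10s %s", "/#{source}/i", matched) }

# Every string matched by f{2,2} should also be matched by the looser f{1,2}.
if report["f{2,2}"] && !report["f{1,2}"]
  puts "inconsistent: /f{2,2}/i matches but /f{1,2}/i does not"
end
```

Running the same helper on the plain string "ff" gives a baseline in which every listed pattern matches.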
I suspect that this is caused by the pattern analyzer directly expanding /f{2}/i and /f{2,2}/i into /ff/i. However, this optimization then changes the semantics of the Regexp, because without the expansion a single ligature character cannot be matched by multiple repetitions of a quantified expression.
While experimenting with this case, I have also discovered a related issue (caused by the problematic expansions of /f{n}/i and the issue reported here: https://bugs.ruby-lang.org/issues/17989).
These match:
/f{100}/i.match("f" * 100)
/f{100}/i.match("\ufb00" * 50)
/f{100}/i.match("\ufb00" * 49 + "ff")
/f{100}/i.match("ff" + "\ufb00" * 49)
However, this doesn't match:
/f{100}/i.match("f" + "\ufb00" * 49 + "f")
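The five cases above can be reproduced in one go. This is only a sketch: the labels and variable names are mine, and the results for the ligature subjects will vary with the Ruby version being tested.

```ruby
# Subjects taken from the cases listed above; the labels are illustrative.
lig = "\ufb00"
subjects = {
  "100 x f"                 => "f" * 100,
  "50 x ligature"           => lig * 50,
  "49 x ligature + ff"      => lig * 49 + "ff",
  "ff + 49 x ligature"      => "ff" + lig * 49,
  "f + 49 x ligature + f"   => "f" + lig * 49 + "f"
}

# Record whether /f{100}/i matches each subject.
results = subjects.transform_values { |s| /f{100}/i.match?(s) }

results.each { |label, matched| puts format("%-24s %s", label, matched) }
```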