Bug #17989
openCase insensitive Regexps do not handle characters with overlapping case foldings
Description
When a Regexp uses the case-insensitive flag, strings are compared by first case folding them and then comparing the case foldings for equality. When a literal string is encountered in a Regexp source, the pattern analyzer tries to enumerate all possible strings that would case fold to the same string as the string in the pattern. In this way, case folding can be avoided when the Regexp is used to match. However, the algorithm used to enumerate all the possible strings which case fold to the same string is not complete. It assumes that the case foldings of different characters do not overlap (i.e. the multi-character case folding of some character cannot be a prefix or suffix of the multi-character case folding of some other character). However, this is not the case for several Unicode characters.
In the code below, many of the equalities A == B
, tested via /A/i.match("B")
, do not hold. Those that do hold hold only because the number of case-equivalent strings detected by the analyzer crosses a threshold at which point the analyzer abandons this optimization.
/\ufb00/i.match("ff") # LATIN SMALL LIGATURE FF
/\ufb01/i.match("fi") # LATIN SMALL LIGATURE FI
/\ufb02/i.match("fl") # LATIN SMALL LIGATURE FL
/\ufb03/i.match("ffi") # LATIN SMALL LIGATURE FFI
/\ufb04/i.match("ffl") # LATIN SMALL LIGATURE FFL
# (ff)i == (ffi)
/\ufb00i/i.match("\ufb03")
# (ffi) == (ff)i
/\ufb03/i.match("\ufb00i")
# f(fi) == (ffi)
/f\ufb01/i.match("\ufb03")
# (ffi) == f(fi)
/\ufb03/i.match("f\ufb01")
# (ff)l == (ffl)
/\ufb00l/i.match("\ufb04")
# (ffl) == (ff)l
/\ufb04/i.match("\ufb00l")
# f(fl) == (ffl)
/f\ufb02/i.match("\ufb04")
# (ffl) == f(fl)
/\ufb04/i.match("f\ufb02")
/\u1f50/i.match("\u03c5\u0313") # GREEK SMALL LETTER UPSILON WITH PSILI
/\u1f52/i.match("\u03c5\u0313\u0300") # GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
/\u1f54/i.match("\u03c5\u0313\u0301") # GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
/\u1f56/i.match("\u03c5\u0313\u0342") # GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
# (upsilon psili) varia == (upsilon psili varia)
/\u1f50\u0300/i.match("\u1f52")
# (upsilon psili varia) == (upsilon psili) varia
/\u1f52/i.match("\u1f50\u0300")
# (upsilon psili) oxia == (upsilon psili oxia)
/\u1f50\u0301/i.match("\u1f54")
# (upsilon psili oxia) == (upsilon psili) oxia
/\u1f54/i.match("\u1f50\u0301")
# (upsilon psili) perispomeni == (upsilon psili perispomeni)
/\u1f50\u0342/i.match("\u1f56")
# (upsilon psili perispomeni) == (upsilon psili) perispomeni
/\u1f56/i.match("\u1f50\u0342")
/\u1fb6/i.match("\u03b1\u0342") # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
/\u1fb7/i.match("\u03b1\u0342\u03b9") # GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
# (alpha perispomeni) ypogegrammeni == (alpha perispomeni ypogegrammeni)
/\u1fb6\u03b9/i.match("\u1fb7")
# (alpha perispomeni ypogegrammeni) == (alpha perispomeni) ypogegrammeni
/\u1fb7/i.match("\u1fb6\u03b9")
/\u1fc6/i.match("\u03b7\u0342") # GREEK SMALL LETTER ETA WITH PERISPOMENI
/\u1fc7/i.match("\u03b7\u0342\u03b9") # GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
# (eta perispomeni) ypogegrammeni == (eta perispomeni ypogegrammeni)
/\u1fc6\u03b9/i.match("\u1fc7")
# (eta perispomeni ypogegrammeni) == (eta perispomeni) ypogegrammeni
/\u1fc7/i.match("\u1fc6\u03b9")
/\u1ff6/i.match("\u03c9\u0342") # GREEK SMALL LETTER OMEGA WITH PERISPOMENI
/\u1ff7/i.match("\u03c9\u0342\u03b9") # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
# (omega perispomeni) ypogegrammeni == (omega perispomeni ypogegrammeni)
/\u1ff6\u03b9/i.match("\u1ff7")
# (omega perispomeni ypogegrammeni) == (omega perispomeni) ypogegrammeni
/\u1ff7/i.match("\u1ff6\u03b9")
Updated by jeremyevans0 (Jeremy Evans) over 3 years ago
- Related to Bug #18010: Character class with single character gets case-folded with following string added