Bug #21870
Regexp: Warnings when using slightly overlapping \p{...} classes
Status: Open
Description
$VERBOSE = true
# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
As far as I can tell, this is a perfectly valid and non-redundant set of Unicode properties, but I am still being spammed with warnings. Using /(?:\p{Word}|\p{S})/ is kind of a workaround, but it is slower (see benchmarks below) and also less clear.
The two properties do overlap somewhat, but I think the deeper issue is that there is no convenient way to express this union without falling back to raw Unicode ranges.
For a similar example, consider /[\p{Word}\p{Cf}]/, whose two properties overlap precisely on ZWJ and ZWNJ. Even with this very small overlap, Ruby issues a warning, even though neither class can be removed without changing the meaning of the regexp. The regexp is valid and, as far as I can tell, has no practical issues: Onigmo seems to be capable of handling overlapping codepoint ranges.
This warning was introduced back in 2009 with #1831, to help surface instances of things like /[:lower:]/ instead of /[[:lower:]]/, but even then the reporter suggested only warning if the class both begins and ends with :.
Is it appropriate to warn here? Is this a job best left to a static linter like Rubocop, which didn't exist at the time #1831 was opened? Or perhaps would it be better to warn only in the very specific case that #1831 was opened to address?
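For anyone who wants to check which patterns trigger the warning on their own Ruby, here is a minimal sketch. The helper name `regexp_warnings` is made up for illustration, and it assumes the warning is emitted through the normal `$stderr` path when a pattern is compiled with `Regexp.new`. Whether the `\p{...}` case warns depends on the Ruby version, as discussed in this report.

```ruby
require 'stringio'

# Hypothetical helper: compiles +source+ at runtime with verbose warnings
# enabled and returns whatever was written to $stderr while doing so.
def regexp_warnings(source)
  old_stderr, old_verbose = $stderr, $VERBOSE
  $stderr = StringIO.new
  $VERBOSE = true
  Regexp.new(source)
  $stderr.string
ensure
  # Always restore the global state, even if the pattern fails to compile.
  $stderr = old_stderr
  $VERBOSE = old_verbose
end

puts regexp_warnings('[aa]')             # a literally repeated character
puts regexp_warnings('[\p{Word}\p{S}]')  # the class from this report
```

An empty string means no warning was printed for that pattern.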
Updated by jneen (Jeanine Adkisson) 15 days ago
- ruby -v changed from 4.0.0, 4.0.1, earlier versions to a lesser extent to 4.0.1
Updated by tompng (tomoya ishida) 15 days ago
I found 130 characters (5 sets of 26 letters) that match both \p{S} and \p{Word}.
They look like letter-like symbol characters:
(0..0x10ffff).select{(s=''<<it; s=~/\p{Word}/&&s=~/\p{S}/) rescue false}.map{''<<it}.join
# ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
# ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
# 🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
# 🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
# 🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉
I'm not sure how to read Unicode properties, but it looks like these characters are Alphabetic:Yes and are also in the Other_Symbol category: https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%92%B6
Updated by jneen (Jeanine Adkisson) 15 days ago
I see! So they do have some overlap. Is it really correct to warn here though? "Fixing" the warning would require falling back to manual unicode ranges.
Updated by jneen (Jeanine Adkisson) 14 days ago
- Subject changed from Regexp: Warnings when using multiple non-overlapping \p{...} classes to Regexp: Warnings when using slightly overlapping \p{...} classes
Updated by jneen (Jeanine Adkisson) 14 days ago
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 14 days ago
Another example of this is /[\p{Word}\p{Cf}]/, whose two classes seem to overlap precisely on ZWNJ (U+200C) and ZWJ (U+200D).
[1] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16 }
=> ["200c", "200d"]
[2] pry(main)> /[\p{Word}\p{Cf}]/
(pry):5: warning: character class has duplicated range: /[\p{Word}\p{Cf}]/
=> /[\p{Word}\p{Cf}]/
[3] pry(main)>
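The one-liners above can be generalized into a small helper for checking any pair of properties. This is only a sketch: `property_overlap` and `SCALAR_RANGES` are names invented here, and the result depends on the Unicode data shipped with the Ruby that runs it. It skips the surrogate range up front instead of rescuing, since surrogates cannot be encoded as valid UTF-8 strings.

```ruby
# Unicode scalar values only; surrogates (U+D800..U+DFFF) are excluded
# because they cannot be encoded as valid UTF-8 strings.
SCALAR_RANGES = [0x0000..0xD7FF, 0xE000..0x10FFFF].freeze

# Hypothetical helper: returns the code points matched by both
# \p{prop_a} and \p{prop_b} on the running Ruby.
def property_overlap(prop_a, prop_b)
  re_a = Regexp.new("\\p{#{prop_a}}")
  re_b = Regexp.new("\\p{#{prop_b}}")
  SCALAR_RANGES.flat_map do |range|
    range.select do |cp|
      ch = [cp].pack('U')
      re_a.match?(ch) && re_b.match?(ch)
    end
  end
end

# Print the overlap as U+XXXX labels; the list varies by Ruby version.
puts property_overlap('Word', 'Cf').map { |cp| format('U+%04X', cp) }.join(' ')
```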
Updated by jneen (Jeanine Adkisson) 14 days ago
- Description updated (diff)
That specific case also appears to have changed, e.g. on 3.4.1:
[2] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16}
=> []
Maybe for preset classes like \p{...} and [[:alpha:]] we should only warn if one range completely subsumes another?
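To make the "completely subsumes" condition concrete, here is a rough sketch of what such a check means, written as a brute-force test in Ruby rather than as anything resembling how Onigmo would implement it. `property_subsumes?` and `ALL_SCALARS` are hypothetical names used only for this illustration.

```ruby
# Unicode scalar values only (the surrogate range is skipped).
ALL_SCALARS = [0x0000..0xD7FF, 0xE000..0x10FFFF].freeze

# Hypothetical check: true when every code point matched by \p{inner}
# is also matched by \p{outer}, i.e. listing inner adds nothing.
def property_subsumes?(outer, inner)
  re_outer = Regexp.new("\\p{#{outer}}")
  re_inner = Regexp.new("\\p{#{inner}}")
  ALL_SCALARS.all? do |range|
    range.all? do |cp|
      ch = [cp].pack('U')
      !re_inner.match?(ch) || re_outer.match?(ch)
    end
  end
end

puts property_subsumes?('Alnum', 'Alpha')  # true: Alpha is redundant next to Alnum
puts property_subsumes?('Word', 'S')       # false: e.g. "+" is a symbol but not a word character
```

Under this rule, a class like /[\p{Word}\p{S}]/ would not warn, because neither property makes the other redundant.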
Updated by jneen (Jeanine Adkisson) 14 days ago
- Description updated (diff)
Updated by mame (Yusuke Endoh) 14 days ago
Updated by mame (Yusuke Endoh) 14 days ago
- Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added
Updated by trinistr (Alexander Bulancov) 14 days ago
> Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
Updated by kddnewton (Kevin Newton) 14 days ago
This might be a good opportunity to add the || operator from the Unicode spec (https://www.unicode.org/reports/tr18/#Subtraction_and_Intersection). We could make that one not warn, because it's explicitly desired. As in:
$VERBOSE = true
regex = /[\p{Word}\p{S}]/ # warning
regex = /[\p{Word}||\p{S}]/ # no warning
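For context, the `||` form above is only a proposal, but Ruby's bracket expressions already accept `&&` for intersection, so an explicit union operator would be its natural counterpart. A small sketch of the existing operator follows; the constant names are arbitrary.

```ruby
# Intersection inside a bracket expression: lowercase ASCII consonants.
CONSONANT = /[a-z&&[^aeiou]]/
puts CONSONANT.match?('b')   # true
puts CONSONANT.match?('e')   # false

# The same operator applied to two Unicode properties: characters that
# are in both \p{Word} and \p{S}.
WORD_AND_SYMBOL = Regexp.new('[\p{Word}&&\p{S}]')
puts WORD_AND_SYMBOL.match?('a')   # false: "a" is a word character but not a symbol
```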
Updated by jneen (Jeanine Adkisson) 14 days ago · Edited
trinistr (Alexander Bulancov) wrote in #note-11:
> > Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
> Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
This is what I actually tested. Still much slower.
mame (Yusuke Endoh) wrote in #note-9:
> jneen (Jeanine Adkisson) wrote in #note-7:
> > That specific case also appears to have changed, e.g. on 3.4.1:
> It is an intentional bug fix. See #21503.
> While I understand your trouble, this warning is functioning exactly as intended. How do you suggest resolving it?
I suppose the question is - what is the purpose of a warning here? What fix are you asking the code author to implement? If my downstream users are running with warnings on and Ruby prints 1000 lines of warnings loading my library, what exactly am I being warned about?
Is there a specific danger to using overlapping character classes? Or should this kind of thing live in a linter like Rubocop, which has overrides and toggles?
Updated by maxfelsher (Max Felsher) 13 days ago
If I'm reading the history right, the warning was added in #1831 in order to catch mistakes like a regexp defined as /[:lower:]/ (as opposed to /[[:lower:]]/, I assume). I can see the value in that, but it does seem like there should be a way to list overlapping character classes without a warning (and without turning warnings off completely).
Updated by jneen (Jeanine Adkisson) 13 days ago
That's a very interesting find!
I do think it makes sense to warn if an explicitly written character repeats in a character class, or if the class begins and ends with a colon. But for overlapping unicode properties, there doesn't seem to be any danger in including both in a character class.
That said, there's still an argument that all of this is a job for a linter. Rubocop didn't exist until about a year after #1831 was opened.
Updated by jneen (Jeanine Adkisson) 13 days ago · Edited
Some benchmarks:
$ ruby --version
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [arm64-darwin25]
require 'benchmark'
LENGTH = 1000000
REPEAT = 100
TEST_STR = 'a' * LENGTH
Benchmark.bm do |bm|
  bm.report "char class:" do
    REPEAT.times { /[\p{Word}\p{S}]*/o.match?(TEST_STR) }
  end
  bm.report "alternation:" do
    REPEAT.times { /(?:\p{Word}|\p{S})*/o.match?(TEST_STR) }
  end
end
output:
user system total real
char class: 0.634908 0.302112 0.937020 ( 0.937089)
alternation: 0.983069 0.449849 1.432918 ( 1.433005)
The alternation syntax is understandably a bit slower, as it would be two nodes in the state machine rather than one unified range test. I expect this effect would be worse when more unicode properties are piled on (as they tend to be in practice), resulting in extra nodes.
Either way, /[\p{Word}\p{S}]/ is a perfectly valid regular expression that as far as I know doesn't have any practical issues, so I don't think it is helpful to warn. Perhaps if one class completely subsumes another (say, /[\p{Alnum}\p{Alpha}]/) but even then I don't think it's particularly helpful, or anything that couldn't be handled by a static linter.
Updated by jneen (Jeanine Adkisson) 13 days ago
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 13 days ago
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 6 days ago
This isn't even possible to work around by targeting RUBY_VERSION, as Ruby warns even in unreachable cases:
regex = if RUBY_VERSION < '4'
  /[\p{Word}\p{Cf}]/
else
  /[\p{Word}]/
end
This still warns on Ruby 4+, even though the first branch is not reachable in that version.
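One possible way around this, sketched here under the assumption that the warning is raised when a pattern is compiled (so a literal in any branch triggers it as soon as the file is parsed): keep the patterns as plain strings and compile only the selected one with `Regexp.new`. The constant names are hypothetical, and the cost is giving up regexp literal syntax.

```ruby
# Keep the patterns as plain strings so nothing is compiled (or warned
# about) until Regexp.new runs on the branch that is actually taken.
WORD_OR_FORMAT_SOURCE =
  if RUBY_VERSION < '4'
    '[\p{Word}\p{Cf}]'
  else
    '[\p{Word}]'
  end

WORD_OR_FORMAT = Regexp.new(WORD_OR_FORMAT_SOURCE)

puts WORD_OR_FORMAT.match?('a')   # true on either branch
```

This does not help with classes like /[\p{Word}\p{S}]/, where both properties are needed on every version.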