Feature #19930

Updated by nobu (Nobuyoshi Nakada) 7 months ago

cf. https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-Character+Classes 

 > POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category. 

 Reading this description, one would generally expect metacharacters to be ASCII-only and POSIX _bracket expressions_ to be Unicode-aware. But because _bracket expressions_ are POSIX-compliant, `[:xdigit:]`, for example, covers only the ASCII range `[A-Fa-f0-9]` and not the `Hex_Digit` Unicode property, which also includes characters from the Halfwidth and Fullwidth Forms block such as `０` (U+FF10, FULLWIDTH DIGIT ZERO). The description is therefore confusing, since we would expect `[[:xdigit:]]` to _encompass non-ASCII characters_ too. Conversely, `[:space:]` matches `[\p{Z}\t\r\n\v\f]` (`\s` plus `\p{Z}` (Separator)), while the description mentions only `[:blank:], newline, carriage return`. 
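 To make the cases above easy to check on a given Ruby build, here is a minimal sketch. The `PROBES` and `PATTERNS` tables and the `class_matrix` helper are names invented for this illustration, not part of Ruby. It only reports which patterns match each probe character, so the result reflects whatever the installed regexp engine actually does.

```ruby
# Probe characters: one ASCII digit and a few non-ASCII code points.
# (Illustrative selection; extend as needed.)
PROBES = {
  "ASCII DIGIT SEVEN"        => "7",
  "ARABIC-INDIC DIGIT THREE" => "\u0663",
  "FULLWIDTH DIGIT ZERO"     => "\uFF10",
  "NO-BREAK SPACE"           => "\u00A0",
}.freeze

# Shorthand metacharacters, POSIX brackets, and one Unicode property.
PATTERNS = {
  '\d'            => /\d/,
  '[[:digit:]]'   => /[[:digit:]]/,
  '\h'            => /\h/,
  '[[:xdigit:]]'  => /[[:xdigit:]]/,
  '\p{Hex_Digit}' => /\p{Hex_Digit}/,
  '\s'            => /\s/,
  '[[:space:]]'   => /[[:space:]]/,
}.freeze

# Returns { probe label => { pattern source => true/false } }.
def class_matrix(probes = PROBES, patterns = PATTERNS)
  probes.to_h do |label, char|
    [label, patterns.transform_values { |re| re.match?(char) }]
  end
end

# Print, for each probe, the patterns that matched it.
class_matrix.each do |label, row|
  hits = row.select { |_, matched| matched }.keys
  puts "#{label}: #{hits.empty? ? '(none)' : hits.join(', ')}"
end
```

 If the behavior is as described in this issue, U+FF10 should appear under `[[:digit:]]` and `\p{Hex_Digit}` but not under `[[:xdigit:]]`, and U+00A0 should appear under `[[:space:]]` but not under `\s`.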

 My point is that, in the end, it's hard to determine which ranges to expect for character classes from the Ruby Regexp documentation alone. To know the exact behavior, I have to read the source code or at least the POSIX spec. 

 My feature request is to add a comparison table like the one at https://www.regular-expressions.info/posixbrackets.html (for Java) with these columns: the POSIX bracket expression, the description, the exact ASCII range, the exact Unicode range, the shorthand metacharacter (ASCII), and the long escape sequence (Unicode). That way we would know precisely what to expect just by reading the doc.
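
 As a rough illustration of how the "exact range" columns of such a table could be derived empirically rather than from the source code, the sketch below scans code points and collapses the matches into ranges. `matched_ranges` and `format_ranges` are hypothetical helper names, and the brute-force scan is only one possible approach, assuming UTF-8 strings.

```ruby
# Collect the code points matched by a single-character regexp and
# collapse consecutive ones into ranges. Brute force: tests every
# code point from 0 up to `limit`.
def matched_ranges(regexp, limit = 0x10FFFF)
  ranges = []
  (0..limit).each do |cp|
    next if (0xD800..0xDFFF).cover?(cp)            # surrogates are not valid in UTF-8
    next unless regexp.match?([cp].pack("U"))      # pack("U") builds a UTF-8 string
    if ranges.last && ranges.last.last == cp - 1
      ranges[-1] = (ranges.last.first..cp)         # extend the current run
    else
      ranges << (cp..cp)                           # start a new run
    end
  end
  ranges
end

# Render ranges in the usual U+XXXX notation.
def format_ranges(ranges)
  ranges.map do |r|
    if r.first == r.last
      format("U+%04X", r.first)
    else
      format("U+%04X..U+%04X", r.first, r.last)
    end
  end.join(", ")
end

# Example: compare a POSIX bracket with its shorthand over the BMP.
puts "[[:xdigit:]] => #{format_ranges(matched_ranges(/[[:xdigit:]]/, 0xFFFF))}"
puts "\\h           => #{format_ranges(matched_ranges(/\h/, 0xFFFF))}"
```

 Running the same helper over `[[:space:]]`, `\s`, and `\p{Z}` would give the rows needed to compare those classes side by side.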
