Bug #10891

closed

/[[:punct:]]/ POSIX group broken (with string literals?)

Added by tom-lord (Tom Lord) almost 8 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Target version:
-
ruby -v:
ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux]
[ruby-core:68254]

Description

The regular expression: /[[:punct:]]/ should match the following characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

However, it only works for these characters:

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

And does not work for these characters:

$ + < = > ^ ` | ~

However, this is where it gets really weird... Consider the following:

60.chr == "<" # => true
60.chr =~ /[[:punct:]]/ # => 0
"<" =~ /[[:punct:]]/ # => nil

So, it seems that the regular expression only fails for string literals!
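To reproduce the two lists above on any given Ruby, a minimal sketch along these lines may help. The helper name `punct_partition` is hypothetical (not part of Ruby), and each character is tagged with an explicit encoding rather than relying on how the literal was written. The output depends on the Ruby version, so none is shown here.

```ruby
# Hypothetical helper: split the 32 ASCII punctuation characters into those
# that match /[[:punct:]]/ and those that do not, after tagging each
# character with the given encoding.
def punct_partition(encoding)
  # Printable ASCII minus letters and digits leaves the 32 punctuation characters.
  chars = (33..126).map(&:chr).reject { |c| c =~ /[A-Za-z0-9]/ }
  chars.partition do |c|
    !(c.dup.force_encoding(encoding) =~ /[[:punct:]]/).nil?
  end
end

matched, unmatched = punct_partition(Encoding::UTF_8)
puts "matched:   #{matched.join(' ')}"
puts "unmatched: #{unmatched.join(' ')}"
```

Passing `Encoding::US_ASCII` instead of `Encoding::UTF_8` gives the comparison case.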


Updated by nobu (Nobuyoshi Nakada) almost 8 years ago

  • Description updated (diff)

It occurs with UTF-8 encoding only.

Updated by tom-lord (Tom Lord) almost 8 years ago

Nobuyoshi Nakada wrote:

It occurs with UTF-8 encoding only.

Ahhhhh, of course - that's what the difference between 60.chr and "<" is!

Like you said, the issue only affects UTF-8 encodings:

#<Encoding:UTF-8>, #<Encoding:UTF8-MAC>, #<Encoding:UTF8-DoCoMo>, #<Encoding:UTF8-KDDI>, #<Encoding:UTF8-SoftBank>
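A rough way to check this per encoding is sketched below. The helper name `lt_matches_punct?` is hypothetical, and the encoding names are taken from the list above plus US-ASCII for comparison; any encoding missing from the local Ruby build is skipped. Results depend on the Ruby version.

```ruby
# 60.chr is tagged US-ASCII, which is why it behaved differently from a
# "<" literal in a UTF-8 source file.
puts 60.chr.encoding  # US-ASCII

# Hypothetical helper: does "<" match /[[:punct:]]/ when tagged with `encoding`?
def lt_matches_punct?(encoding)
  !("<".dup.force_encoding(encoding) =~ /[[:punct:]]/).nil?
end

names = %w[US-ASCII UTF-8 UTF8-MAC UTF8-DoCoMo UTF8-KDDI UTF8-SoftBank]
results = {}
names.each do |name|
  begin
    enc = Encoding.find(name)
  rescue ArgumentError
    next  # encoding not available in this build
  end
  results[enc.name] = lt_matches_punct?(enc)
end

results.each { |name, ok| puts "#{name}: #{ok}" }
```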

Updated by tom-lord (Tom Lord) almost 8 years ago

On further investigation, this is a known issue in Onigmo (Ruby 2.x's regexp engine).

However, it was apparently "fixed" way back in 2006: https://github.com/k-takata/Onigmo/blob/d0b3173893b9499a4e53ae1da16ba76c06d85571/HISTORY#L584-585 (Note: I can't find a reference to any Oniguruma/Onigmo source control dating back this far, to see the actual commit)

...And yet, it remains an open issue: https://github.com/k-takata/Onigmo/issues/42

Updated by shugo (Shugo Maeda) about 7 years ago

  • Assignee changed from ruby-core to naruse (Yui NARUSE)

How about interpreting [[:punct:]] as [\p{P}\p{S}] for Unicode strings, so that [[:punct:]] becomes a superset of the POSIX class?
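As a quick check of the superset claim, the sketch below tests whether [\p{P}\p{S}] covers all 32 ASCII punctuation characters in UTF-8. The variable names are illustrative only, and the /u flag is used so that the Unicode property names are accepted regardless of the source encoding.

```ruby
# The 32 ASCII punctuation characters: printable ASCII minus letters and digits.
ascii_punct = (33..126).map(&:chr).reject { |c| c =~ /[A-Za-z0-9]/ }

# The proposed interpretation: Unicode punctuation plus Unicode symbols.
proposed = /[\p{P}\p{S}]/u

# Collect any ASCII punctuation character the proposed class fails to match.
missing = ascii_punct.reject { |c| c.encode("UTF-8") =~ proposed }

if missing.empty?
  puts "all #{ascii_punct.size} ASCII punctuation characters are covered"
else
  puts "not covered: #{missing.join(' ')}"
end
```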

Updated by naruse (Yui NARUSE) about 7 years ago

  • Status changed from Open to Feedback

It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct

Updated by shugo (Shugo Maeda) about 7 years ago

Yui NARUSE wrote:

It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct

In general, it would be a reasonable choice.

However, in Ruby, the problem is that it's hard to guess the programmer's intention from the code,
because the behavior is decided not by the regular expression, but by the target string.

def do_something(s)
  # ...
  if /[[:punct:]]/ =~ s  # should "<" match here or not?
    # ...
  end
  # ...
end

If you want to reject symbols, /\p{P}/ can be used instead, and it's more readable.
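A small sketch of that suggestion follows. The helper name `punctuation_only?` is hypothetical; it tests strings against /\p{P}/ only, so characters in the Unicode Symbol categories such as "<" and "$" are rejected.

```ruby
# Hypothetical helper: true when the string contains a character in the
# Unicode Punctuation categories (P*). Symbols (S*) are not included.
def punctuation_only?(str)
  !(str.encode("UTF-8") =~ /\p{P}/u).nil?
end

%w[! . < $].each do |c|
  puts "#{c.inspect}: #{punctuation_only?(c)}"
end
```

Here "!" and "." are punctuation, while "<" and "$" are symbols, so the intent is visible in the pattern itself rather than depending on the encoding of the target string.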

Updated by jeremyevans0 (Jeremy Evans) over 3 years ago

  • Status changed from Feedback to Closed

This was apparently fixed between Ruby 2.3 and 2.4:

$ ruby23 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
nil
$ ruby24 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
0