Bug #10891: /[[:punct:]]/ POSIX group broken (with string literals?)

Added by tom-lord (Tom Lord) over 10 years ago. Updated about 6 years ago.

Status: Closed
Assignee: naruse (Yui NARUSE)
Target version:
ruby -v: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux]
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN

[ruby-core:68254]

Description

The regular expression: /[[:punct:]]/ should match the following characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

However, it only works for these characters:

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

And does not work for these characters:

$ + < = > ^ ` | ~

However, this is where it gets really weird... Consider the following:

60.chr == "<" # => true
60.chr =~ /[[:punct:]]/ # => 0
"<" =~ /[[:punct:]]/ # => nil

So, it seems that the regular expression only fails for string literals!

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

Description updated

It occurs with UTF-8 encoding only.

#2 [ruby-core:68263]

Updated by tom-lord (Tom Lord) over 10 years ago

Nobuyoshi Nakada wrote:

> It occurs with UTF-8 encoding only.

Ahhhhh, of course - that's what the difference between 60.chr and "<" is!

Like you said, the issue only affects UTF-8 encodings:

#<Encoding:UTF-8>, #<Encoding:UTF8-MAC>, #<Encoding:UTF8-DoCoMo>, #<Encoding:UTF8-KDDI>, #<Encoding:UTF8-SoftBank>
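
To make the difference between the two strings concrete, here is a minimal sketch. It assumes only core Ruby; the variable names are illustrative, and force_encoding stands in for a literal written in a UTF-8 source file. The result of the last match is deliberately left without an expected value, because it depends on the Ruby version.

```ruby
# Integer#chr without an argument returns a US-ASCII string for code points below 128.
from_chr = 60.chr

# Stand-in for a "<" literal in a UTF-8 source file (dup avoids mutating a literal).
literal = "<".dup.force_encoding("UTF-8")

puts from_chr.encoding.name   # US-ASCII
puts literal.encoding.name    # UTF-8

# The two strings compare equal: same bytes, and both are ASCII-only.
puts from_chr == literal      # true

# The US-ASCII string matches, as reported above.
p from_chr =~ /[[:punct:]]/   # 0

# The UTF-8 string is the case this report is about; its result is version-dependent.
p literal =~ /[[:punct:]]/
```

The same regexp object is used in both matches, so the only input that changes is the encoding of the target string.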

#3 [ruby-core:68280]

Updated by tom-lord (Tom Lord) over 10 years ago

On further investigation, this is a known issue in Onigmo (Ruby 2.x's regexp engine).

However, it was apparently "fixed" way back in 2006: https://github.com/k-takata/Onigmo/blob/d0b3173893b9499a4e53ae1da16ba76c06d85571/HISTORY#L584-585 (Note: I can't find a reference to any Oniguruma/Onigmo source control dating back this far, to see the actual commit)

...And yet, it remains an open issue: https://github.com/k-takata/Onigmo/issues/42

#4 [ruby-core:71742]

Updated by shugo (Shugo Maeda) over 9 years ago

Assignee changed from core to naruse (Yui NARUSE)

How about interpreting [[:punct:]] as [\p{P}\p{S}] for Unicode strings, so that [[:punct:]] is a superset of the POSIX class?

#5 [ruby-core:71746]

Updated by naruse (Yui NARUSE) over 9 years ago

Status changed from Open to Feedback

It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct

#6 [ruby-core:71756]

Updated by shugo (Shugo Maeda) over 9 years ago

Yui NARUSE wrote:

> It follows UTR#18's Standard Recommendation.
> http://www.unicode.org/reports/tr18/#punct

In general, it would be a reasonable choice.

However, in Ruby, the problem is that it's hard to guess the programmer's intention from the code,
because the behavior is decided not by the regular expression but by the target string.

def do_something(s)
  ...
  if /[[:punct:]]/ =~ s  # should "<" match, or shouldn't?
    ...
  end
  ...
end

If you want to reject symbols, /\p{P}/ can be used instead, and it's more readable.
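
As a rough illustration of the punctuation/symbol split discussed in this thread, the sketch below partitions the 32 ASCII punctuation characters by Unicode general category. It assumes the script is saved as a UTF-8 source file (Ruby's default), and the variable names are made up for the example. It uses only \p{P} and \p{S}, so it does not depend on how a given Ruby version treats [[:punct:]].

```ruby
# Printable ASCII (0x21..0x7e) minus letters and digits, built as UTF-8 strings.
ascii_punct = (0x21..0x7e).map { |b| b.chr("UTF-8") }
                          .reject { |c| c =~ /[A-Za-z0-9]/ }

# Split on the Unicode "Punctuation" general category.
punctuation, symbols = ascii_punct.partition { |c| c =~ /\p{P}/ }

puts punctuation.join(" ")
puts symbols.join(" ")   # $ + < = > ^ ` | ~

# Everything rejected by \p{P} falls into the "Symbol" category instead.
puts symbols.all? { |c| c =~ /\p{S}/ }   # true
```

The second line of output is the same set of characters listed in the description as not matching, which is consistent with [[:punct:]] being treated as \p{P} for UTF-8 strings on the affected versions.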

#7 [ruby-core:93600]

Updated by jeremyevans0 (Jeremy Evans) about 6 years ago

Status changed from Feedback to Closed

This was apparently fixed between Ruby 2.3 and 2.4:

$ ruby23 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
nil
$ ruby24 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
0
