Bug #10891: /[[:punct:]]/ POSIX group broken (with string literals?) - Ruby - Ruby Issue Tracking System

Added by tom-lord (Tom Lord) over 11 years ago. Updated almost 7 years ago.

Status: Closed
Assignee: naruse (Yui NARUSE)
Target version:
ruby -v: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux]
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN

[ruby-core:68254]

Description

The regular expression: /[[:punct:]]/ should match the following characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

However, it only works for these characters:

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

And does not work for these characters:

$ + < = > ^ ` | ~

However, this is where it gets really weird... Consider the following:

60.chr == "<"           # => true
60.chr =~ /[[:punct:]]/ # => 0
"<" =~ /[[:punct:]]/    # => nil

So, it seems that the regular expression only fails for string literals!
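The two lists above can be reproduced by testing each of the 32 ASCII punctuation code points against the class, once per string encoding. This is only a sketch: the variable names are made up for illustration, and what it prints depends on the Ruby version it runs on.

```ruby
# The 32 ASCII code points that POSIX assigns to the "punct" class.
punct_codes = [*33..47, *58..64, *91..96, *123..126]

# For each encoding, collect the characters that [[:punct:]] fails to match.
unmatched = {}
%w[US-ASCII UTF-8].each do |enc|
  unmatched[enc] = punct_codes.map { |code| code.chr(enc) }
                              .reject { |ch| ch =~ /[[:punct:]]/ }
end

unmatched.each do |enc, chars|
  puts "#{enc}: #{chars.empty? ? '(all matched)' : chars.join(' ')}"
end
```

Going by this report, an affected Ruby would be expected to list the nine characters $ + < = > ^ ` | ~ on the UTF-8 line only.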

Updated by nobu (Nobuyoshi Nakada) over 11 years ago
#1

Description updated (diff)

It occurs with UTF-8 encoding only.

Updated by tom-lord (Tom Lord) over 11 years ago
#2 [ruby-core:68263]

Nobuyoshi Nakada wrote:

It occurs with UTF-8 encoding only.

Ahhhhh, of course - that's what the difference between 60.chr and "<" is!

Like you said, the issue only affects UTF-8 encodings:

#<Encoding:UTF-8>, #<Encoding:UTF8-MAC>, #<Encoding:UTF8-DoCoMo>, #<Encoding:UTF8-KDDI>, #<Encoding:UTF8-SoftBank>
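To make the difference concrete, here is a small sketch that prints the encoding of each string before matching. The literal is encoded to UTF-8 explicitly so the example does not depend on the source encoding of the file it runs in; the variable names are illustrative only.

```ruby
# Integer#chr with no argument returns a US-ASCII string for code points below 0x80.
from_chr = 60.chr
# Encode explicitly so the string is UTF-8 regardless of the source encoding.
literal = "<".encode(Encoding::UTF_8)

p from_chr.encoding
p literal.encoding
p(from_chr == literal)            # ASCII-only strings compare equal across these encodings
p(from_chr =~ /[[:punct:]]/)
p(literal =~ /[[:punct:]]/)       # the result that differs on affected Ruby versions
```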

Updated by tom-lord (Tom Lord) over 11 years ago
#3 [ruby-core:68280]

On further investigation, this is a known issue in Onigmo (Ruby 2.x's regexp engine).

However, it was apparently "fixed" way back in 2006: https://github.com/k-takata/Onigmo/blob/d0b3173893b9499a4e53ae1da16ba76c06d85571/HISTORY#L584-585 (Note: I can't find any Oniguruma/Onigmo source control history dating back that far, so I can't check the actual commit.)

...And yet, it remains an open issue: https://github.com/k-takata/Onigmo/issues/42

Updated by shugo (Shugo Maeda) over 10 years ago
#4 [ruby-core:71742]

Assignee changed from core to naruse (Yui NARUSE)

How about interpreting [[:punct:]] as [\p{P}\p{S}] for Unicode strings, so that [[:punct:]] becomes a superset of the POSIX class?
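A rough check of the superset claim, assuming the 32 ASCII punctuation code points as the POSIX baseline (the variable names are illustrative, not from any patch):

```ruby
# The 32 ASCII code points that POSIX assigns to the "punct" class.
punct_codes = [*33..47, *58..64, *91..96, *123..126]

# The proposed interpretation: Unicode Punctuation plus Unicode Symbol.
proposed = /[\p{P}\p{S}]/

# Any POSIX punctuation character the proposed class would miss.
missing = punct_codes.map { |code| code.chr("UTF-8") }
                     .reject { |ch| ch =~ proposed }
p missing

# The proposed class also reaches beyond ASCII.
p("\u20AC" =~ proposed)  # U+20AC EURO SIGN, general category Sc
p("\u00BF" =~ proposed)  # U+00BF INVERTED QUESTION MARK, general category Po
```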

Updated by naruse (Yui NARUSE) over 10 years ago
#5 [ruby-core:71746]

Status changed from Open to Feedback

It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct

Updated by shugo (Shugo Maeda) over 10 years ago
#6 [ruby-core:71756]

Yui NARUSE wrote:

It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct

In general, it would be a reasonable choice.

However, in Ruby, the problem is that it's hard to guess the programmer's intention from the code,
because the behavior is determined not by the regular expression but by the encoding of the target string.

def do_something(s)
  ...
  if /[[:punct:]]/ =~ s  # should "<" match, or shouldn't?
    ...
  end
  ...
end

If you want to reject symbols, /\p{P}/ can be used instead, and it's more readable.
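As a small illustration of that last point, the Unicode property classes state the intent directly. This sketch uses UTF-8 strings only, and the variable names are made up:

```ruby
lt   = "<".encode(Encoding::UTF_8)  # U+003C LESS-THAN SIGN, general category Sm (a symbol)
bang = "!".encode(Encoding::UTF_8)  # U+0021 EXCLAMATION MARK, general category Po (punctuation)

p(bang =~ /\p{P}/)  # => 0   punctuation is accepted
p(lt =~ /\p{P}/)    # => nil symbols are rejected
p(lt =~ /\p{S}/)    # => 0   symbols can be matched explicitly when wanted
```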

Updated by jeremyevans0 (Jeremy Evans) almost 7 years ago
#7 [ruby-core:93600]

Status changed from Feedback to Closed

This was apparently fixed between Ruby 2.3 and 2.4:

$ ruby23 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
nil
$ ruby24 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)' 
0
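For code that still has to run on Ruby 2.3 or earlier, one possible workaround (an assumption on my part, not something proposed in this thread) is to spell out the ASCII punctuation ranges instead of relying on the POSIX bracket class:

```ruby
# Explicit ranges covering the 32 ASCII punctuation code points:
# ! through /, : through @, [ through `, { through ~
ascii_punct = /[!-\/:-@\[-`{-~]/

punct_codes = [*33..47, *58..64, *91..96, *123..126]

# Every ASCII punctuation character should match, whatever the string encoding.
p(punct_codes.all? { |code| code.chr("UTF-8") =~ ascii_punct })
p(punct_codes.all? { |code| code.chr("US-ASCII") =~ ascii_punct })

# Letters, digits and spaces should not.
p(["a", "Z", "0", "9", " "].none? { |ch| ch =~ ascii_punct })
```

Unlike [[:punct:]], this class never matches non-ASCII punctuation, so it only fits where ASCII-only behavior is what you want.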
