Bug #3386: Inconsistent regexp punct class matching behavior between UTF-8 and ASCII encodings - Backport191 - Ruby Issue Tracking System

Bug #3386

closed

Inconsistent regexp punct class matching behavior between UTF-8 and ASCII encodings

Added by jyeung (Jeffrey Yeung) over 14 years ago. Updated over 13 years ago.

Status:

Rejected

Assignee:

ruby -v:

ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]

[ruby-core:30579]

Description

=begin
Scenario:

Use a Regexp pattern that includes the [:punct:] character class (or the \p{Punct} expression) on strings containing only standard punctuation characters `~!@#$%^&*()_+-=[]{}|;':",./<>?.

Issue:

The match results on UTF-8 encoded strings are unexpectedly different from those on ASCII encoded strings.

I have observed two issues:

1. The [[:punct:]] expression does not match the characters `~$^+=|<> when applied to UTF-8 strings.
2. The \p{^Punct} and \P{Punct} expressions give different results when applied to UTF-8 strings; the latter (\P{Punct}) seems to be incorrect.
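
The two observations can be narrowed down to individual characters with a small sketch like the one below. The helper name unmatched_chars and the variable names are hypothetical and not part of the report. The printed results depend on the Ruby version and on the source file's encoding, so no expected output is shown.

```ruby
# Hypothetical helper (not from the report): returns the characters of str
# that pattern fails to match, testing one character at a time.
def unmatched_chars(str, pattern)
  str.each_char.reject { |c| c =~ pattern }.join
end

# Same punctuation characters as the report's test string.
sample = %q(`~!@#$%^&*()_+-=[]\{}|;':",./<>?)

# Issue 1: compare [[:punct:]] across encodings.
%w[US-ASCII UTF-8].each do |name|
  s = sample.encode(name)
  puts "#{name}: [[:punct:]] misses '#{unmatched_chars(s, /[[:punct:]]/)}'"
end

# Issue 2: both forms negate \p{Punct}, so they would be expected to agree.
utf8 = sample.encode('UTF-8')
caret   = utf8.each_char.select { |c| c =~ /\p{^Punct}/ }.join
capital = utf8.each_char.select { |c| c =~ /\P{Punct}/ }.join
puts "\\p{^Punct} matches '#{caret}'"
puts "\\P{Punct} matches '#{capital}'"
```

Checking one character at a time avoids having to read the differences off the gsub results by eye.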

To illustrate these, here is a bit of Ruby code:
# Single-quoted literal: \' is an escaped quote; \{ stays a backslash plus brace.
teststr = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'
teststr2 = teststr.encode('UTF-8')
teststr3 = teststr.encode('ASCII-8BIT')

# Strip characters with each punct-class pattern and print what remains.
def gsub_tests(teststr)
  puts "String (#{teststr.encoding}): '#{teststr}'"
  strout1 = teststr.gsub(/[[:punct:]]/, '')    # POSIX bracket
  strout2 = teststr.gsub(/[^[:punct:]]/, '')   # negated POSIX bracket
  strout3 = teststr.gsub(/\p{Punct}/, '')      # property
  strout4 = teststr.gsub(/\p{^Punct}/, '')     # negated property (caret form)
  strout5 = teststr.gsub(/\P{Punct}/, '')      # negated property (capital P form)
  puts " Output 1 = '#{strout1}'"
  puts " Output 2 = '#{strout2}'"
  puts " Output 3 = '#{strout3}'"
  puts " Output 4 = '#{strout4}'"
  puts " Output 5 = '#{strout5}'"
end

gsub_tests(teststr)
gsub_tests(teststr2)
gsub_tests(teststr3)

Here is the output I observe when running the above code:
$ ruby test.rb
String (US-ASCII): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = ''
 Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
String (UTF-8): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = '`~$^+=|<>'
 Output 2 = '!@#%&*()_-[]\{};':",./?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '!@#%&*()_-[]\{};':",./?'
String (ASCII-8BIT): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = ''
 Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'

Note test outputs 1, 2, and 5 for the UTF-8 encoded string above.
=end
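
As a possible aid to reading the UTF-8 results, the sketch below sorts the same characters by Unicode general category, using the \p{P} (Punctuation) and \p{S} (Symbol) properties. The variable names are hypothetical. Whether this distinction explains the reported behaviour is an assumption; the report itself does not say.

```ruby
# Report's test string, encoded as UTF-8 so Unicode properties apply.
sample = %q(`~!@#$%^&*()_+-=[]\{}|;':",./<>?).encode('UTF-8')

# \p{P} = general category Punctuation, \p{S} = general category Symbol.
punctuation = sample.each_char.select { |c| c =~ /\p{P}/ }.join
symbols     = sample.each_char.select { |c| c =~ /\p{S}/ }.join

puts "Punctuation (P*): '#{punctuation}'"
puts "Symbol (S*):      '#{symbols}'"
```

The nine characters `~$^+=|<> fall in the Symbol categories (Sm, Sc, Sk), which is the same set the report lists as unmatched by [[:punct:]] on UTF-8 strings.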

Related issues 1 (0 open — 1 closed)

Updated by naruse (Yui NARUSE) over 14 years ago
