Bug #3386

closed

Inconsistent regexp punct class matching behavior between UTF-8 and ASCII encodings

Added by jyeung (Jeffrey Yeung) over 14 years ago. Updated over 13 years ago.

Status:
Rejected
Assignee:
-
ruby -v:
ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
[ruby-core:30579]

Description

=begin
Scenario:

Use a Regexp pattern that includes the [:punct:] character class (or the \p{Punct} expression) on strings containing only standard punctuation characters `~!@#$%^&*()_+-=[]{}|;':",./<>?.

Issue:

The match results on UTF-8 encoded strings are unexpectedly different from those on ASCII encoded strings.

I have observed two issues:

  • The [[:punct:]] expression does not match characters `~$^+=|<> when applied to UTF-8 strings.
  • The \p{^Punct} and \P{Punct} expressions produce different results when applied to UTF-8 strings; the latter (\P{Punct}) seems to be incorrect.

To illustrate these, here is a bit of Ruby code:
# The inner single quote must be escaped inside a single-quoted literal.
teststr = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'
teststr2 = teststr.encode('UTF-8')
teststr3 = teststr.encode('ASCII-8BIT')

# Delete whatever each pattern matches and print what is left over.
def gsub_tests(teststr)
  puts "String (#{teststr.encoding}): '#{teststr}'"
  strout1 = teststr.gsub(/[[:punct:]]/, '')   # POSIX bracket
  strout2 = teststr.gsub(/[^[:punct:]]/, '')  # negated POSIX bracket
  strout3 = teststr.gsub(/\p{Punct}/, '')     # property
  strout4 = teststr.gsub(/\p{^Punct}/, '')    # negated property, caret form
  strout5 = teststr.gsub(/\P{Punct}/, '')     # negated property, uppercase form
  puts " Output 1 = '#{strout1}'"
  puts " Output 2 = '#{strout2}'"
  puts " Output 3 = '#{strout3}'"
  puts " Output 4 = '#{strout4}'"
  puts " Output 5 = '#{strout5}'"
end

gsub_tests(teststr)
gsub_tests(teststr2)
gsub_tests(teststr3)

Here is the output I observe when running the above code:
$ ruby test.rb
String (US-ASCII): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = ''
 Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
String (UTF-8): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = '`~$^+=|<>'
 Output 2 = '!@#%&*()_-[]\{};':",./?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '!@#%&*()_-[]\{};':",./?'
String (ASCII-8BIT): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 1 = ''
 Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 3 = ''
 Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
 Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'

Note test outputs 1, 2, and 5 for the UTF-8 encoded string above.
=end
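For anyone who needs all 32 ASCII punctuation characters to match regardless of the string's encoding, one possible workaround is to spell the ranges out instead of relying on [[:punct:]]. This is only a sketch: the helper names `ascii_punct_pattern` and `strip_ascii_punct` are made up for illustration, and the pattern deliberately ignores non-ASCII punctuation.

```ruby
# Hypothetical helper: explicit ranges covering the ASCII punctuation
# characters (0x21-0x2F, 0x3A-0x40, 0x5B-0x60, 0x7B-0x7E).
def ascii_punct_pattern
  /[!-\/:-@\[-`{-~]/
end

# Hypothetical helper: delete every ASCII punctuation character.
def strip_ascii_punct(str)
  str.gsub(ascii_punct_pattern, '')
end

sample = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'
%w[UTF-8 US-ASCII ASCII-8BIT].each do |enc|
  # force_encoding only relabels the bytes, which is safe for ASCII-only data.
  tagged = sample.dup.force_encoding(enc)
  puts "#{enc}: '#{strip_ascii_punct(tagged)}'"
end
```

Because the pattern contains only ASCII characters and no Unicode property, it should behave the same way for every ASCII-compatible encoding.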


Related issues 1 (0 open, 1 closed)

Is duplicate of Ruby master - Bug #3217: Regexp fails to match string with '<' when encoding is UTF-8 (Rejected, naruse (Yui NARUSE), 04/29/2010)
#1

Updated by naruse (Yui NARUSE) over 14 years ago

  • Status changed from Open to Rejected

=begin
This behavior comes from Unicode, so it is per spec.
http://www.unicode.org/reports/tr18/
=end
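To make the rejection reason concrete: in Unicode, the characters `~$^+=|<> belong to the Symbol general categories (Sc, Sk, Sm), not to Punctuation (Pc, Pd, Ps, Pe, Po). The sketch below checks this with the \p{P} and \p{S} properties on UTF-8 strings; the method names are hypothetical and exist only for this illustration.

```ruby
# Characters the report says [[:punct:]] did not match on UTF-8 strings.
def skipped_chars
  '`~$^+=|<>'.dup.force_encoding('UTF-8')
end

# Characters the report says [[:punct:]] did match on UTF-8 strings.
def matched_chars
  '!@#%&*()_-[]\{};\':",./?'.dup.force_encoding('UTF-8')
end

# Classify one character by its Unicode general category group.
def classify(ch)
  if ch =~ /\p{P}/
    :punctuation
  elsif ch =~ /\p{S}/
    :symbol
  else
    :other
  end
end

skipped_chars.each_char { |c| puts "#{c} -> #{classify(c)}" }
matched_chars.each_char { |c| puts "#{c} -> #{classify(c)}" }
```

Every character in the first group should be reported as a symbol and every character in the second as punctuation, which lines up with outputs 1 and 2 for the UTF-8 string.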
