Bug #3386

Inconsistent regexp punct class matching behavior between UTF-8 and ASCII encodings

Added by jyeung (Jeffrey Yeung) over 9 years ago. Updated over 8 years ago.

Status: Rejected
Priority: Normal
Assignee: -
ruby -v: ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
[ruby-core:30579]

Description

=begin
Scenario:


Use a Regexp pattern that includes the [:punct:] character class (or the \p{Punct} expression) on strings containing only standard punctuation characters: `~!@#$%^&*()_+-=[]{}|;':",./<>?

Issue:


The match results on UTF-8 encoded strings are unexpectedly different from those on ASCII encoded strings.

I have observed two issues:

  • The [[:punct:]] expression does not match the characters `~$^+=|<> when applied to UTF-8 strings.
  • The \p{Punct} and \P{Punct} expressions give mutually inconsistent results when applied to UTF-8 strings; the latter (\P{Punct}) seems to be incorrect.

To illustrate these, here is a bit of Ruby code:
teststr = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'
teststr2 = teststr.encode('UTF-8')
teststr3 = teststr.encode('ASCII-8BIT')

# Strip matched characters with each pattern and print what is left.
def gsub_tests(teststr)
  puts "String (#{teststr.encoding}): '#{teststr}'"
  strout1 = teststr.gsub(/[[:punct:]]/, '')   # remove POSIX punctuation
  strout2 = teststr.gsub(/[^[:punct:]]/, '')  # remove everything except POSIX punctuation
  strout3 = teststr.gsub(/\p{Punct}/, '')     # remove characters with the Punct property
  strout4 = teststr.gsub(/\p{^Punct}/, '')    # remove characters without it (negated property)
  strout5 = teststr.gsub(/\P{Punct}/, '')     # same negation, written as \P
  puts " Output 1 = '#{strout1}'"
  puts " Output 2 = '#{strout2}'"
  puts " Output 3 = '#{strout3}'"
  puts " Output 4 = '#{strout4}'"
  puts " Output 5 = '#{strout5}'"
end

gsub_tests(teststr)
gsub_tests(teststr2)
gsub_tests(teststr3)

Here is output I observe when running the above code:
$ ruby test.rb
String (US-ASCII): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 1 = ''
Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 3 = ''
Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
String (UTF-8): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 1 = '`~$^+=|<>'
Output 2 = '!@#%&*()_-[]\{};':",./?'
Output 3 = ''
Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 5 = '!@#%&*()_-[]\{};':",./?'
String (ASCII-8BIT): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 1 = ''
Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 3 = ''
Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'

Note outputs 1, 2, and 5 for the UTF-8 encoded string above: they differ from the corresponding results for the US-ASCII and ASCII-8BIT strings.
=end
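A per-character breakdown can make it easier to see exactly which characters a given pattern misses. The following is only a sketch: `unmatched_chars` is a hypothetical helper, not part of the report, and the character list is assumed to be the ASCII punctuation set from the scenario. Output is deliberately not shown, since the result for [[:punct:]] may depend on the Ruby version in use.

```ruby
# Hypothetical helper (not from the original report): returns the characters
# of +str+ that +pattern+ fails to match, tested one character at a time.
def unmatched_chars(str, pattern)
  str.chars.reject { |c| c =~ pattern }
end

# Assumed test set: the ASCII punctuation characters from the scenario.
test_chars = '`~!@#$%^&*()_+-=[]{}|;\':",./<>?'.encode('UTF-8')

patterns = {
  '[[:punct:]]' => /[[:punct:]]/, # POSIX bracket expression
  '\p{P}'       => /\p{P}/        # Unicode general category "Punctuation"
}

patterns.each do |label, pattern|
  puts "#{label} misses: #{unmatched_chars(test_chars, pattern).join}"
end
```

Testing one character at a time avoids having to compare gsub results by eye.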


Related issues

Is duplicate of Ruby master - Bug #3217: Regexp fails to match string with '<' when encoding is UTF-8 (Rejected, 04/29/2010)

History

#1

Updated by naruse (Yui NARUSE) over 9 years ago

  • Status changed from Open to Rejected

=begin
This behavior comes from Unicode, so it is per spec.
http://www.unicode.org/reports/tr18/
=end
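To illustrate the point about Unicode: in the Unicode general categories, the characters `~$^+=|<> are classified as symbols (S*) rather than punctuation (P*). A minimal sketch follows, assuming a UTF-8 string and a Ruby version that supports general-category properties such as \p{Sm}; `category_of` is a hypothetical helper introduced only for this example.

```ruby
# Hypothetical helper: reports which of a few Unicode general categories
# a single character belongs to. Only the categories relevant here are checked.
def category_of(char)
  patterns = {
    'P'  => /\p{P}/,  # any punctuation
    'Sm' => /\p{Sm}/, # math symbol
    'Sc' => /\p{Sc}/, # currency symbol
    'Sk' => /\p{Sk}/  # modifier symbol
  }
  found = patterns.find { |_name, re| char.encode('UTF-8') =~ re }
  found ? found.first : 'other'
end

# The characters the report says [[:punct:]] misses, plus two that it matches.
'`~$^+=|<>!?'.encode('UTF-8').each_char do |c|
  puts "#{c.inspect} => #{category_of(c)}"
end
```

Under these assumptions, the characters missed in the UTF-8 case all fall into a symbol category, while ! and ? report as punctuation.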
