Bug #3386
closed
Inconsistent regexp punct class matching behavior between UTF-8 and ASCII encodings
Description
=begin
Scenario:
Use a Regexp pattern that includes the [:punct:] character class (or the \p{Punct} expression) on strings containing only standard punctuation characters `~!@#$%^&*()_+-=[]{}|;':",./<>?.
Issue:
The match results on UTF-8 encoded strings are unexpectedly different from those on ASCII encoded strings.
I have observed two issues:
- The [[:punct:]] expression does not match characters `~$^+=|<> when applied to UTF-8 strings.
- The \p{^Punct} and the \P{Punct} expressions indicate different results when applied to UTF-8 strings - the latter (\P{Punct}) seems to be incorrect.
To illustrate these, here is a bit of Ruby code:
  # The embedded single quote is escaped so the single-quoted literal parses.
  teststr = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'
  teststr2 = teststr.encode('UTF-8')
  teststr3 = teststr.encode('ASCII-8BIT')

  # Strip matches of each pattern and print what is left of the string.
  def gsub_tests(teststr)
    puts "String (#{teststr.encoding}): '#{teststr}'"
    strout1 = teststr.gsub(/[[:punct:]]/, '')
    strout2 = teststr.gsub(/[^[:punct:]]/, '')
    strout3 = teststr.gsub(/\p{Punct}/, '')
    strout4 = teststr.gsub(/\p{^Punct}/, '')
    strout5 = teststr.gsub(/\P{Punct}/, '')
    puts " Output 1 = '#{strout1}'"
    puts " Output 2 = '#{strout2}'"
    puts " Output 3 = '#{strout3}'"
    puts " Output 4 = '#{strout4}'"
    puts " Output 5 = '#{strout5}'"
  end

  gsub_tests(teststr)
  gsub_tests(teststr2)
  gsub_tests(teststr3)
Here is output I observe when running the above code:
  $ ruby test.rb
  String (US-ASCII): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 1 = ''
   Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 3 = ''
   Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
  String (UTF-8): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 1 = '`~$^+=|<>'
   Output 2 = '!@#%&*()_-[]\{};':",./?'
   Output 3 = ''
   Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 5 = '!@#%&*()_-[]\{};':",./?'
  String (ASCII-8BIT): '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 1 = ''
   Output 2 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 3 = ''
   Output 4 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
   Output 5 = '`~!@#$%^&*()_+-=[]\{}|;':",./<>?'
Note test outputs 1, 2, and 5 for the UTF-8 encoded string above.
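The gsub outputs above can be hard to compare by eye. A minimal sketch of a more direct check, assuming a hypothetical helper named unmatched_chars (not part of the original test script), lists the characters a pattern fails to match in each encoding. The result for UTF-8 may differ between Ruby versions, so no expected output is shown.

```ruby
# Hypothetical helper (an assumption, not from the original report):
# returns the characters of str that pattern does not match.
def unmatched_chars(str, pattern)
  str.each_char.reject { |ch| ch =~ pattern }.join
end

# Same punctuation characters as the test string in the report.
sample = '`~!@#$%^&*()_+-=[]\{}|;\':",./<>?'

%w[US-ASCII UTF-8 ASCII-8BIT].each do |enc_name|
  encoded = sample.encode(enc_name)
  missed = unmatched_chars(encoded, /[[:punct:]]/)
  puts "#{enc_name}: [[:punct:]] misses '#{missed}'"
end
```

Checking one character at a time avoids having to diff two long strings, and the same helper can be called with /\p{Punct}/ to compare the property form.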
=end
Updated by naruse (Yui NARUSE) over 14 years ago
- Status changed from Open to Rejected
=begin
This behavior comes from Unicode, so it is per spec.
http://www.unicode.org/reports/tr18/
=end
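The rejection can be checked against Unicode general categories. The nine characters that [[:punct:]] skipped on UTF-8 strings are classified by Unicode as symbols (categories Sc, Sk, Sm), not as punctuation (categories P*). The sketch below uses the general category properties \p{P} and \p{S}; the constant and method names are illustrative assumptions, not part of the report.

```ruby
# Characters the report says [[:punct:]] did not match on UTF-8 strings.
UNMATCHED = '`~$^+=|<>'.encode('UTF-8')

# Illustrative helper (an assumption): classify one character by its
# Unicode general category group, using \p{P} (Punctuation) and
# \p{S} (Symbol).
def general_category_group(ch)
  if ch =~ /\p{P}/
    :punctuation
  elsif ch =~ /\p{S}/
    :symbol
  else
    :other
  end
end

groups = UNMATCHED.each_char.group_by { |ch| general_category_group(ch) }
groups.each { |group, chars| puts "#{group}: #{chars.join}" }
```

Running the same helper over the rest of the test string should place the remaining characters, such as ! and _, in the punctuation group, which lines up with Output 2 for the UTF-8 string.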