Bug #19867

closed

Unicode line and paragraph separator are not stripped

Added by iainbeeston (Iain Beeston) about 2 years ago. Updated about 2 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
[ruby-core:114662]

Description

The Unicode line separator (U+2028) and paragraph separator (U+2029) are not removed by any of the strip methods:

"\u2028\u2029\u0000\t\n\v\f\r ".strip # => "\u2028\u2029"

I would have expected strip (and lstrip, rstrip) to remove Unicode whitespace as well. It looks like #7154 reported something similar, but for regular expressions and way back in Ruby 1.9.

I think fixing this should be simple (just checking for U+2028 and U+2029 in ctype.h), but I'm not sure whether it's supposed to behave this way or whether changing it could have unexpected consequences.

Updated by iainbeeston (Iain Beeston) about 2 years ago Actions #1 [ruby-core:114663]

I can see that the [[:space:]] regex class does match Unicode whitespace characters ("\u2028" =~ /[[:space:]]/ # => 0), but \s does not ("\u2028" =~ /\s/ # => nil).
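
For readers who want to reproduce the comparison, here is a minimal sketch that runs both character classes against the two separators. The variable names (`separators`, `results`) are illustrative only; the regex behavior is the one reported in this comment.

```ruby
# Compare the ASCII-only \s shorthand with the Unicode-aware POSIX bracket.
separators = { "LINE SEPARATOR" => "\u2028", "PARAGRAPH SEPARATOR" => "\u2029" }

results = separators.map do |name, char|
  row = {
    name: name,
    codepoint: format("U+%04X", char.ord),
    backslash_s: char.match?(/\s/),          # \s: not expected to match
    posix_space: char.match?(/[[:space:]]/)  # [[:space:]]: expected to match
  }
  puts row.inspect
  row
end
```

Both rows should show `backslash_s` as false and `posix_space` as true, which matches the =~ results quoted above.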

Updated by nobu (Nobuyoshi Nakada) about 2 years ago Actions #2 [ruby-core:114664]

Yes, \s, \w, etc. match only single-byte ASCII characters.
I don't think changing the default behavior is a good idea.
An optional (keyword) argument may be better.
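
To illustrate what an opt-in variant could look like from user code, here is a rough sketch. `strip_edges`, its `unicode:` keyword, and the `EDGE_UNICODE_SPACE` constant are hypothetical names made up for this example; String#strip itself takes no such argument, and this is not a proposed implementation.

```ruby
# Hypothetical helper; String#strip does not accept a keyword argument.
# The pattern covers [[:space:]] plus NUL, which String#strip also removes.
EDGE_UNICODE_SPACE = /\A[[:space:]\x00]+|[[:space:]\x00]+\z/

def strip_edges(str, unicode: false)
  # Default: keep the existing ASCII-only behavior.
  return str.strip unless unicode

  # Opt-in: also remove Unicode whitespace at either end.
  str.gsub(EDGE_UNICODE_SPACE, "")
end

sample = "\u2028\u2029\u0000\t\n\v\f\r "
p strip_edges(sample)                               # separators remain, as with sample.strip
p strip_edges(sample, unicode: true)                # => ""
p strip_edges("\u2028 hello\u2029 ", unicode: true) # => "hello"
```

Keeping the default path as a plain call to strip means existing callers see no change, which is the concern raised above about altering the default.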

Updated by nobu (Nobuyoshi Nakada) about 2 years ago Actions #3 [ruby-core:114665]

As for the implementation, changing ctype.h is not desirable.
There is already an rb_enc_isspace function for this purpose.

Updated by nobu (Nobuyoshi Nakada) about 2 years ago Actions #4

  • Status changed from Open to Rejected