Bug #14367

closed

Wrong interpretation of backslash C in regexp literals

Added by shyouhei (Shyouhei Urabe) over 4 years ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
ruby -v: ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
[ruby-core:84900]

Description

The following Ruby code returns nil.

% LC_ALL=C ruby -ve 'p(/\c\xFF/ =~ "\c\xFF")'
ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
nil

Is this intentional?


Related issues: 1 (0 open, 1 closed)

Related to Ruby master - Bug #18449: Bug in 3.1 regexp literals with \c (Rejected)

Updated by Hanmac (Hans Mackowiak) over 4 years ago

the problem is this:

/\c\xFF/.source == "\\c\\xFF"

which is already escaped

you might want this:

/#{"\c\xFF"}/ == /ƒ/

or use this:

Regexp.compile("\c\xFF")

PS: is it correct that I get this?

"\c\xFF" ==  "\x9F" #=> true

EDIT: this works

/\x9F/ =~ "\c\xFF" #=> 0
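The equality above is consistent with a control escape that masks its operand byte with 0x9F. A minimal sketch of that arithmetic follows, assuming the mask is all that happens when the operand is a `\x` escape. `control_byte` is a hypothetical helper for illustration, not part of Ruby:

```ruby
# Hypothetical helper: models what \c appears to do to its operand byte
# in string literals (clear bits 0x40 and 0x20, keep everything else).
def control_byte(byte)
  byte & 0x9f
end

# Compare the model against what the string literal actually produces.
p "\c\xFF".bytes == [control_byte(0xFF)]
p "\c\x7F".bytes == [control_byte(0x7F)]
```

Under this model 0xFF keeps its high bit (0x9F), while 0x7F becomes the plain ASCII control byte 0x1F, which may be why only the \xFF case exposes the difference.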

Updated by shyouhei (Shyouhei Urabe) over 4 years ago

Hanmac (Hans Mackowiak) wrote:

the problem is this:

/\c\xFF/.source == "\\c\\xFF"

No, I believe that isn't the problem. For instance /\c\x7F/ works.

% LC_ALL=C ruby -ve 'p(/\c\x7F/ =~ "\c\x7F")'
ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]
0

EDIT: this works

/\x9F/ =~ "\c\xFF" #=> 0

Yeah, that's why I titled this issue a "wrong interpretation of backslash C in regexp literals". This is about /...\c.../.

Updated by shyouhei (Shyouhei Urabe) over 2 years ago

Can I get an answer to my question ("Is this intentional?")?

Updated by naruse (Yui NARUSE) over 2 years ago

There seems to be an inconsistency between how regexp literals and Ruby string literals handle \c\xff:

%  LC_ALL=C ruby -ve 'p (/\c\xff/ =~ "\x1f")'
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
0

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

The behavior appears not to be intentional. This is a bug related to the fact that Ruby uses a recursive algorithm for strings (read_escape) but not for regexps (tokadd_escape). I've submitted a pull request to have control/meta handling for regexps use the same recursive algorithm used for strings, which fixes this issue: https://github.com/ruby/ruby/pull/4495
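To illustrate what "recursive" means here, the following is a toy model written in Ruby. It is not the actual read_escape from parse.y: the method names (`toy_read_escape`, `toy_read_operand`, `skip_dash`) and the overall structure are assumptions made for this sketch, and it only covers \x, \c, \C- and \M-. The point is that the operand of a control/meta escape may itself be an escape, which is resolved first and then modified:

```ruby
# Toy model only: names and structure are hypothetical, not parse.y.
# Takes the characters after a backslash and returns [byte, remaining_chars].
def toy_read_escape(chars)
  head, *rest = chars
  case head
  when "x"                                   # \xHH: two hex digits
    [rest.take(2).join.hex, rest.drop(2)]
  when "c"                                   # \cX: control modifier
    byte, rest = toy_read_operand(rest)
    [byte & 0x9f, rest]
  when "C"                                   # \C-X: control modifier
    byte, rest = toy_read_operand(skip_dash(rest))
    [byte & 0x9f, rest]
  when "M"                                   # \M-X: meta modifier
    byte, rest = toy_read_operand(skip_dash(rest))
    [byte | 0x80, rest]
  else                                       # anything else: taken literally
    [head.ord, rest]
  end
end

# Consumes the "-" that follows \C and \M.
def skip_dash(chars)
  raise ArgumentError, "expected '-'" unless chars.first == "-"
  chars.drop(1)
end

# The operand is either a plain character or another escape (the recursion).
def toy_read_operand(chars)
  if chars.first == "\\"
    toy_read_escape(chars.drop(1))
  else
    [chars.first.ord, chars.drop(1)]
  end
end

byte, = toy_read_escape('c\xFF'.chars)       # models \c\xFF
puts format("0x%02X", byte)                  # prints 0x9F
```

A non-recursive reader would instead copy the inner `\xFF` through unchanged and leave its interpretation to a later stage, which is one way the two code paths could end up disagreeing.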

Updated by jeremyevans (Jeremy Evans) over 1 year ago

  • Status changed from Open to Closed

Applied in changeset git|11ae581a4a7f5d5f5ec6378872eab8f25381b1b9.


Fix handling of control/meta escapes in literal regexps

Ruby uses a recursive algorithm for handling control/meta escapes
in strings (read_escape). However, the equivalent code for regexps
(tokadd_escape) did not use a recursive algorithm. Due to this,
handling of control/meta escapes in regexp did not have the same
behavior as in strings, leading to behavior such as the following
returning nil:

/\c\xFF/ =~ "\c\xFF"

Switch the code for handling \c, \C and \M in literal regexps to
use the same code as for strings (read_escape), to keep behavior
consistent between the two.

Fixes [Bug #14367]

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

Agreed that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

nobu (Nobuyoshi Nakada) wrote in #note-7:

Agreed that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

jeremyevans0 (Jeremy Evans) wrote in #note-8:

nobu (Nobuyoshi Nakada) wrote in #note-7:

Agreed that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?

My previous statement was incorrect. The reason it worked before is that \c behavior in regexps was wrong and did not result in the 8-bit character it should have. If you used a character resulting in a high bit, you did get the same error:

$ LANG=en_US.UTF-8 ruby -vce '/\M-a/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: too short escaped multibyte character: /\M-a/
-e:1: warning: possibly useless use of a literal in void context

You would also get an error if you created a regexp using a string instead of using a literal regexp:

$ LANG=en_US.UTF-8 ruby -ve '/#{s="\c\xff"}/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: warning: possibly useless use of a literal in void context
-e:1:in `<main>': invalid multibyte character (ArgumentError)

So I don't think anything is broken on UTF-8 (or other encodings). Before, it should have raised an error and it didn't because the incorrect algorithm resulted in the wrong character. Now it raises an error as it should.
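The encoding side of this can be checked without any regexp. This is a small sketch, assuming the string literal produces the single byte 0x9F as reported earlier in this thread. It shows that this byte on its own is not a valid UTF-8 or US-ASCII character, but is acceptable as binary data:

```ruby
# "\c\xFF" as a string literal is a single byte (0x9F per this thread).
# .dup avoids modifying a possibly frozen literal.
s = "\c\xFF".dup

utf8   = s.dup.force_encoding(Encoding::UTF_8)
ascii  = s.dup.force_encoding(Encoding::US_ASCII)
binary = s.dup.force_encoding(Encoding::BINARY)

p utf8.valid_encoding?    # a lone continuation byte is not valid UTF-8
p ascii.valid_encoding?   # the high bit is set, so not valid US-ASCII
p binary.valid_encoding?  # any byte is valid in a binary string
```

This only illustrates why a regexp containing that byte can be rejected in a UTF-8 source; it does not model the regexp compiler's own checks or the exact error messages quoted above.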

Updated by mame (Yusuke Endoh) 9 months ago

  • Related to Bug #18449: Bug in 3.1 regexp literals with \c added