Bug #14367: Wrong interpretation of backslash C in regexp literals - Ruby - Ruby Issue Tracking System


Added by shyouhei (Shyouhei Urabe) over 7 years ago. Updated about 4 years ago.

Status: Closed
Assignee:
Target version:
ruby -v: ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
[ruby-core:84900]
Tags: regexp

Description

The following Ruby code prints nil.

% LC_ALL=C ruby -ve 'p(/\c\xFF/ =~ "\c\xFF")'
ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
nil

Is this intentional?

Related issues 1 (0 open — 1 closed)

#1 [ruby-core:84904]

Updated by Hanmac (Hans Mackowiak) over 7 years ago

the problem is this:

/\c\xFF/.source == "\\c\\xFF"

which is already escaped

you might want this:

/#{"\c\xFF"}/ == /ƒ/

or use this:

Regexp.compile("\c\xFF")

PS: Is it correct that I get this?

"\c\xFF" ==  "\x9F" #=> true

EDIT: this works

/\x9F/ =~ "\c\xFF" #=> 0
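The string-side result above can be checked byte by byte. The following is a minimal sketch, assuming the control escape clears bits 5 and 6 of the following byte (a mask of 0x9F), which is consistent with "\c\xFF" == "\x9F". The variable names are illustrative only, and the regexp is built from a binary string to avoid depending on how a regexp literal is parsed.

```ruby
# The string literal "\c\xFF" is a single byte; .b views it as binary.
control = "\c\xFF".b
p control.bytes                 # => [159], i.e. 0x9F

# Assumption: the control escape masks the following byte with 0x9F.
p (0xFF & 0x9F).to_s(16)        # => "9f"
p (0x7F & 0x9F).to_s(16)        # => "1f"

# Build the regexp from the binary string instead of a /.../ literal.
pattern = Regexp.new(control)
p pattern =~ control            # => 0
```

Working in binary (ASCII-8BIT) sidesteps the question of whether the lone byte 0x9F is valid in the source encoding.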

#2 [ruby-core:84905]

Updated by shyouhei (Shyouhei Urabe) over 7 years ago

Hanmac (Hans Mackowiak) wrote:

the problem is this:
/\c\xFF/.source == "\\c\\xFF"

No, I believe that isn't the problem. For instance /\c\x7F/ works.

% LC_ALL=C ruby -ve 'p(/\c\x7F/ =~ "\c\x7F")'
ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]
0

EDIT: this works
/\x9F/ =~ "\c\xFF" #=> 0

Yeah, that's why I titled this issue a "wrong interpretation of backslash C in regexp literals". This is about /...\c.../.

#3 [ruby-core:97994]

Updated by shyouhei (Shyouhei Urabe) over 5 years ago

Could I get an answer to my question ("Is this intentional?")?

#4 [ruby-core:98181]

Updated by naruse (Yui NARUSE) over 5 years ago

It looks like \c\xff is handled inconsistently between regexp literals and Ruby string literals:

%  LC_ALL=C ruby -ve 'p (/\c\xff/ =~ "\x1f")'
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
0

#5 [ruby-core:103807]

Updated by jeremyevans0 (Jeremy Evans) about 4 years ago

The behavior appears not to be intentional. This is a bug related to the fact that Ruby uses a recursive algorithm for strings (read_escape) but not for regexps (tokadd_escape). I've submitted a pull request to have control/meta handling for regexps use the same recursive algorithm used for strings, which fixes this issue: https://github.com/ruby/ruby/pull/4495
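To illustrate what "recursive" means here, the following is a simplified Ruby model of a control/meta escape reader. It is a sketch only, not the parse.y code: the method names read_escape_sketch, next_byte and decode_escape are hypothetical, only \x, \c, \C- and \M- are modeled, and the masks (0x9F for control, 0x80 for meta) are assumptions chosen to agree with the string results reported in this thread.

```ruby
# Hypothetical, simplified model of a recursive escape reader.
# `chars` holds the characters that follow a backslash.
def read_escape_sketch(chars)
  case (c = chars.shift)
  when "x"                         # \xHH: two hex digits
    chars.shift(2).join.to_i(16)
  when "c", "C"                    # \cX or \C-X
    chars.shift if c == "C"        # skip the "-" of \C-X
    if chars.first == "?"          # \c? is DEL
      chars.shift
      return 0x7F
    end
    next_byte(chars) & 0x9F        # assumption: control mask
  when "M"                         # \M-X
    chars.shift                    # skip the "-"
    (next_byte(chars) & 0xFF) | 0x80
  else
    c.ord
  end
end

# The recursive step: if the operand is itself an escape, decode it
# first, then let the caller apply its control/meta mask.
def next_byte(chars)
  c = chars.shift
  c == "\\" ? read_escape_sketch(chars) : c.ord
end

# Decode a single escape sequence given as raw source text.
def decode_escape(source)
  read_escape_sketch(source.chars.drop(1))  # drop the leading backslash
end

p decode_escape('\c\xFF').to_s(16)  # => "9f"
p decode_escape('\c\x7F').to_s(16)  # => "1f"
p decode_escape('\M-\ca').to_s(16)  # => "81"
```

A non-recursive reader would treat the operand of \c as a literal character rather than decoding a nested escape first, which is the kind of mismatch described above.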

Updated by jeremyevans (Jeremy Evans) about 4 years ago

Status changed from Open to Closed

Applied in changeset git|11ae581a4a7f5d5f5ec6378872eab8f25381b1b9.

Fix handling of control/meta escapes in literal regexps

Ruby uses a recursive algorithm for handling control/meta escapes
in strings (read_escape). However, the equivalent code for regexps
(tokadd_escape) did not use a recursive algorithm. Due to this,
handling of control/meta escapes in regexps did not have the same
behavior as in strings, leading to behavior such as the following
returning nil:

/\c\xFF/ =~ "\c\xFF"

Switch the code for handling \c, \C and \M in literal regexps to
use the same code as for strings (read_escape), to keep behavior
consistent between the two.

Fixes [Bug #14367]

#7 [ruby-core:103814]

Updated by nobu (Nobuyoshi Nakada) about 4 years ago

I agree that the previous behavior might not have been intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

#8 [ruby-core:103815]

Updated by jeremyevans0 (Jeremy Evans) about 4 years ago

nobu (Nobuyoshi Nakada) wrote in #note-7:

I agree that the previous behavior might not have been intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.
$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

The previous behavior also ended up with a regexp that matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better.

#9 [ruby-core:103836]

Updated by jeremyevans0 (Jeremy Evans) about 4 years ago

jeremyevans0 (Jeremy Evans) wrote in #note-8:

nobu (Nobuyoshi Nakada) wrote in #note-7:
I agree that the previous behavior might not have been intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.
$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context
The previous behavior also ended up with a regexp that matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better.

My previous statement was incorrect. The reason it worked before is that \c behavior in regexps was wrong and did not result in the 8-bit character it should have. If you used a character resulting in a high bit, you did get the same error:

$ LANG=en_US.UTF-8 ruby -vce '/\M-a/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: too short escaped multibyte character: /\M-a/
-e:1: warning: possibly useless use of a literal in void context

You would also get an error if you created a regexp using a string instead of using a literal regexp:

$ LANG=en_US.UTF-8 ruby -ve '/#{s="\c\xff"}/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: warning: possibly useless use of a literal in void context
-e:1:in `<main>': invalid multibyte character (ArgumentError)

So I don't think anything is broken on UTF-8 (or other encodings). Before, it should have raised an error and it didn't because the incorrect algorithm resulted in the wrong character. Now it raises an error as it should.
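The error can also be observed without a regexp literal. The following is a minimal sketch, assuming the pattern string carries the lone byte 0x9F tagged as UTF-8; the variable names are illustrative. The exact exception class may differ between construction paths (the interpolated form above reports ArgumentError), so the sketch rescues StandardError and does not assert a specific class.

```ruby
# A lone 0x9F byte tagged as UTF-8 is not a valid UTF-8 sequence.
invalid_utf8 = "\x9F".dup.force_encoding(Encoding::UTF_8)
p invalid_utf8.valid_encoding?   # => false

# Building a regexp from it is rejected instead of silently producing
# a pattern that matches some other byte.
error = begin
  Regexp.new(invalid_utf8)
  nil
rescue StandardError => e
  e
end
p error.class
puts error.message
```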

#10

Updated by mame (Yusuke Endoh) over 3 years ago

Related to Bug #18449: Bug in 3.1 regexp literals with \c added
