Bug #21709: Regexp interpolation is inconsistent with String interpolation - Ruby - Ruby Issue Tracking System


Added by thyresias (Thierry Lambert) 4 months ago. Updated 17 days ago.

Status: Open
Assignee:
Target version:
ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x64-mingw-ucrt]
Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN

[ruby-core:123894]

Description

%w(foo être).each do |s|
  puts "string: #{s.inspect} -> #{s.encoding}"
  puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end

Output:

string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8

The result should always match the encoding of the argument.

Updated by jeremyevans0 (Jeremy Evans) 4 months ago
#1 [ruby-core:123895]

Status changed from Open to Feedback

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

    if (ascii_only) {
        rb_enc_associate(tmp, rb_usascii_encoding());
    }

US-ASCII strings will be automatically converted to UTF-8 if necessary:

("foo".encode("US-ASCII") + "\u1234").encoding
# => #<Encoding:UTF-8>
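
The implicit conversion can also be checked with Encoding.compatible?, which reports the encoding Ruby would use when combining two strings. This is only an illustrative sketch: the helper name combined_encoding is hypothetical and does not come from the thread or from Ruby itself.

```ruby
# Hypothetical helper (name is an assumption): returns the encoding Ruby
# would pick when combining two strings, or nil if they are incompatible.
def combined_encoding(left, right)
  Encoding.compatible?(left, right)
end

ascii   = "foo".encode(Encoding::US_ASCII)        # ASCII-only, tagged US-ASCII
unicode = "\u1234"                                # non-ASCII, tagged UTF-8
latin1  = "\u00EA".encode(Encoding::ISO_8859_1)   # non-ASCII, tagged ISO-8859-1

p combined_encoding(ascii, unicode)    # => #<Encoding:UTF-8>
p((ascii + unicode).encoding)          # => #<Encoding:UTF-8>
p combined_encoding(latin1, unicode)   # => nil (no common encoding)
```

The last call shows the limit of the rule: only ASCII-only content is promoted silently, while two strings that both hold non-ASCII data in different encodings cannot be combined.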

Does this behavior cause any problems in your application?

Updated by thyresias (Thierry Lambert) 4 months ago
#2 [ruby-core:123896]

Does this behavior cause any problems in your application?

Yes:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Updated by jeremyevans0 (Jeremy Evans) 4 months ago
#3 [ruby-core:123897]

Status changed from Feedback to Open

thyresias (Thierry Lambert) wrote in #note-2:

Does this behavior cause any problems in your application?

Yes:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with Regexp.escape. You can trigger the issue without Regexp.escape:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8

Triggering it seems to require specifying Unicode properties inside an interpolated string that isn't in UTF-8.

You get a different error without that Unicode character at the end:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/

Using Regexp.new instead of a literal Regexp may work around the issue:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
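
The same workaround can be wrapped in a small method to confirm that the resulting Regexp is usable. This is a minimal sketch, assuming the search pattern from the example above; the method name build_search_regexp and the sample strings are hypothetical.

```ruby
# Hypothetical helper (name is an assumption): builds the pattern with
# String interpolation first, then compiles it with Regexp.new.
def build_search_regexp(search_text)
  s_search = Regexp.escape(search_text)   # US-ASCII when search_text is ASCII-only
  re_prefix = /\p{In_Arabic}.+ /
  s_search.prepend re_prefix.source
  # String interpolation promotes the ASCII-only part to UTF-8 here,
  # so Regexp.new receives a single UTF-8 string.
  Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
end

re = build_search_regexp("foo")
p re.encoding                      # => #<Encoding:UTF-8>
p re.match?("\u0627\u0628 foo")    # => true  (two Arabic letters, a space, then "foo")
p re.match?("ab foo")              # => false (no Arabic prefix)
```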

Updated by thyresias (Thierry Lambert) 4 months ago
#4 [ruby-core:123898]

Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^

Updated by jeremyevans0 (Jeremy Evans) 4 months ago
#5 [ruby-core:123899]

thyresias (Thierry Lambert) wrote in #note-4:

Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.

In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having Regexp.escape return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:

# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>

# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>

# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>

Updated by thyresias (Thierry Lambert) 4 months ago
#6 [ruby-core:123903]

jeremyevans0 (Jeremy Evans) wrote in #note-5:

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.

Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

Updated by jeremyevans0 (Jeremy Evans) 4 months ago
#7 [ruby-core:123909]

thyresias (Thierry Lambert) wrote in #note-6:

Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

Sure, that sounds like a good idea.

Updated by thyresias (Thierry Lambert) 4 months ago
#8 [ruby-core:123931]

Subject changed from Inconsistent encoding by Regexp.escape to Regexp interpolation is inconsistent with String interpolation

jeremyevans0 (Jeremy Evans) wrote in #note-7:

Sure, that sounds like a good idea.

It seems I cannot change the description, just the title.
Should I open a new bug report?

As an aside, you said about the encoding of the result of Regexp.escape:

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

Updated by jeremyevans0 (Jeremy Evans) 4 months ago
#9 [ruby-core:123944]

thyresias (Thierry Lambert) wrote in #note-8:

jeremyevans0 (Jeremy Evans) wrote in #note-7:

Sure, that sounds like a good idea.

It seems I cannot change the description, just the title.
Should I open a new bug report?

Updating just the title is fine. I don't think you need to open a new bug report.

As an aside, you said about the encoding of the result of Regexp.escape:

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

The related line was last changed in 0f4199fb56ec12dae32a6fa099f15aaa7e55d10f. However, that appears to be a bug fix, and even before that, the function was designed to return US-ASCII for ASCII-only strings. Looks like the actual change was made in b2e60b2ce7a7cbcb8a67ac78606a18d3c2591d81. The reasoning given:

      (rb_reg_quote): return ascii-8bit string if the argument is
      ascii-only to generate encoding generic regexp if possible.
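
One way to see what an "encoding generic" regexp means in practice is Regexp#fixed_encoding?. The sketch below is an illustration, not taken from the commit: the helper name escaped_regexp and the EUC-JP sample string are assumptions.

```ruby
# Hypothetical helper (name is an assumption): compiles an escaped literal.
def escaped_regexp(text)
  Regexp.new(Regexp.escape(text))
end

generic = escaped_regexp("foo")    # ASCII-only input
fixed   = escaped_regexp("être")   # non-ASCII input, stays UTF-8

# A string with non-ASCII content in another ASCII-compatible encoding.
euc_jp = "\xA1\xA2 foo".dup.force_encoding(Encoding::EUC_JP)

p generic.fixed_encoding?   # => false
p generic.match?(euc_jp)    # => true
p fixed.fixed_encoding?     # => true

begin
  fixed.match?(euc_jp)
rescue Encoding::CompatibilityError
  puts "fixed-encoding regexp cannot match the EUC-JP string"
end
```

A regexp that is not tied to one encoding can match strings in any ASCII-compatible encoding, which appears to be the property the commit message refers to.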

Updated by thyresias (Thierry Lambert) 4 months ago
#10 [ruby-core:123945]

Ok.
Here is code that shows the inconsistency between Regexp and String interpolation, based on your examples:

# inconsistent Regexp/String interpolation behavior

prefix = '\p{In_Arabic}'
suffix = '\p{In_Arabic}'.encode('US-ASCII')

begin
  re = /#{prefix}#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
end

s = "#{prefix}#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}\p{In_Arabic}/ (UTF-8)

begin
  re = /#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: invalid character property name {In_Arabic}: /\p{In_Arabic}/ (RegexpError)
end

s = "#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}/ (UTF-8)

Updated by naruse (Yui NARUSE) 3 months ago
#11 [ruby-core:124136]

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8

This behavior looks like a bug.

Updated by Eregon (Benoit Daloze) 3 months ago
#12 [ruby-core:124210]

Right, I think Regexp interpolation should be closer to String interpolation. Currently it's its own separate thing with rather weird rules.
It reminds me of some other issues related to Regexp interpolation like #20407 and linked issues.

Updated by augustingbpe (Augustin Gottlieb) 17 days ago
#13 [ruby-core:124884]

Hi everyone, I made an attempt at fixing this issue in the PR linked below. I hope it helps, and it was also a way for me to dig deeper into the issue. All the tests are passing.

https://github.com/ruby/ruby/pull/16224
