Bug #21709: Regexp interpolation is inconsistent with String interpolation

Added by thyresias (Thierry Lambert) 24 days ago. Updated 3 days ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x64-mingw-ucrt]
[ruby-core:123894]

Description

%w(foo être).each do |s|
  puts "string: #{s.inspect} -> #{s.encoding}"
  puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end

Output:

string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8

The result should always match the encoding of the argument.

Updated by jeremyevans0 (Jeremy Evans) 24 days ago Actions #1 [ruby-core:123895]

  • Status changed from Open to Feedback

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

    if (ascii_only) {
        rb_enc_associate(tmp, rb_usascii_encoding());
    }

US-ASCII strings will be automatically converted to UTF-8 if necessary:

("foo".encode("US-ASCII") + "\u1234").encoding
# => #<Encoding:UTF-8>
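
To see the implicit conversion described above in isolation, here is a minimal sketch using Encoding.compatible?, which reports the encoding two strings would have when combined. The variable names are only illustrative, and the binary string at the end is an added example of a pair that cannot be combined.

```ruby
# Sketch: how a US-ASCII string and a UTF-8 string combine.
ascii = "foo".encode("US-ASCII")   # ASCII-only content, tagged US-ASCII
utf8  = "\u1234"                   # non-ASCII content, tagged UTF-8

# Encoding.compatible? returns the encoding the combination would have,
# or nil when the two operands cannot be combined.
combined_encoding = Encoding.compatible?(ascii, utf8)
puts combined_encoding.inspect

# Concatenation and String interpolation follow the same rule.
concatenated = ascii + utf8
interpolated = "#{ascii}#{utf8}"
puts concatenated.encoding.inspect
puts interpolated.encoding.inspect

# Two strings that both contain non-ASCII bytes in different encodings
# are not compatible.
binary = "\xFF".b
puts Encoding.compatible?(utf8, binary).inspect
```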

Does this behavior cause any problems in your application?

Updated by thyresias (Thierry Lambert) 24 days ago Actions #2 [ruby-core:123896]

Does this behavior cause any problems in your application?

Yes:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Updated by jeremyevans0 (Jeremy Evans) 24 days ago Actions #3 [ruby-core:123897]

  • Status changed from Feedback to Open

thyresias (Thierry Lambert) wrote in #note-2:

Does this behavior cause any problems in your application?

Yes:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with Regexp.escape. You can trigger the issue without Regexp.escape:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8

Triggering it seems to require a Unicode property inside an interpolated string that isn't in UTF-8.

You get a different error without that unicode character at the end:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/

Using Regexp.new instead of a literal Regexp may work around the issue:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
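
Another possible workaround, sketched here under the assumption that the mismatch comes only from the US-ASCII tag on the escaped fragment, is to convert the result of Regexp.escape back to the encoding of its input before interpolating it into a Regexp literal. The helper name escape_keeping_encoding is hypothetical, not part of Ruby, and the Arabic sample text is made up for the demonstration.

```ruby
# Hypothetical helper (not part of Ruby): escape a string for use in a
# Regexp while keeping the encoding of the input string.
def escape_keeping_encoding(str)
  escaped = Regexp.escape(str)
  # Regexp.escape tags ASCII-only results as US-ASCII; convert back if needed.
  escaped.encoding == str.encoding ? escaped : escaped.encode(str.encoding)
end

search_text = "foo"
s_search = escape_keeping_encoding(search_text)   # UTF-8, like search_text
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend(re_prefix.source)

# Every interpolated fragment is now UTF-8, like the literal itself.
re = /^#{s_search}|(?<=– |: )#{s_search}/

# Made-up sample: five Arabic letters, a space, then the search text.
sample = "\u0645\u0631\u062D\u0628\u0627 foo"
puts s_search.encoding
puts re.encoding
puts re.match?(sample)
```

This only sidesteps the problem on the caller's side; it does not change how Regexp literals treat interpolated fragments.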

Updated by thyresias (Thierry Lambert) 24 days ago Actions #4 [ruby-core:123898]

Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^

Updated by jeremyevans0 (Jeremy Evans) 24 days ago Actions #5 [ruby-core:123899]

thyresias (Thierry Lambert) wrote in #note-4:

Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.

In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having Regexp.escape return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:

# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>

# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>

# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>
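
As a small sketch of why these US-ASCII results are usually harmless, the following combines each of them with UTF-8 text. The variable names are illustrative only.

```ruby
# Values that Ruby tags as US-ASCII by default.
sym_name = :a.to_s
joined   = [].join
pattern  = //

[sym_name, joined, pattern].each do |obj|
  puts "#{obj.inspect}: #{obj.encoding}"
end

# Combining them with UTF-8 text yields UTF-8 without any explicit conversion.
concatenated = sym_name + "être"
rejoined     = [joined, "être"].join
puts concatenated.encoding
puts rejoined.encoding

# A US-ASCII Regexp literal is not tied to one encoding, so it can be
# matched against a UTF-8 string.
puts pattern.fixed_encoding?
puts pattern.match?("être")
```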

Updated by thyresias (Thierry Lambert) 23 days ago Actions #6 [ruby-core:123903]

jeremyevans0 (Jeremy Evans) wrote in #note-5:

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.

Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

Updated by jeremyevans0 (Jeremy Evans) 23 days ago Actions #7 [ruby-core:123909]

thyresias (Thierry Lambert) wrote in #note-6:

Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

Sure, that sounds like a good idea.

Updated by thyresias (Thierry Lambert) 20 days ago Actions #8 [ruby-core:123931]

  • Subject changed from Inconsistent encoding by Regexp.escape to Regexp interpolation is inconsistent with String interpolation

jeremyevans0 (Jeremy Evans) wrote in #note-7:

Sure, that sounds like a good idea.

It seems I cannot change the description, just the title.
Should I open a new bug report?

As an aside, you said about the encoding of the result of Regexp.escape:

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

Updated by jeremyevans0 (Jeremy Evans) 20 days ago Actions #9 [ruby-core:123944]

thyresias (Thierry Lambert) wrote in #note-8:

jeremyevans0 (Jeremy Evans) wrote in #note-7:

Sure, that sounds like a good idea.

It seems I cannot change the description, just the title.
Should I open a new bug report?

Updating just the title is fine. I don't think you need to open a new bug report.

As an aside, you said about the encoding of the result of Regexp.escape:

This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):

What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

The related line was last changed in 0f4199fb56ec12dae32a6fa099f15aaa7e55d10f. However, that appears to be a bug fix, and even before that, the function was designed to return US-ASCII for ASCII-only strings. Looks like the actual change was made in b2e60b2ce7a7cbcb8a67ac78606a18d3c2591d81. The reasoning given:

      (rb_reg_quote): return ascii-8bit string if the argument is
      ascii-only to generate encoding generic regexp if possible.
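
A rough sketch of what an "encoding generic" regexp means in practice: a Regexp built from ASCII-only source is not fixed to one encoding and can be matched against strings in several ASCII-compatible encodings, while one built from non-ASCII source is fixed to that encoding. The sample strings and variable names below are made up for illustration.

```ruby
# Regexp built from an ASCII-only escaped string: not tied to an encoding.
generic = Regexp.new(Regexp.escape("foo"))
# Regexp built from a non-ASCII escaped string: fixed to UTF-8.
fixed = Regexp.new(Regexp.escape("être"))

puts "generic fixed_encoding? #{generic.fixed_encoding?}"
puts "fixed   fixed_encoding? #{fixed.fixed_encoding?}"

# The same text in two other ASCII-compatible encodings.
euc_jp = "日本語 foo".encode("EUC-JP")
sjis   = "日本語 foo".encode("Windows-31J")

# The generic regexp can be applied to both.
puts generic.match?(euc_jp)
puts generic.match?(sjis)

# The fixed regexp is expected to reject a non-ASCII string in another encoding.
begin
  puts fixed.match?(euc_jp)
rescue Encoding::CompatibilityError => e
  puts "fixed regexp vs EUC-JP: #{e.message}"
end
```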

Updated by thyresias (Thierry Lambert) 20 days ago Actions #10 [ruby-core:123945]

Ok.
Here is code that shows the inconsistency between Regexp and String interpolation, based on your examples:

# inconsistent Regexp/String interpolation behavior

prefix = '\p{In_Arabic}'
suffix = '\p{In_Arabic}'.encode('US-ASCII')

begin
  re = /#{prefix}#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
end

s = "#{prefix}#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}\p{In_Arabic}/ (UTF-8)

begin
  re = /#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: invalid character property name {In_Arabic}: /\p{In_Arabic}/ (RegexpError)
end

s = "#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}/ (UTF-8)

Updated by naruse (Yui NARUSE) 7 days ago Actions #11 [ruby-core:124136]

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8

This behavior looks like a bug.

Updated by Eregon (Benoit Daloze) 3 days ago Actions #12 [ruby-core:124210]

Right, I think Regexp interpolation should be closer to String interpolation; currently it's its own separate thing with rather weird rules.
It reminds me of some other issues related to Regexp interpolation, like #20407 and its linked issues.
