Bug #21709
Regexp interpolation is inconsistent with String interpolation
Description
%w(foo être).each do |s|
  puts "string: #{s.inspect} -> #{s.encoding}"
  puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end
Output:
string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8
The result should always match the encoding of the argument.
Updated by jeremyevans0 (Jeremy Evans) 24 days ago
- Status changed from Open to Feedback
This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):
if (ascii_only) {
    rb_enc_associate(tmp, rb_usascii_encoding());
}
US-ASCII strings will be automatically converted to UTF-8 if necessary:
("foo".encode("US-ASCII") + "\u1234").encoding
# => #<Encoding:UTF-8>
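The implicit conversion described above can also be checked without building the combined string. The following is a minimal sketch using Encoding.compatible?, which returns the encoding two strings would share when combined, or nil when they cannot be combined. The sample strings are illustrative only.

```ruby
# ASCII-only content tagged US-ASCII, and non-ASCII content in UTF-8.
ascii = "foo".encode("US-ASCII")
utf8  = "\u1234"

# An ASCII-only string is compatible with any ASCII-compatible encoding,
# so the combination takes the encoding of the non-ASCII operand.
p Encoding.compatible?(ascii, utf8)      # => #<Encoding:UTF-8>

# Two strings that both hold non-ASCII bytes in different encodings
# have no common encoding.
p Encoding.compatible?("\xFF".b, utf8)   # => nil
```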
Does this behavior cause any problems in your application?
Updated by thyresias (Thierry Lambert) 24 days ago
Does this behavior cause any problems in your application?
Yes:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
Updated by jeremyevans0 (Jeremy Evans) 24 days ago
- Status changed from Feedback to Open
thyresias (Thierry Lambert) wrote in #note-2:
Does this behavior cause any problems in your application?
Yes:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with Regexp.escape. You can trigger the issue without Regexp.escape:
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8
It seems to require that you specify Unicode properties inside an interpolated string that isn't in UTF-8.
You get a different error without that unicode character at the end:
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/
Using Regexp.new instead of a literal Regexp may work around the issue:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
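Another possible workaround, sketched here under assumptions, is to retag the escaped string with the encoding of the original argument before interpolating it into a literal Regexp. The names escape_keeping_encoding and build_search_regexp are hypothetical helpers written for this illustration, not part of Ruby, and the sketch assumes the search text is in an ASCII-compatible encoding such as UTF-8.

```ruby
# Hypothetical helper, not part of Ruby: escape `text`, then retag the
# result with the encoding of `text` itself.
# Assumes `text` is in an ASCII-compatible encoding such as UTF-8.
def escape_keeping_encoding(text)
  Regexp.escape(text).force_encoding(text.encoding)
end

# Hypothetical wrapper around the failing example from this thread.
def build_search_regexp(search_text)
  s_search = escape_keeping_encoding(search_text)
  # Stands in for re_prefix.source; encoded explicitly so the sketch does
  # not depend on the script's source encoding.
  s_search.prepend('\p{In_Arabic}.+ '.encode("UTF-8"))
  # \u2013 is the en dash from the original lookbehind.
  /^#{s_search}|(?<=\u2013 |: )#{s_search}/
end

search_text = "foo".encode("UTF-8")
p escape_keeping_encoding(search_text).encoding   # => #<Encoding:UTF-8>
p build_search_regexp(search_text).encoding
```

Because every interpolated fragment is then tagged UTF-8, the literal should no longer see a US-ASCII fragment carrying a Unicode property next to a UTF-8 one.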
Updated by thyresias (Thierry Lambert) 24 days ago
Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^
Updated by jeremyevans0 (Jeremy Evans) 24 days ago
thyresias (Thierry Lambert) wrote in #note-4:
Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^
I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.
In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having Regexp.escape return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:
# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>
# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>
# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>
Updated by thyresias (Thierry Lambert) 23 days ago
jeremyevans0 (Jeremy Evans) wrote in #note-5:
I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.
Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?
Updated by jeremyevans0 (Jeremy Evans) 23 days ago
thyresias (Thierry Lambert) wrote in #note-6:
Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?
Sure, that sounds like a good idea.
Updated by thyresias (Thierry Lambert) 20 days ago
- Subject changed from Inconsistent encoding by Regexp.escape to Regexp interpolation is inconsistent with String interpolation
jeremyevans0 (Jeremy Evans) wrote in #note-7:
Sure, that sounds like a good idea.
It seems I cannot change the description, just the title.
Should I open a new bug report?
As an aside, you said about the encoding of the result of Regexp.escape:
This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):
What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...
Updated by jeremyevans0 (Jeremy Evans) 20 days ago
thyresias (Thierry Lambert) wrote in #note-8:
jeremyevans0 (Jeremy Evans) wrote in #note-7:
Sure, that sounds like a good idea.
It seems I cannot change the description, just the title.
Should I open a new bug report?
Updating just the title is fine. I don't think you need to open a new bug report.
As an aside, you said about the encoding of the result of Regexp.escape:
This is not a bug, it is deliberate behavior for ASCII-only strings in rb_reg_quote (internal function called by Regexp.escape):
What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...
The related line was last changed in 0f4199fb56ec12dae32a6fa099f15aaa7e55d10f. However, that appears to be a bug fix, and even before that, the function was designed to return US-ASCII for ASCII-only strings. Looks like the actual change was made in b2e60b2ce7a7cbcb8a67ac78606a18d3c2591d81. The reasoning given:
(rb_reg_quote): return ascii-8bit string if the argument is
ascii-only to generate encoding generic regexp if possible.
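As an illustration of what an "encoding generic" regexp means in practice, here is a minimal sketch using Regexp#fixed_encoding?. It only shows observable behavior with sample strings chosen for this example; it does not claim to reproduce the internals of rb_reg_quote.

```ruby
# A regexp compiled from ASCII-only escaped text is not tied to one encoding.
generic = Regexp.new(Regexp.escape("foo"))
# Non-ASCII text pins the regexp to the encoding of that text.
fixed = Regexp.new(Regexp.escape("\u00EAtre"))

p generic.fixed_encoding?   # => false
p fixed.fixed_encoding?     # => true

# The generic regexp can be matched against strings in different encodings.
p generic.match?("foo".encode("EUC-JP"))   # => true
p generic.match?("foo \u1234")             # => true
```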
Updated by thyresias (Thierry Lambert) 20 days ago
Ok.
Here is code showing the inconsistency between Regexp and String interpolation, based on your examples:
# inconsistent Regexp/String interpolation behavior
prefix = '\p{In_Arabic}'
suffix = '\p{In_Arabic}'.encode('US-ASCII')

begin
  re = /#{prefix}#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
end

s = "#{prefix}#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}\p{In_Arabic}/ (UTF-8)

begin
  re = /#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: invalid character property name {In_Arabic}: /\p{In_Arabic}/ (RegexpError)
end

s = "#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}/ (UTF-8)
Updated by naruse (Yui NARUSE) 7 days ago
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8
This behavior looks like a bug.
Updated by Eregon (Benoit Daloze) 3 days ago
Right, I think Regexp interpolation should be closer to String interpolation. Currently it's its own separate thing with rather weird rules.
It reminds me of some other issues related to Regexp interpolation like #20407 and linked issues.