Bug #21709
Inconsistent encoding by Regexp.escape
Description
%w(foo être).each do |s|
puts "string: #{s.inspect} -> #{s.encoding}"
puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end
Output:
string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8
The result should always match the encoding of the argument.
Updated by jeremyevans0 (Jeremy Evans) about 11 hours ago
- Status changed from Open to Feedback
This is not a bug; it is deliberate behavior for ASCII-only strings in rb_reg_quote (the internal function called by Regexp.escape):
if (ascii_only) {
rb_enc_associate(tmp, rb_usascii_encoding());
}
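The effect of that branch can be modelled at the Ruby level. This is only a sketch: expected_escape_encoding is a hypothetical helper written for illustration, and it assumes the ascii_only flag in the C code corresponds to String#ascii_only? on the argument.

```ruby
# Hypothetical model of the branch quoted above: ASCII-only input is tagged
# US-ASCII, anything else keeps the encoding of the argument.
def expected_escape_encoding(str)
  str.ascii_only? ? Encoding::US_ASCII : str.encoding
end

samples = ["foo", "a.b", "\u00EAtre"]
samples.each do |s|
  actual = Regexp.escape(s).encoding
  puts "#{s.inspect}: expected #{expected_escape_encoding(s)}, got #{actual}"
end
```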
US-ASCII strings will be automatically converted to UTF-8 if necessary:
("foo".encode("US-ASCII") + "\u1234").encoding
# => #<Encoding:UTF-8>
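For ordinary string operations, the compatibility can be checked before combining the strings. A minimal sketch using Encoding.compatible?, which returns the encoding the combination would use, or nil if the two cannot be combined:

```ruby
ascii = "foo".encode("US-ASCII")
utf8  = "\u1234"

# A 7-bit US-ASCII string combined with a UTF-8 string yields UTF-8.
compat = Encoding.compatible?(ascii, utf8)
joined = ascii + utf8

puts compat
puts joined.encoding

# ASCII-only strings also compare equal across the two encodings.
puts Regexp.escape("foo") == "foo"
```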
Does this behavior cause any problems in your application?
Updated by thyresias (Thierry Lambert) about 10 hours ago
Does this behavior cause any problems in your application?
Yes:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
Updated by jeremyevans0 (Jeremy Evans) about 9 hours ago
- Status changed from Feedback to Open
thyresias (Thierry Lambert) wrote in #note-2:
Does this behavior cause any problems in your application?
Yes:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with Regexp.escape. You can trigger the issue without Regexp.escape:
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8
Triggering it seems to require a Unicode property inside an interpolated string that is not UTF-8.
You get a different error without the non-ASCII Unicode character at the end:
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/
Using Regexp.new instead of a literal Regexp may work around the issue:
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
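Another possible workaround, if the literal Regexp has to stay, is to retag the escaped string with the encoding of the argument before interpolating it. This is a sketch only: escape_preserving_encoding is a hypothetical helper, not part of Ruby, and it relies on the assumption that ASCII-only bytes are valid in any ASCII-compatible encoding.

```ruby
# Hypothetical helper (not part of Ruby): escape a string, then give the
# result the same encoding as the argument. The encodings only differ when
# Regexp.escape tagged an ASCII-only result as US-ASCII, so retagging is safe
# for ASCII-compatible encodings.
def escape_preserving_encoding(str)
  escaped = Regexp.escape(str)
  if escaped.encoding != str.encoding && str.encoding.ascii_compatible?
    escaped.force_encoding(str.encoding)
  end
  escaped
end

search_text = "foo"
s_search = escape_preserving_encoding(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend(re_prefix.source)

# \u2013 is the en dash used in the original example.
re = /^#{s_search}|(?<=\u2013 |: )#{s_search}/
puts re.encoding
```

The interpolated string containing \p{In_Arabic} is now UTF-8, so it no longer conflicts with the non-ASCII character in the literal part of the pattern.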
Updated by thyresias (Thierry Lambert) about 7 hours ago
Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^
Updated by jeremyevans0 (Jeremy Evans) about 6 hours ago
thyresias (Thierry Lambert) wrote in #note-4:
Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^
I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.
In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having Regexp.escape return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:
# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>
# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>
# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>