Bug #14586
closedURI::RFC2396_Parser#unescape raises an exception if the input is mixed Unicode and percent-escapes
Description
Currently, the following test case passes:
def test_unescape
p1 = URI::Parser.new
assert_equal("\xe3\x83\x90", p1.unescape("\xe3\x83\x90"))
assert_equal("\xe3\x83\x90", p1.unescape('%e3%83%90'))
end
But the following raises Encoding::CompatibilityError
:
def test_unescape
p1 = URI::Parser.new
assert_equal("\xe3\x83\x90", p1.unescape("\xe3\x83\x90"))
assert_equal("\xe3\x83\x90", p1.unescape('%e3%83%90'))
assert_equal("\xe3\x83\x90\xe3\x83\x90", p1.unescape("\xe3\x83\x90%e3%83%90"))
end
The issue is in the definition of URI::RFC2396_Parser#unescape
:
def unescape(str, escaped = @regexp[:ESCAPED])
str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end
The behaviour is as follows:
- If the
String
contains only ASCII characters (including percent-escapes), then substituting each result of[$&[1, 2].hex].pack('C')
(which returns ASCII-8BIT) succeeds, because theString
so far is safely coerced toASCII-8BIT
to let the concatenation work. - If the
String
contains only ASCII + Unicode characters (and no percent-escapes), thengsub
matches nothing. - If the
String
contains both, however, then attempting togsub
individual ASCII-8BIT characters (which can't be coerced to UTF-8) fails withEncoding::CompatibilityError
.
This patch:
- Adds the test.
- Records the original encoding of the input string, forces the encoding to ASCII-8BIT for the
gsub
, then forces the encoding back aftergsub
. If the percent-encoded characters aren't valid in the original encoding, that's up to the user, but this is better than just refusing to perform the unescape at all. - Corrects a minor doc mismatch.
Thanks to @tenderlovemaking (Aaron Patterson) for helping me find this in upstream Ruby, who suggested @naruse (Yui NARUSE) might be interested in reviewing this patch. We currently run with this monkey-patched at GitHub, and in Rails master
.
Files
Updated by mame (Yusuke Endoh) almost 7 years ago
I'm unfamiliar with encoding, so this is just my impression: str.dup.force_encoding(Encoding::ASCII_8BIT)
always creates a string object, so str.gsub(re) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
might be better?
(And, this is not directly related to this ticket, I'm unsure how the "escaped" optional parameter is helpful.)
Updated by naruse (Yui NARUSE) almost 7 years ago
- Status changed from Open to Closed
Applied in changeset trunk|r62695.
fix error if the input is mixed Unicode and percent-escapes
Reported by kivikakk (Ashe Connor) with tests and doc fix
Patch based on mame and fix by naruse
[Bug #14586]