Bug #19196
closed
The string saved to Tempfile from URI.open escapes "&" character
Description
When I am reading the string response from a URI.open, the response is not equivalent to the response body.
How to reproduce:
require "open-uri"
url = "https://www.podcastone.com/podcast?categoryID2=1237"
handle = URI.open(url)
=> #<Tempfile:/path/to/tempfile>
puts handle.read
.... https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309...
In the browser, the actual string reads:
https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&adwNewID3=true&awNetwork=309
Notice the characters &#38; in the string read from the Tempfile, where the browser shows a plain "&".
My initial research suggests this is because the Tempfile that gets created is in ASCII-8BIT, and in ASCII-8BIT the ampersand is "38".
I propose that we should have a way to force the encoding of the Tempfile to UTF-8 so that this character is not escaped and the string encoding is preserved.
Updated by westoque (William Estoque) almost 2 years ago
- Subject changed from The string saved to Tempfile from URI.open escapes "&" characters to The string saved to Tempfile from URI.open escapes "&" character
Updated by westoque (William Estoque) almost 2 years ago
- Description updated (diff)
Updated by ufuk (Ufuk Kayserilioglu) almost 2 years ago
The content you are reading is XML, and the &#38; characters are there because of XML-escaping. They are not related to any kind of file encoding, ASCII-8BIT or UTF-8.
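To make the distinction concrete, here is a minimal sketch (the sample string is made up for illustration, not taken from the feed). It suggests that re-tagging a string's encoding leaves its bytes untouched, so an escaped ampersand stays escaped, and that the 38 in the character reference is simply the code point of "&":

```ruby
# Hypothetical sample: an XML-escaped ampersand between two letters.
escaped = "a&#38;b"

# Re-tag the same bytes as binary, then as UTF-8. No bytes are changed.
binary = escaped.dup.force_encoding(Encoding::ASCII_8BIT)
utf8   = binary.dup.force_encoding(Encoding::UTF_8)

same_bytes    = binary.bytes == utf8.bytes   # encoding tags differ, bytes do not
still_escaped = utf8.include?("&#38;")       # the character reference survives
code_point    = "&".ord                      # the number inside the reference

puts same_bytes, still_escaped, code_point
```

Under these assumptions, forcing the Tempfile to UTF-8 would not remove the escaping, because the escaping is part of the XML text itself.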
Moreover, they are there in the response from the server, which you can see by looking at the output of curl for the same resource:
$ curl -s "https://www.podcastone.com/podcast?categoryID2=1237" | grep "aw.noxsolutions.com/launchpod/adswizz/1237/762-"
...
<enclosure length="74614442" type="audio/mpeg" url="https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309"></enclosure>
...
So, this is not a Ruby problem at all. On the contrary, Ruby can help you unescape these characters:
require "cgi"
CGI.unescapeHTML("foo&#38;bar") # => "foo&bar"
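Applied to a URL of the kind in the report, the same call might look like the sketch below. The `raw` value is a shortened, hypothetical stand-in for an `url` attribute copied out of the raw XML; it mixes a numeric reference and a named one to show that both are decoded:

```ruby
require "cgi"

# Hypothetical attribute value as it would appear in the raw XML text.
raw = "https://example.com/episode.mp3?awCollectionId=1237&#38;adwNewID3=true&amp;awNetwork=309"

# Decode the character references back to literal ampersands.
url = CGI.unescapeHTML(raw)
puts url
```

An XML parser would typically perform this decoding when reading attribute values, so manual unescaping is mainly relevant when the URL is pulled out of the raw text with string matching.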
Updated by Eregon (Benoit Daloze) almost 2 years ago
- Status changed from Open to Rejected
Updated by westoque (William Estoque) almost 2 years ago
@ufuk (Ufuk Kayserilioglu) thank you for that explanation. I may have jumped to conclusions when comparing the response in the browser (Chrome), which unescaped the characters, with the output of curl.