Misc #17309
Open
URI.escape being deprecated, yet there is no replacement
Description
I'm on Ruby 2.7.2. The moment I do
uri = "http://bücher.ch"
URI.escape uri
(irb):5: warning: URI.escape
"http://b%C3%BCcher.ch"
I get that warning. Rubocop also tells me:
"""
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
"""
However, none of the suggestions does the same as URI.escape.
CGI.escape uri
=> "http%3A%2F%2Fb%C3%BCcher.ch"
URI.encode_www_form_component uri
=> "http%3A%2F%2Fb%C3%BCcher.ch"
URI.encode_www_form uri
Traceback (most recent call last):
NoMethodError (undefined method `map' for "http://bücher.ch":String)
Did you mean? tap
So my question is: why is this being deprecated? And if there's still a reason, what exactly should I replace it with, so I can keep the exact same behaviour?
Updated by jeremyevans0 (Jeremy Evans) almost 4 years ago
Maybe @naruse (Yui NARUSE) can describe the reason it was deprecated over 11 years ago in 238b979f1789f95262a267d8df6239806f2859cc. In my opinion, as the one who changed the deprecation warning from verbose mode to always in 2.7 and removed the method in 3.0, it has various issues. The API is a bit too easy to misuse, as URI.escape(' ', 'UTF-8') returns ' '. It doesn't escape URL query parameters as you would expect: URI.escape(' ') is '%20' and not '+'. It uses an RFC 2396 parser and not an RFC 3986 parser, so URI.escape('[]') is '[]' and not '%5B%5D'. Both CGI.escape and URI.encode_www_form_component are probably better general-purpose escaping methods. I generally prefer CGI.escape as it is written in C and should be significantly faster.
Can you explain why "http%3A%2F%2Fb%C3%BCcher.ch" is invalid in your use case?
For exactly the same behavior you can use URI::DEFAULT_PARSER.escape(str).
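The three issues listed above can be reproduced with a short sketch. It assumes a Ruby version where `URI.escape` itself has been removed, so it goes through `URI::RFC2396_Parser` (the variable name `legacy` is only for illustration) and compares the results with `URI.encode_www_form_component`:

```ruby
require "uri"

# Assumption: an RFC 2396 parser instance provides the legacy escaping
# behaviour discussed in this thread.
legacy = URI::RFC2396_Parser.new

# The second argument is the set of characters to escape, not an encoding,
# so nothing is escaped here.
p legacy.escape(" ", "UTF-8")              # => " "

# A space becomes %20, not the "+" used for form-encoded query values.
p legacy.escape(" ")                       # => "%20"

# Square brackets are left alone under RFC 2396 rules.
p legacy.escape("[]")                      # => "[]"

# The form-component encoder handles both cases the way query strings expect.
p URI.encode_www_form_component(" ")       # => "+"
p URI.encode_www_form_component("[]")      # => "%5B%5D"
```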
Updated by chucke (Tiago Cardoso) almost 4 years ago
Hi Jeremy, thx for the context on the inconsistencies, that's pretty useful info.
Can you explain why "http%3A%2F%2Fb%C3%BCcher.ch" is invalid in your use case?
My specific use-case is for supporting IDN domain names for HTTP requests in httpx, the HTTP client library I maintain (of which "bücher.ch" is an example).
Because this domain is not ascii, in order to resolve it, I have to first convert it into punycode (you can use this website (https://www.punycoder.com/) to see the translation).
When using httpx, a user will pass the full request URL: "http://bücher.ch" (which, I know, is not a valid URL because it's not ASCII), so I need to, first, isolate the "host" part of this URL (or IRI), convert it to "punycode", perform the DNS resolution, then perform the HTTP request with the "host/authority" header set to "bücher.ch". For a full functional demonstration, you can do the request with cURL and analyse it yourself.
This thread I started is all about the first step, "isolating the host part of the IRI". Because the "uri" library doesn't work with IRIs, my workaround is, whenever the URL is not ASCII, to:
- use URI.escape to escape the domain into something URI can parse;
- use URI() to parse into a URI::HTTP object;
- get host, URI.unescape it, "punycode" it;
- carry forward both domains, to perform DNS and HTTP requests;
If I use CGI.escape or any of the suggested alternatives, the resulting escaped string isn't a URL the "uri" library can parse. And this is why I needed the deprecated URI.escape.
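A minimal sketch of the first three steps of that workaround, assuming a Ruby version where `URI.escape` and `URI.unescape` are no longer available. The helper name `split_idn_host` is hypothetical and this is not httpx's actual code. It uses `URI::RFC2396_Parser` for escaping and unescaping, and it leaves the punycode conversion out, since that step needs a third-party library.

```ruby
require "uri"

# Hypothetical helper, not httpx's actual implementation.
# Returns the parsed URI (whose host is percent-escaped ASCII) together with
# the original Unicode host, which would still have to be punycode-encoded
# before DNS resolution.
def split_idn_host(url)
  legacy = URI::RFC2396_Parser.new

  # Step 1: percent-escape non-ASCII bytes so that URI() accepts the string.
  escaped = url.ascii_only? ? url : legacy.escape(url)

  # Step 2: parse the escaped string into a URI object.
  uri = URI(escaped)

  # Step 3: undo the escaping on the host only; force UTF-8 so the result
  # compares equal to ordinary UTF-8 strings.
  unicode_host = legacy.unescape(uri.host).force_encoding(Encoding::UTF_8)

  [uri, unicode_host]
end

uri, unicode_host = split_idn_host("http://bücher.ch")
puts unicode_host # bücher.ch
```

Both values are returned because the workaround carries the escaped URI and the Unicode host forward separately.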
Updated by shyouhei (Shyouhei Urabe) almost 4 years ago
Re: the reason why URI.escape is deprecated and also why there is no transition path:
URI, in general, is a "structured" construct. It has a scheme, host, path, query, and fragment, and each of them has a different way to escape a character.
OTOH when you do URI.escape, it's because you want to make an invalid URL valid, like you mentioned. This means the input string's structure is broken. Yet, you have to find the right structure inside the broken string and properly apply escape sequences to each part. This is, simply, impossible. What is broken is broken. You can't fix it.
So in short what was bad about URI.escape is the very idea of escaping a broken URL. This is why it is deprecated and there will be no alternative.
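One way to read this point is that escaping belongs to each component before the URL is assembled, not to the finished string. The following is only an illustrative sketch with made-up values (`example.com`, `/search`, and the query parameters are assumptions), using `URI.encode_www_form` for the query and `URI::HTTP.build` to put the parts together:

```ruby
require "uri"

# Escape the query on its own, with the rules that apply to form-encoded
# query strings ("&" inside a value must not be mistaken for a separator).
query = URI.encode_www_form({ "q" => "a b&c", "lang" => "de" })

# Assemble the URL from already-valid components.
uri = URI::HTTP.build(host: "example.com", path: "/search", query: query)

puts query     # q=a+b%26c&lang=de
puts uri.to_s  # http://example.com/search?q=a+b%26c&lang=de
```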
Updated by shyouhei (Shyouhei Urabe) almost 4 years ago
"But my browser can take UTF-8 URLs!", you might wonder. The reality is they no longer honour what the RFCs say. Modern browsers follow another standard, https://url.spec.whatwg.org/, which states very clearly that the URLs it defines must accept UTF-8. As far as browsers go with WHATWG URL, there is no need for escaping to creep in.
Updated by chucke (Tiago Cardoso) almost 4 years ago
I may not be in possession of the full details of the decision, or of exactly what the deprecated URI.escape covered. I have seen previous discussions on what a URI is, what's stated in the RFC, and what the scope of the uri library is. So far, so good.
About the "my browser supports UTF-8 URLs", I think they also do what I advocate: encode the individual components of the URL into ASCII, and then DNS/TCP/HTTP it. I don't think they support something different, and they could not, for interoperability reasons. They abstract the details, though.
I wonder why uri can't do the same. punycode IDN domains are a standard, and although uri doesn't support it, it could at least return a URI object with the ASCII-encoded components. The same could be said of any UTF-8 "browser URI", btw. uri could handle that complexity for the user, IMO. And it's more than "make an invalid URL valid"; net-http also allows me to pass UTF-8 strings as HTTP headers, although everything gets properly encoded before going over the wire, thereby handling that implementation detail.
This is my current workaround: https://gitlab.com/honeyryderchuck/httpx/-/blob/master/lib/httpx/utils.rb#L28-41
Updated by shyouhei (Shyouhei Urabe) almost 4 years ago
chucke (Tiago Cardoso) wrote in #note-5:
I wonder why uri can't do the same. punycode IDN domains are a standard, and although uri doesn't support it, it could at least return a URI object with the ASCII-encoded components. The same could be said of any UTF-8 "browser URI", btw. uri could handle that complexity for the user, IMO. And it's more than "make an invalid URL valid"; net-http also allows me to pass UTF-8 strings as HTTP headers, although everything gets properly encoded before going over the wire, thereby handling that implementation detail.
This part I agree with. There should be a way to follow the current best practice, not a dinosaur RFC. And that is not just a "make an invalid URL valid" business (which is what URI.escape is expected to do).
Updated by byroot (Jean Boussier) over 2 years ago
- Related to Feature #18593: Add back URI.escape added