Feature #21617
Open
Add Internationalized Domain Name (IDN) support to URI
Description
Originally proposed by @chucke at https://github.com/ruby/uri/issues/76, trying to formalize it here.
Context
Internationalized Domain Names are getting more common, yet Ruby's uri default gem has no support for them:
>> URI("https://日本語.jp/")
URI must be ascii only "https://\u65E5\u672C\u8A9E.jp/" (URI::InvalidURIError)
So any program that wishes to handle arbitrary valid URIs provided by users can't use the uri gem, and instead has to depend on third-party gems like addressable:
>> Addressable::URI.parse("https://日本語.jp/")
=> #<Addressable::URI:0xd648 URI:https://日本語.jp/>
But even then, it won't work seamlessly with other libraries such as net-http:
>> Net::HTTP.get(Addressable::URI.parse("https://日本語.jp/")).bytesize
OpenSSL::SSL::SSLSocket#connect_nonblock': SSL_connect returned=1 errno=0 peeraddr=[2001:218:3001:7::110]:443 state=error: ssl/tls alert handshake failure (SSL alert number 40) (OpenSSL::SSL::SSLError)
You have to explicitly normalize the URL:
>> Addressable::URI.parse("https://日本語.jp/").normalize
=> #<Addressable::URI:0x130d0 URI:https://xn--wgv71a119e.jp/>
>> Net::HTTP.get(Addressable::URI.parse("https://日本語.jp/").normalize).bytesize
=> 8703
Feature Request
I believe it would be very useful if the default uri gem were capable of:
- Parsing IDNA domain names.
- Converting URLs between their Unicode and ASCII forms.
The URI::Generic class already has a #normalize method to ensure the host and scheme parts are all lower case; it could be extended to encode IDN hosts into their ASCII equivalent.
It would also be useful if the opposite operation were supported for display purposes. I'm not sure what name such a method could have, perhaps canonicalize?
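To make the "opposite operation" concrete, here is a minimal sketch of turning an ASCII (xn--) host back into its Unicode form for display. It implements only the Punycode decoding procedure from RFC 3492 and performs no IDNA 2008 validation. The module name PunycodeDecodeSketch and the method names decode and to_unicode are hypothetical and are not part of the uri gem or of this proposal.

```ruby
# Hypothetical sketch, not part of the uri gem: RFC 3492 Punycode decoding only.
module PunycodeDecodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Bias adaptation function from RFC 3492, section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
  end

  # Maps an ASCII byte to its digit value: a-z/A-Z => 0..25, 0-9 => 26..35.
  def digit_value(byte)
    case byte
    when 0x61..0x7A then byte - 0x61
    when 0x41..0x5A then byte - 0x41
    when 0x30..0x39 then byte - 0x30 + 26
    else raise ArgumentError, "invalid Punycode digit: #{byte.chr}"
    end
  end

  # Decodes one label, given without its "xn--" prefix.
  def decode(input)
    pos = input.rindex("-")
    output = pos ? input[0...pos].codepoints : []
    digits = (pos ? input[(pos + 1)..] : input).bytes
    n = INITIAL_N
    i = 0
    bias = INITIAL_BIAS
    idx = 0
    while idx < digits.length
      old_i = i
      w = 1
      k = BASE
      loop do
        raise ArgumentError, "truncated Punycode input" if idx >= digits.length
        d = digit_value(digits[idx])
        idx += 1
        i += d * w
        t = (k - bias).clamp(TMIN, TMAX)
        break if d < t
        w *= BASE - t
        k += BASE
      end
      bias = adapt(i - old_i, output.length + 1, old_i == 0)
      n += i / (output.length + 1)
      i %= output.length + 1
      output.insert(i, n)
      i += 1
    end
    output.pack("U*")
  end

  # Decodes every "xn--" label of a host and leaves other labels untouched.
  def to_unicode(host)
    host.split(".").map { |label|
      label.downcase.start_with?("xn--") ? decode(label[4..]) : label
    }.join(".")
  end
end

puts PunycodeDecodeSketch.to_unicode("xn--wgv71a119e.jp") # 日本語.jp
```

A real implementation would also need the IDNA validation rules before showing the result to users, which is where most of the complexity of a library such as uri-idna lies.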
Implementation
In https://github.com/ruby/uri/issues/76 @skryukov pointed to his pure Ruby implementation of IDNA 2008 (https://github.com/skryukov/uri-idna). I believe it would be good to upstream parts of it into the uri gem to implement these features.
Updated by byroot (Jean Boussier) 21 days ago
- Description updated (diff)
Updated by byroot (Jean Boussier) 21 days ago
@skryukov also pointed out to me the existence of https://github.com/y-yagi/uri-whatwg_parser by @y-yagi
Updated by chucke (Tiago Cardoso) 21 days ago
Just adding my original public API suggestions, for visibility and further discussion by the core team.
I propose that URI::Generic supports punycode decoding OOTB by relying on the current behaviour of URI::Generic#hostname, which already applies transformations to the passed host when necessary, such as in the below case of IPv6 addresses:
# this example is inspired by how uri already handles IPv6 addresses
uri = URI("https://[::1]")
uri.host #=> "[::1]", cannot be used in Socket.new(host, port)
uri.hostname #=> "::1", can be used in Socket.new(host, port)
Therefore, punycode translation would happen transparently for IDNs when calling hostname:
uri = "https://l♥️h.ws"
uri = URI(uri)
uri.host #=> "l♥️h.ws", cannot be used in Socket.new(host, port)
uri.hostname #=> "xn--lh-t0xz926h.ws", can be used in Socket.new(host, port), which will perform DNS via getaddrinfo
This would require very little change in the resolv library, before issuing the DNS query. The same would apply for most use cases, I believe.
The required punycode decoding logic could be implemented in a separate URI::Punycode module. This module could be exposed publicly, with a single public method, decode(uri), which would return the punycode URI of a given IDNA. This API could be extended to support more advanced use cases beyond the main common use case (which URI::Generic#hostname should address), like the ones documented here.
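As a rough illustration of how small the core of such a module could be, below is a minimal sketch of the Unicode-to-ASCII direction. RFC 3492 calls this direction encoding, so the sketch names its method encode rather than decode. It covers only the Punycode algorithm, with no IDNA mapping, normalization or validation tables. PunycodeEncodeSketch, encode and to_ascii are hypothetical names, not the proposed URI::Punycode API.

```ruby
# Hypothetical sketch, not the proposed URI::Punycode: RFC 3492 encoding only.
module PunycodeEncodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Bias adaptation function from RFC 3492, section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
  end

  # Maps a digit value to its character: 0..25 => a-z, 26..35 => 0-9.
  def encode_digit(d)
    (d < 26 ? d + 97 : d - 26 + 48).chr
  end

  # Encodes one label and returns it without the "xn--" prefix.
  def encode(label)
    code_points = label.codepoints
    basic = code_points.select { |cp| cp < 0x80 }
    output = basic.pack("U*")
    h = b = basic.length
    output << "-" if b > 0
    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while h < code_points.length
      # Smallest code point not handled yet.
      m = code_points.select { |cp| cp >= n }.min
      delta += (m - n) * (h + 1)
      n = m
      code_points.each do |cp|
        delta += 1 if cp < n
        next unless cp == n
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t
          output << encode_digit(t + ((q - t) % (BASE - t)))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << encode_digit(q)
        bias = adapt(delta, h + 1, h == b)
        delta = 0
        h += 1
      end
      delta += 1
      n += 1
    end
    output
  end

  # Converts every non-ASCII label of a host and leaves ASCII labels untouched.
  def to_ascii(host)
    host.split(".").map { |label|
      label.ascii_only? ? label : "xn--" + encode(label)
    }.join(".")
  end
end

puts PunycodeEncodeSketch.to_ascii("日本語.jp") # xn--wgv71a119e.jp
```

The output for 日本語.jp matches the normalized host shown in the description. Which code points are mapped, ignored or rejected before this step is decided by the IDNA rules, not by Punycode, so this sketch alone does not settle what a given Unicode host should encode to.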
Updated by adrienjarthon (Adrien Jarthon) 19 days ago
Thanks for this suggestion. I've been trying to improve Addressable's support for this for a few years (https://github.com/sporkmonger/addressable/issues/491, https://github.com/sporkmonger/addressable/issues?q=author%3Ajarthod) and if it can be done directly inside Ruby it's even better I suppose. At the moment I'm using Addressable with libidn2 inside my code as it's the most compliant option, but this is a branch of mine which is still not merged (https://github.com/sporkmonger/addressable/pull/496).
Since then https://github.com/skryukov/uri-idna has also come out, with a good Ruby implementation I believe (I haven't tested it).
@chucke minor comment: xn--lh-t0xz926h.ws is actually not valid in your example, it should be xn--lh-t0x.ws (see https://github.com/sporkmonger/addressable/issues/491)
If this proposal for Ruby is accepted, I can probably help with it :) (implementation, testing, etc.)
Updated by naruse (Yui NARUSE) 17 days ago
I agree with the direction of URI supporting IDN.
But there are some barriers to overcome:
- IDN Library
  - IDN needs some logic and data tables, including punycode and nameprep, as far as I remember
- URI's argument
  - URI.parse's argument is a URI. To support IDN, the argument needs to be changed
IDN Library
libidn2 is the well-known library, but it introduces one more external dependency.
Using a pure Ruby implementation for this is a good idea to avoid the dependency problem.
URI's argument
Introducing a WHATWG parser is an option.
I agree with the direction of the uri library adopting a WHATWG parser.
But in this ticket, just allowing IDN is also a good option to minimize the discussion.
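For illustration only, the "just allowing IDN" option could look roughly like the sketch below: convert the host to ASCII before handing the string to the existing parser, so the parser itself stays ASCII-only. The helper name parse_with_idn_host and its to_ascii: parameter are hypothetical. The label converter is injected as a callable and stubbed here with a one-entry table, since the uri gem currently ships no IDN logic. Non-ASCII characters outside the host, IPv6 literals and alternative label separators are deliberately left unhandled.

```ruby
require "uri"

# Hypothetical helper, not part of the uri gem.
# to_ascii is any callable that maps one Unicode label to its ASCII form.
def parse_with_idn_host(string, to_ascii:)
  # Split into "scheme://", authority, and everything after the authority.
  match = string.match(%r{\A([A-Za-z][A-Za-z0-9+.\-]*://)([^/?#]*)(.*)\z}m)
  return URI.parse(string) unless match

  prefix, authority, rest = match.captures
  userinfo, at, hostport = authority.rpartition("@")
  host = hostport.sub(/:\d*\z/, "")     # strip an optional ":port" suffix
  port_part = hostport[host.length..]   # "" or ":port"

  # ASCII hosts and bracketed IP literals go through unchanged.
  return URI.parse(string) if host.ascii_only? || host.start_with?("[")

  ascii_host = host.split(".").map { |label|
    label.ascii_only? ? label : to_ascii.call(label)
  }.join(".")

  URI.parse("#{prefix}#{userinfo}#{at}#{ascii_host}#{port_part}#{rest}")
end

# Stub converter for the example; a real one would implement IDNA.
stub = ->(label) { { "日本語" => "xn--wgv71a119e" }.fetch(label) }

uri = parse_with_idn_host("https://日本語.jp/", to_ascii: stub)
puts uri.host # xn--wgv71a119e.jp
```

Pre-processing the string like this avoids touching the parser's grammar, at the cost of duplicating a small part of the authority parsing. A WHATWG-style parser would handle the host conversion as part of parsing instead.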
Updated by Eregon (Benoit Daloze) 14 days ago
- Description updated (diff)