Bug #15979

URI.parse does not validate components

Added by singpolyma (Stephen Paul Weber) 3 months ago. Updated 3 days ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
[ruby-core:93505]

Description

URI.parse("https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?")

happily returns a URI::HTTPS object, even though it has an invalid component and cannot be constructed using URI::HTTPS.build.

This is because the parser uses the undocumented initializer, which by default does not validate the components. I would suggest that the parser either pass that initializer the flag that enables validation, or use the build method instead.
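A minimal sketch for reproducing the mismatch with the standard uri library. The helper names report_url and build_outcome are hypothetical and exist only for this illustration. Whether URI::HTTPS.build rejects the host depends on which parser the installed uri version uses for build, so the script prints the outcome rather than assuming it.

```ruby
require "uri"

# The URL from this report (hypothetical helper, for illustration only).
def report_url
  "https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?"
end

# Returns :accepted if URI::HTTPS.build takes the host, :rejected if it
# raises a URI error. The result may differ between Ruby versions.
def build_outcome(host)
  URI::HTTPS.build(host: host, path: "/foo")
  :accepted
rescue URI::Error
  :rejected
end

parsed = URI.parse(report_url)
puts parsed.class                 # class of the object URI.parse returned
puts parsed.host.inspect          # host component as parsed
puts build_outcome(parsed.host)   # does build take the same host?
```

If the last line prints rejected, the installed library shows the behaviour described in this report.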


Files

uri-parse-validate-15979.patch (3.42 KB) jeremyevans0 (Jeremy Evans), 10/10/2019 10:47 PM

History

Updated by jeremyevans0 (Jeremy Evans) 3 days ago

This is not a bug, and it is not related to validation. The reason for the behavior is that URI.parse uses an RFC 3986 parser, while URI::HTTPS.build uses an RFC 2396 parser. If you use URI::HTTPS.new with an RFC 3986 parser and tell it to validate the components, you get a valid URI:

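require "uri"

# Split with the RFC 3986 parser, then construct with that same parser
# and arg_check = true so that every component is validated.
URI::HTTPS.new(
  *URI::RFC3986_PARSER.split(
    "https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?"),
  URI::RFC3986_PARSER, true)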

The issue here is that the hostname you provide in the URI is invalid in RFC 2396 but valid in RFC 3986.

RFC 2396 ABNF:

host          = hostname | IPv4address
hostname      = *( domainlabel "." ) toplabel [ "." ]
domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

RFC 3986 ABNF:

host          = IP-literal / IPv4address / reg-name
reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                    / "*" / "+" / "," / ";" / "="
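The two grammars above can be compared directly by transcribing them into regular expressions. This is a hand-written sketch, not the patterns the uri library uses internally; the method names rfc2396_host? and rfc3986_reg_name? are hypothetical. The RFC 3986 check covers only the reg-name alternative, not IP-literal or IPv4address.

```ruby
# RFC 2396: host = hostname | IPv4address
def rfc2396_host?(host)
  alphanum    = "[A-Za-z0-9]"
  # domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
  domainlabel = "#{alphanum}(?:[A-Za-z0-9-]*#{alphanum})?"
  # toplabel = alpha | alpha *( alphanum | "-" ) alphanum
  toplabel    = "[A-Za-z](?:[A-Za-z0-9-]*#{alphanum})?"
  # hostname = *( domainlabel "." ) toplabel [ "." ]
  hostname    = "(?:#{domainlabel}\\.)*#{toplabel}\\.?"
  # IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
  ipv4        = "\\d+\\.\\d+\\.\\d+\\.\\d+"
  /\A(?:#{hostname}|#{ipv4})\z/.match?(host)
end

# RFC 3986: reg-name = *( unreserved / pct-encoded / sub-delims )
def rfc3986_reg_name?(host)
  /\A(?:[A-Za-z0-9\-._~]|%\h\h|[!$&'()*+,;=])*\z/.match?(host)
end

host = "-._~%2C!$&'()*+,;="
puts "RFC 2396 host:     #{rfc2396_host?(host)}"
puts "RFC 3986 reg-name: #{rfc3986_reg_name?(host)}"
```

An ordinary name such as example.com satisfies both checks; only the RFC 3986 check matches the host from this report.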

With the URI provided, the host is -._~%2C!$&'()*+,;=, which is valid according to the RFC 3986 ABNF:

- : unreserved
. : unreserved
_ : unreserved
~ : unreserved
%2C : pct-encoded
! : sub-delims
$ : sub-delims
& : sub-delims
' : sub-delims
( : sub-delims
) : sub-delims
* : sub-delims
+ : sub-delims
, : sub-delims
; : sub-delims
= : sub-delims
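The breakdown above can be produced mechanically. The following is a small sketch that tokenizes a host string and labels each token with the RFC 3986 rule it matches; classify_reg_name is a hypothetical helper written for this illustration, not part of the uri library.

```ruby
# Splits a host into percent-escapes and single characters, then labels
# each token with the RFC 3986 reg-name rule it falls under.
def classify_reg_name(host)
  host.scan(/%\h\h|./m).map do |token|
    kind =
      case token
      when /\A%\h\h\z/            then :"pct-encoded"
      when /\A[A-Za-z0-9\-._~]\z/ then :unreserved
      when /\A[!$&'()*+,;=]\z/    then :"sub-delims"
      else :invalid
      end
    [token, kind]
  end
end

classify_reg_name("-._~%2C!$&'()*+,;=").each do |token, kind|
  puts "#{token} : #{kind}"
end
```

A host is a valid reg-name under this sketch exactly when no token is labelled invalid.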

As for why RFC 3986 is used in some places (parse/join/split) and RFC 2396 everywhere else, I believe it comes down to backwards compatibility. Previously, there were some issues with [ and ] not being allowed in query parts in RFC 3986 (#10402), but those are now worked around. However, URI::RFC2396_Parser and URI::RFC3986_Parser are not API compatible, so you cannot simply swap one for the other without breaking things.

In case you or someone else is interested in changing the default parser, attached is a minimal patch to make the RFC 3986 parser the default. It passes the URI tests, but I haven't done any testing beyond that. Hopefully it provides a decent starting point.
