Bug #19756
URI::HTTP.build does not accept a host of `_gateway`, but `URI.parse` will.
Description
I noticed a difference in behavior between URI::HTTP.build and URI.parse. URI::HTTP.build will not accept a host: value of _gateway, but URI.parse will.
Steps To Reproduce
URI::HTTP.build(host: "_gateway")
vs.
URI.parse("http://_gateway")
Expected Results
Both raise the same exception, or return the same URI object.
Actual Results
URI::HTTP.build(host: "_gateway")
/usr/share/ruby/uri/generic.rb:601:in `check_host': bad component(expected host component): _gateway (URI::InvalidComponentError)
	from /usr/share/ruby/uri/generic.rb:640:in `host='
	from /usr/share/ruby/uri/generic.rb:673:in `hostname='
	from /usr/share/ruby/uri/generic.rb:190:in `initialize'
	from /usr/share/ruby/uri/generic.rb:136:in `new'
	from /usr/share/ruby/uri/generic.rb:136:in `build'
	from /usr/share/ruby/uri/http.rb:61:in `build'
	from (irb):2:in `<main>'
	from /usr/local/share/gems/gems/irb-1.7.0/exe/irb:9:in `<top (required)>'
	from /usr/local/bin/irb:25:in `load'
	from /usr/local/bin/irb:25:in `<main>'
URI.parse("https://_gateway")
# => #<URI::HTTPS https://_gateway>
Additional Information
$ gem list uri
uri (default: 0.12.1)
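
The two calls from the report can be compared in a single script. This is only a sketch, not part of the original report: `try_build` is a hypothetical helper, and because the outcome of `URI::HTTP.build` may depend on the installed uri version, the script reports the result instead of assuming one.

```ruby
require "uri"

# Hypothetical helper: returns [:ok, uri] when URI::HTTP.build accepts the
# host, or [:error, exception] when it raises URI::InvalidComponentError.
def try_build(host)
  [:ok, URI::HTTP.build(host: host)]
rescue URI::InvalidComponentError => e
  [:error, e]
end

parsed = URI.parse("http://_gateway")
status, result = try_build("_gateway")

puts "URI.parse       -> #{parsed.inspect}"
puts "URI::HTTP.build -> #{status}: #{result.inspect}"
```

On the uri 0.12.1 setup described above, the second line would be expected to report the `check_host` error shown in the backtrace.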
        
           Updated by mame (Yusuke Endoh) over 2 years ago
      Note that underscores are not allowed in host names.
I think it is reasonable for URI::HTTP.build(host:"_gateway") to raise an exception in order to prevent the generation of invalid URI strings.
The behavior of URI.parse("https://_gateway") is debatable. I think it is acceptable because the invalid URI string has already been created by someone else. We could change it to raise an exception, but that would break compatibility.
At present, I prefer to change nothing. If you have a specific reason for wanting to change the behavior, please elaborate.
        
           Updated by Dan0042 (Daniel DeLorme) over 2 years ago
      Maybe underscores are not allowed by some spec, but they are common in the wild. _dmarc.example.com and google._domainkey.example.com are standard subdomains. And many/most DNS servers will happily accept subdomains with underscores.
        
           Updated by shugo (Shugo Maeda) over 2 years ago
      Dan0042 (Daniel DeLorme) wrote in #note-2:
Maybe underscores are not allowed by some spec, but they are common in the wild.
_dmarc.example.com and google._domainkey.example.com are standard subdomains. And many/most DNS servers will happily accept subdomains with underscores.
Such underscored names are described in RFC 8552, but is there any use case to use them with URI::HTTP.build?
        
           Updated by Dan0042 (Daniel DeLorme) over 2 years ago
      shugo (Shugo Maeda) wrote in #note-3:
is there any use case to use them with URI::HTTP.build?
I assume the purpose of URI::HTTP.build is the same as URI.parse but with a hash instead of a string. While writing a crawler I have seen HTTP hostnames with an underscore, that would fail because of URI restrictions, which I had to monkey patch in order to accept the underscore. Since http://not_std.example.com is possible and present in the wild, I think it should be possible to build a URI::HTTP object to represent it, either with .parse or .build. "Be liberal in what you accept."
BTW the same error is raised for URI::Generic::build(host: "_dmarc.example.com") which seems to me like it should be a valid way of storing a DMARC domain.
        
           Updated by austin (Austin Ziegler) over 2 years ago
      Dan0042 (Daniel DeLorme) wrote in #note-4:
shugo (Shugo Maeda) wrote in #note-3:
is there any use case to use them with URI::HTTP.build?
I assume the purpose of URI::HTTP.build is the same as URI.parse but with a hash instead of a string. While writing a crawler I have seen HTTP hostnames with an underscore, that would fail because of URI restrictions, which I had to monkey patch in order to accept the underscore. Since http://not_std.example.com is possible and present in the wild, I think it should be possible to build a URI::HTTP object to represent it, either with .parse or .build. "Be liberal in what you accept."
BTW the same error is raised for URI::Generic::build(host: "_dmarc.example.com") which seems to me like it should be a valid way of storing a DMARC domain.
RFC1123 and related RFCs suggest that network reachable hostnames may not have underscores, although they are permitted in informational DNS records.
Strictly disallowing underscores from URI::HTTP.build seems to be correct (I do not know of any hostnames with underscores in them). On the other hand, allowing them in URI::Generic may be permissible, although I would probably want something to flag that I’m explicitly allowing underscores (_dmarc.example.com would IMO be the DNS equivalent of dmarc://example.com in terms of a URI as it refers to the DMARC configuration record for example.com emails).
https://stackoverflow.com/questions/10959757/the-use-of-the-underscore-in-host-names#comment14874905_11206362 suggests that there are widespread systems that Do This Wrong, but whether they can be reached over the network is an entirely different issue.
        
           Updated by austin (Austin Ziegler) over 2 years ago
      This is a better thread overall and there are a number of points worth reading in it. It boils down to:
- Underscores are not permitted in hostnames and therefore URLs/URIs.
- Leading underscores are permitted in DNS labels.
- Underscores are not otherwise permitted in DNS labels.
https://stackoverflow.com/questions/2180465/can-domain-name-subdomains-have-an-underscore-in-it
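
The three rules summarized above can be sketched as a small classifier. This is an illustration of the summary only, not of what the uri library does: `LDH_LABEL`, `UNDERSCORE_LABEL`, and `classify_name` are hypothetical names, and the 63-character label limit and the letters-digits-hyphen rule are assumed from the usual hostname conventions.

```ruby
# Letters, digits, and hyphens; 1 to 63 characters; no leading or trailing hyphen.
LDH_LABEL = /\A[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\z/

# Same shape, but with one leading underscore (for example _dmarc or _domainkey).
UNDERSCORE_LABEL = /\A_[A-Za-z0-9](?:[A-Za-z0-9-]{0,60}[A-Za-z0-9])?\z/

# Hypothetical helper: classifies a dotted name according to the summary.
#   :hostname -> every label is a plain LDH label
#   :dns_only -> some label has a leading underscore, none has one elsewhere
#   :invalid  -> anything else (empty labels, underscores in other positions)
def classify_name(name)
  labels = name.split(".", -1)
  return :invalid if labels.empty? || labels.any?(&:empty?)
  return :hostname if labels.all? { |l| LDH_LABEL.match?(l) }
  return :dns_only if labels.all? { |l| LDH_LABEL.match?(l) || UNDERSCORE_LABEL.match?(l) }

  :invalid
end

%w[example.com _dmarc.example.com google._domainkey.example.com not_std.example.com].each do |name|
  puts "#{name}: #{classify_name(name)}"
end
```

Under these assumed rules, `_dmarc.example.com` is a valid DNS name but not a hostname, while `not_std.example.com` is neither.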
        
           Updated by Dan0042 (Daniel DeLorme) over 2 years ago
      While all this is technically true and correct, I am not particularly interested in "what is permitted"; I think "what actually exists in the real world out there" is the only thing worth caring about.
The robustness principle is "be conservative in what you do, be liberal in what you accept from others". If there's a website at http://my_god.example.com and ruby cannot connect to it because there's an underscore, then that website fails the first half of robustness principle, and ruby fails the second half.
        
           Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
      Dan0042 (Daniel DeLorme) wrote in #note-7:
While all this is technically true and correct, I am not particularly interested in "what is permitted"; I think "what actually exists in the real world out there" is the only thing worth caring about.
The robustness principle is "be conservative in what you do, be liberal in what you accept from others".
The robustness principle should only be used in the case where there is not an official standard. In the cases where there is an official standard, applying the robustness principle to support non-standard implementations is actively harmful and results in systems being made worse by being forced to tolerate bugs in non-standard implementations. Workarounds to tolerate bugs in non-standard implementations can be a source of security vulnerabilities. Additional discussion:
- https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html
- https://queue.acm.org/detail.cfm?id=1999945
There may be cases where non-standard usage is so widespread that you are forced to tolerate it as a de facto standard, but this does not appear to be one of those cases.
That being said, let me summarize my research on this issue. DNS allows underscores in DNS names, but that does not necessarily apply to URLs. The current URL spec at https://url.spec.whatwg.org/ does not seem to exclude underscores in the host name part of a URL (an underscore could be part of an opaque host). The current HTTP RFC (RFC 7231) does not seem to exclude them either. If you follow the references:
- https://datatracker.ietf.org/doc/html/rfc7231#section-5.1
- https://datatracker.ietf.org/doc/html/rfc7230#appendix-B (uri-host)
- https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2
While RFC 3986 states that host names are intended for DNS lookup using the syntax in Section 3.5 of RFC1034 and Section 2.1 of RFC1123, it also states:
   This specification does not mandate a particular registered name
   lookup technology and therefore does not restrict the syntax of reg-
   name beyond what is necessary for interoperability.
The ABNF syntax given in RFC 3986 is:
  reg-name    = *( unreserved / pct-encoded / sub-delims )
  unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
This indicates that an underscore is a valid character in an HTTP hostname. There is no requirement that HTTP use DNS for registered name lookup, and therefore it seems reasonable to allow underscores in host names in URI::HTTP.build.
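
The ABNF above translates fairly directly into a regular expression. The following is a minimal sketch under stated assumptions: the constant names and `reg_name?` are made up for illustration, the `pct-encoded` and `sub-delims` rules are filled in from RFC 3986, and this is not the pattern the uri library itself uses.

```ruby
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED  = /[A-Za-z0-9\-._~]/
# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = /%\h\h/
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS  = /[!$&'()*+,;=]/

# reg-name    = *( unreserved / pct-encoded / sub-delims )
REG_NAME = /\A(?:#{UNRESERVED}|#{PCT_ENCODED}|#{SUB_DELIMS})*\z/

# Hypothetical helper: true when the string matches the reg-name production.
def reg_name?(host)
  REG_NAME.match?(host)
end

["_gateway", "not_std.example.com", "bad host", "a%2"].each do |host|
  puts "#{host.inspect}: #{reg_name?(host)}"
end
```

Both underscored names satisfy `reg-name`, while a host containing a space or a truncated percent escape does not.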
        
           Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
      - Related to Bug #19266: URI::Generic should use URI::RFC3986_PARSER instead of URI::DEFAULT_PARSER added