Bug #8352
Closed
URI squeezes a sequence of slashes in merging paths when it shouldn't
Description
Neither RFC 2396 (on which the library is currently based) nor RFC 3986 says anything about a sequence of slashes in the path part, apart from the parsing rules for a URI (path) that starts with two slashes.
It should be perfectly valid to have one slash right after another, and there is no reason to "normalize" a sequence of slashes into a single slash, which is what uri actually does when merging paths:
URI.parse('http://example.com/foo//bar/')+'.'
=> #<URI::HTTP:0x0000080303d2b0 URL:http://example.com/foo/bar/>
Fixing this may be as easy as changing the regexp in URI::Generic#split_path from %r{/+} to %r{/}, but I wonder how large the impact of the resulting incompatibility would be.
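To make the difference between the two regexps concrete, here is a minimal sketch using plain String#split. It only illustrates the patterns named above; the actual body of URI::Generic#split_path may differ, and the -1 limit is an assumption made here so that trailing empty strings stay visible.

```ruby
path = '/foo//bar/'

# Splitting on runs of slashes drops the empty component
# between "foo" and "bar".
squeezed = path.split(%r{/+}, -1)

# Splitting on single slashes keeps that empty component.
preserved = path.split(%r{/}, -1)

p squeezed   # ["", "foo", "bar", ""]
p preserved  # ["", "foo", "", "bar", ""]

# Joining the components back shows which split is lossless.
p squeezed.join('/')   # "/foo/bar/"
p preserved.join('/')  # "/foo//bar/"
```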
Updated by knu (Akinori MUSHA) over 11 years ago
s/RFC 2896/RFC 2396/
Updated by naruse (Yui NARUSE) about 10 years ago
- Description updated (diff)
Updated by knu (Akinori MUSHA) almost 7 years ago
- Subject changed from uri squeezes a sequence of slashes in merging paths when it shouldn't to URI squeezes a sequence of slashes in merging paths when it shouldn't
- Description updated (diff)
- Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)
Updated by knu (Akinori MUSHA) almost 7 years ago
Addressable::URI (of the addressable gem) properly preserves sequences of slashes in a path, so it is a workaround to use it instead.
I've confirmed that net/url of Go, URI of Perl, urlparse.urljoin of Python2, and java.net.URL of Java never do this kind of unwanted normalization.
The single exception I could find, however, was urllib.parse of Python3. (!)
% python3
Python 3.6.3 (default, Nov 4 2017, 01:15:26)
[GCC 4.2.1 Compatible FreeBSD Clang 3.8.0 (tags/RELEASE_380/final 262564)] on freebsd11
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin('http://example.com/foo//bar/baz', '.')
'http://example.com/foo/bar/'
I'm not sure if this is an intentional change from Python2, but I believe any slash in the path part should be retained.
Updated by knu (Akinori MUSHA) almost 7 years ago
I've also checked the url module of node.js, and it doesn't do this either. Their test cases do not include explicit examples of how to deal with sequences of slashes in a path, but there are some occurrences of double slashes retained in the expected results of relative path resolution, which means a double slash is not subject to squeezing.
Looking into the WHATWG URL spec, there's no indication that a sequence of slashes in a URL path should be treated specially. A path is simply a "list" of "items" separated by the slash (/, U+002F), and any item can naturally be an empty string. Even when resolving a "double-dot segment" and consequently "removing" a path "item", you are never told to "remove" extra items that are empty.
So, as you can see, Ruby and Python3 are the only exceptions. No specification indicates that a sequence of slashes in a URL path should be treated specially, and the majority of library implementations in other languages agree. I presume there are few programmers who would rely on the current behavior.
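As an illustration of dot-segment resolution that leaves empty components alone, here is a rough sketch of the remove_dot_segments procedure from RFC 3986, section 5.2.4. It is written from the RFC's description, not taken from Ruby's uri library; the method name simply follows the RFC's wording.

```ruby
# Sketch of RFC 3986 section 5.2.4. Not the uri library's implementation.
def remove_dot_segments(path)
  input = path.dup
  output = String.new
  until input.empty?
    if input.start_with?('../')
      input = input[3..-1]
    elsif input.start_with?('./') || input.start_with?('/./')
      input = input[2..-1]
    elsif input == '/.'
      input = '/'
    elsif input.start_with?('/../')
      input = input[3..-1]
      # Drop the last output segment together with its leading slash.
      output.sub!(%r{/?[^/]*\z}, '')
    elsif input == '/..'
      input = '/'
      output.sub!(%r{/?[^/]*\z}, '')
    elsif input == '.' || input == '..'
      input = ''
    else
      # Move one segment (an optional leading slash plus non-slash
      # characters) to the output. An empty segment moves as a bare "/".
      segment = input[%r{\A/?[^/]*}]
      output << segment
      input = input[segment.length..-1]
    end
  end
  output
end

p remove_dot_segments('/foo//bar/.')   # "/foo//bar/"
p remove_dot_segments('/foo//bar/..')  # "/foo//"
p remove_dot_segments('/a//../b')      # "/a/b"
```

In the last call the ".." removes exactly one component, which happens to be the empty one; nothing else is squeezed.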
Updated by duerst (Martin Dürst) almost 7 years ago
knu (Akinori MUSHA) wrote:
I presume there are few programmers who would rely on the current behavior.
I agree that there should be few programmers who would rely on consecutive slashes being collapsed to a single slash. However, I also think it's a bad idea for programmers or users to rely on multiple consecutive slashes being preserved. Using multiple consecutive slashes in a URI is a bad idea.
Updated by phluid61 (Matthew Kerwin) almost 7 years ago
duerst (Martin Dürst) wrote:
Using multiple consecutive slashes in a URI is a bad idea.
It definitely doesn't play nicely with dot-segment resolution, but then I wouldn't want to bear the burden of deciding how to resolve that, one way or the other.
In this particular case, I think it is incorrect to automatically remove empty segments, but I also think it's bad to have them in the first place.
What if there was a way for the programmer to explicitly invoke the current behaviour (e.g. by sending a different message), so the side-effect is expected?
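One way to picture such an opt-in is a separate helper that squeezes slashes only when the caller asks for it. The name squeeze_path_slashes is hypothetical; nothing like it exists in the uri library, and this is only a sketch built on String#squeeze.

```ruby
require 'uri'

# Hypothetical helper: returns a copy of the URI whose path has each run
# of slashes collapsed to a single slash. The original is left untouched.
def squeeze_path_slashes(uri)
  copy = uri.dup
  copy.path = uri.path.squeeze('/')
  copy
end

original = URI.parse('http://example.com/foo//bar/')
squeezed_uri = squeeze_path_slashes(original)

puts original      # http://example.com/foo//bar/
puts squeezed_uri  # http://example.com/foo/bar/
```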
Updated by knu (Akinori MUSHA) almost 7 years ago
- File 0001-Allow-empty-path-components-in-a-URI-Bug-8352.patch added
- Assignee changed from akira (akira yamada) to naruse (Yui NARUSE)
Naruse-san, could you review the attached patch?
Updated by knu (Akinori MUSHA) almost 7 years ago
- Status changed from Open to Closed
Applied in changeset trunk|r61218.
Allow empty path components in a URI [Bug #8352]
- generic.rb (URI::Generic#merge, URI::Generic#route_to): Fix a bug
where a sequence of slashes in the path part gets collapsed to a
single slash. According to the relevant RFCs and WHATWG URL
Standard, empty path components are simply valid and there is no
special treatment defined for them, so we just keep them as they
are.
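A quick way to check the behaviour described in this changeset is to repeat the merge from the original report. This sketch assumes a Ruby whose bundled uri includes the change; on older versions the report shows the path coming back as /foo/bar/ instead.

```ruby
require 'uri'

base = URI.parse('http://example.com/foo//bar/')
merged = base + '.'

# With the fix applied, the empty path component should survive the merge.
puts merged
puts merged.path.include?('//')
```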
Updated by jeremyevans0 (Jeremy Evans) over 5 years ago
- Has duplicate Bug #12562: URI merge removes empty segment contrary to RFC 3986 added