Do not unnecessarily encode colons in URIs/IRIs #14

Open
wouterbeek opened this issue Jul 30, 2017 · 3 comments
Comments

@wouterbeek
Contributor

The URI library currently encodes colons in both the path and the query component.

Colons in query components

In Semantic Web services it is very common to include IRIs in the query component, e.g., to indicate a selection or query. uri_query_components/2 encodes colons in the query component, even though this is not necessary. In the following example, %3A should simply be :. The # is legitimately encoded as %23, because it would otherwise be confused with the fragment component separator.

uri_query_components(Query, [predicate('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')]).
Query = 'predicate=http%3A//www.w3.org/1999/02/22-rdf-syntax-ns%23type'.
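For comparison, the minimal encoding argued for here can be sketched outside Prolog. The following is a Python illustration using only the standard `urllib.parse.quote`; it is not the SWI-Prolog library. The helper name `encode_query_value` and its choice of safe characters are assumptions made for this example.

```python
from urllib.parse import quote

def encode_query_value(value: str) -> str:
    """Percent-encode a query value but keep ':' and '/' literal.

    Hypothetical helper, for illustration only. '#', '&', '=' and '+'
    are still escaped because they are not listed in `safe`.
    """
    return quote(value, safe=":/")

iri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
query = "predicate=" + encode_query_value(iri)
print(query)
# prints: predicate=http://www.w3.org/1999/02/22-rdf-syntax-ns%23type
```

The colon survives, while `#` is still escaped so it cannot be read as the start of the fragment.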

Colons in path components

Colons are not very common in IRI paths, but some datasets (e.g., DBpedia) do use them. iri_normalized/2 unnecessarily encodes colons in paths, e.g., translating [1] to [2].

[1]   'http://dbpedia.org/resource/Category:Politics'
[2]   'http://dbpedia.org/resource/Category%3APolitics'
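To show that the colon in [1] is not ambiguous once a scheme is present, here is a small Python sketch (standard `urllib.parse` only, not the SWI-Prolog library). It splits the IRI and compares default escaping with escaping that treats `:` as safe in the path; the choice of `safe="/:"` is an assumption for this example.

```python
from urllib.parse import quote, urlsplit

iri = "http://dbpedia.org/resource/Category:Politics"
parts = urlsplit(iri)

# The scheme ends at the first colon; the later colon stays in the path.
print(parts.scheme)  # http
print(parts.path)    # /resource/Category:Politics

# quote() escapes ':' by default, because only '/' is safe ...
escaped = quote(parts.path)
print(escaped)       # /resource/Category%3APolitics

# ... but leaves it alone when ':' is declared safe as well.
minimal = quote(parts.path, safe="/:")
print(minimal)       # /resource/Category:Politics
```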

Reference

path = path-abempty    ; begins with "/" or is empty
     / path-absolute   ; begins with "/" but not "//"
     / path-noscheme   ; begins with a non-colon segment
     / path-rootless   ; begins with a segment
     / path-empty      ; zero characters
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
@wouterbeek wouterbeek added the bug label Jul 30, 2017
@JanWielemaker
Member

For query components you are probably right. For path components there is a problem: a relative URI can be mistaken for a fully qualified URI. That is what Samer discovered, and it caused the current behaviour (older versions did not escape :). Some git blame and a search on the mailing list will probably find the discussion. This seems consistent with JavaScript's encodeURIComponent(), which also escapes :.

I guess you want a canonical, minimally escaped URI? That is a different task that could be implemented in uri_normalized/2 (which now escapes : as it shares the code). Note that using a : in a segment is allowed, but complicates the translation of an absolute URI into a relative one.

I surely wouldn't call this a bug ...

@wouterbeek
Contributor Author

wouterbeek commented Jul 31, 2017

The use of an unescaped colon is actually not ambiguous. RFC 3986 took this into account:

A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
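The rule quoted above can be sketched in a few lines. This is a Python illustration under stated assumptions, not part of the SWI-Prolog library: the function name `make_relative_path_reference` is hypothetical, and the input is assumed to be a path without a scheme or authority.

```python
from urllib.parse import urlsplit

def make_relative_path_reference(path: str) -> str:
    """Prefix './' when the first segment contains a colon.

    Hypothetical helper: it applies the RFC 3986 rule so that the first
    segment cannot be mistaken for a scheme name. No escaping is done.
    """
    first_segment = path.split("/", 1)[0]
    if ":" in first_segment:
        return "./" + path
    return path

print(make_relative_path_reference("this:that"))  # ./this:that
print(make_relative_path_reference("a/b:c"))      # a/b:c

# A generic parser reads the unprefixed form as scheme "this" ...
print(urlsplit("this:that").scheme)               # this
# ... and the prefixed form as a plain path with no scheme.
print(urlsplit("./this:that").path)               # ./this:that
```

Only the first segment needs this treatment; a colon in any later segment, or after a leading "/", is left untouched.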

(I did not know this last year, otherwise I would have given this pointer earlier.)

@JanWielemaker
Member

Interesting. This probably does require a different set of URI encoding primitives than what is current practice, though. Notably, we need not only something to encode, but also something to create a relative URI. But who is going to call that, and where?
