Clarify handling of invalid codepoint escape sequences #164

Open
kasei opened this issue Oct 24, 2024 · 6 comments
Labels
Errata Errata management: confirmed erratum

Comments

@kasei
Contributor

kasei commented Oct 24, 2024

I think the current spec text is ambiguous about how codepoint escape sequences should be handled if they are invalid. For example:

SELECT * WHERE { ?s ?p "\\u000Z" }

I think we might want to consider adding (either normative or best-practice) text about how this case should be handled. It seems like several systems (including my own, and Jena) ignore invalid sequences, causing the above query to have a literal that starts with an escaped backslash, followed by the five characters "u000Z". Other systems might see the \u with invalid trailing characters and raise an error. Having clarity on the expected behavior here would be good.

@afs
Contributor

afs commented Oct 25, 2024

> the current spec text is ambiguous

@kasei -

I'm guessing you are referring to the text
"processed for codepoint escape sequences before parsing"

What alternative readings do you see?

Turtle handles this differently to SPARQL - it has UCHAR in the grammar, and that occurs in strings and URIs.
(I think the Turtle way is better but we are where we are.)

@kasei
Contributor Author

kasei commented Oct 25, 2024

@afs

> I'm guessing you are referring to the text "processed for codepoint escape sequences before parsing"
>
> What alternative readings do you see?

I think it's unclear what "processed" means here. Should the example I gave be an error? That is, does the mere appearance of \u imply a codepoint escape, so that trailing non-HEX characters are an error? Or is the sequence only handled as an escape if it matches the 6-character pattern \u HEX HEX HEX HEX (similarly for \U)? Usually this difference won't matter: if you don't handle a non-HEX sequence, you'll likely get a syntax error anyway, because \u isn't a valid string escape (see "Escape sequences in strings"). However, the example above shows that it is possible to get a valid query if you skip the "before parsing" stage and let the backslash of \u participate as the escaped character of a string escape sequence.
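To make the two readings concrete, here is a minimal Python sketch. It is not taken from any implementation mentioned in this thread; the function names `replace_lenient` and `replace_strict` are hypothetical, and the code assumes the pre-pass runs over the raw query text with no knowledge of string or IRI boundaries.

```python
import re

# A well-formed escape: \u + 4 hex digits, or \U + 8 hex digits.
VALID = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
# Anything that merely starts like an escape.
STARTS = re.compile(r"\\[uU]")


def _decode(match: re.Match) -> str:
    # Exactly one of the two groups is set, depending on u vs U.
    return chr(int(match.group(1) or match.group(2), 16))


def replace_lenient(query: str) -> str:
    """Reading 1: only well-formed sequences count; the rest is left alone."""
    return VALID.sub(_decode, query)


def replace_strict(query: str) -> str:
    """Reading 2: a backslash followed by u or U must start a valid escape."""
    out, pos = [], 0
    while True:
        start = STARTS.search(query, pos)
        if start is None:
            out.append(query[pos:])
            return "".join(out)
        match = VALID.match(query, start.start())
        if match is None:
            raise ValueError(f"invalid codepoint escape at offset {start.start()}")
        out.append(query[pos:start.start()])
        out.append(_decode(match))
        pos = match.end()


query = r'SELECT * WHERE { ?s ?p "\\u000Z" }'
print(replace_lenient(query) == query)  # lenient reading leaves the text untouched
try:
    replace_strict(query)
except ValueError as err:
    print("strict reading rejects the query:", err)
```

Under the lenient reading the query reaches the parser unchanged, so the grammar sees an escaped backslash followed by u000Z. Under the strict reading the same text never reaches the parser.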

> Turtle handles this differently to SPARQL - it has UCHAR in the grammar, and that occurs in strings and URIs. (I think the Turtle way is better but we are where we are.)

Agreed the Turtle handling is better, and also that "we are where we are." So I'm just looking to start a discussion on which of the two possibilities I note above is the expected behavior (if we can find consensus on that), and hoping we can add some text indicating that expectation.

@afs
Contributor

afs commented Oct 25, 2024

An alternative is to develop some rdf-tests. The discussion would reach practitioners.

Here are some more examples to add to the collection:

SELECT * WHERE { ?s ?p "\\u0041" }
SELECT * WHERE { ?s ?p "\\u0074" }
SELECT * WHERE { ?s ?p "\u005Cn" }
SELECT * WHERE { ?s ?p "\u005C\u005Cn" }

Hex x41 is A -- \A is illegal.
Hex x74 is t -- may become a tab (!).
Hex x5C is \.

There is an argument SPARQL should switch to Turtle-style on security grounds because of the obfuscation possibilities.

\u0041\u0053\u004B\u0020\u007B\u007D

which is ASK {} (x20 is a space)
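The examples above can be checked with a small Python sketch of the pre-parse pass. It assumes a single left-to-right scan that replaces only well-formed sequences, which is just one of the readings under discussion; `preprocess` is a hypothetical name, not an API of any SPARQL system.

```python
import re

# A well-formed escape: \u + 4 hex digits, or \U + 8 hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def preprocess(query: str) -> str:
    """Replace each well-formed escape with its character, in one pass."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), query)


examples = [
    r'SELECT * WHERE { ?s ?p "\\u0041" }',
    r'SELECT * WHERE { ?s ?p "\\u0074" }',
    r'SELECT * WHERE { ?s ?p "\u005Cn" }',
    r'SELECT * WHERE { ?s ?p "\u005C\u005Cn" }',
    r"\u0041\u0053\u004B\u0020\u007B\u007D",
]

for text in examples:
    # Show the raw query next to what the grammar would then have to parse.
    print(text, "=>", preprocess(text))
```

In this sketch the string bodies that reach the grammar are \A, \t, \n and \\n respectively, and the last input becomes a complete query even though it contains no SPARQL keyword in its raw form.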

@kasei
Contributor Author

kasei commented Oct 25, 2024

Agreed. I can try to work on a PR with some tests in this area (using both approaches) and we could solicit feedback from implementors.

afs added the Errata (Errata management: confirmed erratum) label Oct 29, 2024
@afs
Contributor

afs commented Oct 29, 2024

@kasei - thank you for the tests.

I think there are some specific points with the current spec text that are "errata":

  • What is a codepoint escape sequence? Is it anything that starts with \u/\U, or is it only a codepoint escape sequence if it is valid?
  • How is replacement done? Is it left-to-right? What about overlaps? Is it applied repeatedly?
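The second point can be illustrated with a short Python sketch comparing a single left-to-right pass against repeated application. Both functions are hypothetical illustrations of possible readings, not behavior defined by the spec text.

```python
import re

# A well-formed escape: \u + 4 hex digits, or \U + 8 hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def single_pass(text: str) -> str:
    """One left-to-right scan; replacement output is never rescanned."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def repeated(text: str) -> str:
    """Re-run the scan until a pass changes nothing."""
    while True:
        result = single_pass(text)
        if result == text:
            return text
        text = result


# The first escape decodes to a backslash that sits directly before "u0041".
nested = r"\u005Cu0041"
print(single_pass(nested))
print(repeated(nested))
```

With a single pass the result still contains the text of an escape sequence (a backslash followed by u0041); with repeated application that text is decoded again to A. The loop terminates because every replacement shortens the input.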

@kasei
Contributor Author

kasei commented Oct 29, 2024

Yes, we can raise the issues in the errata. Hopefully we can discuss in the WG and get clarity on the issue so that we can also address those issues in 1.2.
