Clarify handling of invalid codepoint escape sequences #164

kasei · 2024-10-24T23:44:27Z

I think the current spec text is ambiguous about how codepoint escape sequences should be handled if they are invalid. For example:

SELECT * WHERE { ?s ?p "\\u000Z" }

I think we might want to consider adding (either normative or best-practice) text about how this case should be handled. It seems that several systems (including my own, and Jena) ignore invalid sequences, so the above query has a literal that starts with an escaped backslash, followed by the five characters "u000Z". Other systems might see the \u with invalid trailing characters and raise an error. Having clarity on the expected behavior here would be good.

afs · 2024-10-25T08:03:58Z

> the current spec text is ambiguous

@kasei -

I'm guessing you are referring to the text
"processed for codepoint escape sequences before parsing"

What alternative readings do you see?

Turtle handles this differently from SPARQL: it has UCHAR in the grammar, and that occurs in strings and URIs.
(I think the Turtle way is better but we are where we are.)

kasei · 2024-10-25T14:17:00Z

@afs –

> I'm guessing you are referring to the text "processed for codepoint escape sequences before parsing"
>
> What alternative readings do you see?

I think it's unclear what "processed" means here. Should the example I gave be an error? That is, does the mere appearance of \u imply a codepoint escape, so that trailing non-HEX characters are an error? Or is the sequence only handled as an escape if it matches the 6-character pattern \u HEX HEX HEX HEX (and similarly the 10-character pattern for \U)? Usually this difference won't matter: if you leave a non-HEX sequence unprocessed, you'll likely get a syntax error later anyway, because \u isn't valid elsewhere in the grammar (e.g. it is not one of the "Escape sequences in strings"). However, the example above shows that it is possible to get a valid query if the "before parsing" stage leaves the sequence alone and the backslash of \u then participates as the escaped character of a string escape sequence.
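To make the two readings concrete, here is a minimal Python sketch. It is only an illustration, not taken from any of the implementations mentioned; the function names `decode_lenient` and `decode_strict` are made up for this example. Both treat the pre-pass as a plain text substitution that knows nothing about string escapes, and neither deals with surrogates or values above U+10FFFF.

```python
import re

# Well-formed escapes: \u + 4 hex digits, or \U + 8 hex digits.
_VALID = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
# Any backslash followed by u or U, well-formed or not.
_ANY = re.compile(r"\\[uU]")


def decode_lenient(query: str) -> str:
    """Reading 1: replace well-formed escapes, leave everything else alone."""
    return _VALID.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), query)


def decode_strict(query: str) -> str:
    """Reading 2: any backslash-u that is not well-formed is an error."""
    for m in _ANY.finditer(query):
        if not _VALID.match(query, m.start()):
            raise ValueError(f"invalid codepoint escape at offset {m.start()}")
    return decode_lenient(query)


# Raw string, so the query text contains two backslashes before u000Z.
query = r'SELECT * WHERE { ?s ?p "\\u000Z" }'

print(decode_lenient(query))  # unchanged; the parser later sees an escaped backslash
try:
    decode_strict(query)
except ValueError as err:
    print("strict reading rejects the query:", err)
```

Under the lenient reading the query text reaches the parser unchanged and is valid; under the strict reading it is rejected before parsing starts.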

> Turtle handles this differently from SPARQL: it has UCHAR in the grammar, and that occurs in strings and URIs. (I think the Turtle way is better but we are where we are.)

Agreed the Turtle handling is better, and also that "we are where we are." So I'm just looking to start a discussion on which of the two possibilities I note above is the expected behavior (if we can find consensus on that), and hoping we can add some text indicating that expectation.

afs · 2024-10-25T16:44:21Z

An alternative is to develop some rdf-tests. The discussion would reach practitioners.

Here are some more examples to add to the collection:

SELECT * WHERE { ?s ?p "\\u0041" }
SELECT * WHERE { ?s ?p "\\u0074" }
SELECT * WHERE { ?s ?p "\u005Cn" }
SELECT * WHERE { ?s ?p "\u005C\u005Cn" }

Hex x41 is A -- \A is illegal.
Hex x74 is t -- may become a tab (!).
Hex x5C is \.
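As a rough illustration of why these cases are awkward, the following Python sketch runs the string bodies through two stages: a single left-to-right codepoint pass over the raw text, then ordinary string-escape handling. The names `stage1`, `stage2` and `ECHAR` are invented for this sketch, and the single-pass behavior is an assumption about one possible reading of the spec, not a statement of what the spec requires.

```python
import re

CODEPOINT = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# String escapes allowed inside a SPARQL string literal.
ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
         '"': '"', "'": "'", "\\": "\\"}


def stage1(text: str) -> str:
    """Codepoint pass: one left-to-right substitution over the raw text."""
    return CODEPOINT.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def stage2(body: str) -> str:
    """String-escape pass over the body of a string literal."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body) or body[i + 1] not in ECHAR:
                raise ValueError(f"illegal string escape at offset {i}")
            out.append(ECHAR[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# Bodies of the four string literals above (raw strings, quotes omitted).
for body in [r"\\u0041", r"\\u0074", r"\u005Cn", r"\u005C\u005Cn"]:
    after = stage1(body)
    try:
        print(f"{body} -> {after} -> {stage2(after)!r}")
    except ValueError as err:
        print(f"{body} -> {after} -> error: {err}")
```

With this ordering the first body becomes `\A` and fails, the second becomes a tab, the third a newline, and the fourth a backslash followed by `n`.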

There is an argument SPARQL should switch to Turtle-style on security grounds because of the obfuscation possibilities.

\u0041\u0053\u004B\u0020\u007B\u007D

which is ASK {}

kasei · 2024-10-25T16:56:19Z

Agreed. I can try to work on a PR with some tests in this area (using both approaches) and we could solicit feedback from implementors.

afs · 2024-10-29T13:15:40Z

@kasei - thank you for the tests.

I think there are some specific points with the current spec text that are "errata":

- What is a codepoint escape sequence? Is it something that starts \u/\U or is it only a codepoint escape sequence if it is valid?
- How is replacement done? Is it left-to-right? What about overlaps? Is it applied repeatedly?
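The "applied repeatedly" question can be shown with a small Python sketch. The helper names `replace_once` and `replace_until_stable` are hypothetical, and neither behavior is claimed to be what the spec intends; the point is only that the two give different results for the same input.

```python
import re

CODEPOINT = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def replace_once(text: str) -> str:
    """Single left-to-right pass; text produced by a replacement is not rescanned."""
    return CODEPOINT.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def replace_until_stable(text: str) -> str:
    """Repeat the pass until nothing changes any more."""
    while True:
        replaced = replace_once(text)
        if replaced == text:
            return text
        text = replaced


# \u005C is a backslash, so one pass yields the text of a new escape, \u0041.
sample = r"\u005Cu0041"
print(replace_once(sample))          # \u0041
print(replace_until_stable(sample))  # A
```

A single pass leaves a fresh `\u0041` in the text, while repeated application decodes it again to `A`.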

kasei · 2024-10-29T16:44:10Z

Yes, we can raise the issues in the errata. Hopefully we can discuss in the WG and get clarity on the issue so that we can also address those issues in 1.2.

kasei mentioned this issue Oct 28, 2024

Add syntax tests for codepoint escaping. w3c/rdf-tests#151

Open

afs added the Errata label (Errata management: confirmed erratum) Oct 29, 2024
