syn-invalid-codepoint-escaped-bad-01.rq should pass #167

lu-pl · 2025-01-28T08:31:21Z

A test for a simple Lark SPARQL 1.1 parser fails for sparql/sparql11/syntax-query/syn-invalid-codepoint-escaped-bad-01.rq because the parser accepts the query instead of rejecting it.

SELECT * WHERE {
	?s <http://a.example/p1> '\uD800'
}

I feel like this should pass actually, since the object is just a string literal.

I tried this query with the GraphDB and Wikidata/Blazegraph SPARQL interfaces and both accept the query.


Tpt · 2025-01-28T08:36:39Z

Thank you for reporting this! RDF literal values must be valid UTF-8 strings, and \uD800 is a high surrogate, so it is not valid as a single code point in UTF-8. My guess is that GraphDB and Blazegraph both fail to reject this query because they are written in Java, and Java Strings allow unpaired surrogates.
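
As a small illustration (in Python, chosen only because the parser under test is Lark-based), a Python str can hold a lone surrogate much like a Java String can, but a strict UTF-8 encoder refuses it. The helper name `is_valid_utf8_text` is hypothetical:

```python
def is_valid_utf8_text(text: str) -> bool:
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points U+D800..U+DFFF.
        return False


print(is_valid_utf8_text("plain string"))  # an ordinary string encodes fine
print(is_valid_utf8_text("\ud800"))        # a lone high surrogate does not
```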

afs · 2025-01-28T11:01:11Z

Jena gets this right (negative syntax) in strict SPARQL 1.1/1.2 parsing; Jena gets it wrong in normal mode, and that's a bug.

To pass the test, a system has to check the string for surrogates and reject them because they are not legal in an XSD string (which is what that literal is). In RDF 1.2 there are "RDF Strings" (no surrogates) for lexical forms which will make it clearer.
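
A minimal sketch of such a check in Python, assuming the lexical form is already available as a decoded string. The names `has_surrogate` and `check_lexical_form` are made up for illustration and are not Jena's or any other system's API:

```python
def has_surrogate(lexical_form: str) -> bool:
    """Return True if any code point lies in the surrogate range."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in lexical_form)


def check_lexical_form(lexical_form: str) -> str:
    """Return the lexical form unchanged, or raise if it holds a surrogate."""
    if has_surrogate(lexical_form):
        raise ValueError("surrogate code point in string literal")
    return lexical_form
```

A check like this could run as a post-processing step on every string literal the grammar has accepted.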

The way \u and \U are handled in SPARQL is different to Turtle, and IMO Turtle is the right way, SPARQL is the wrong way (but it's too late to change it). The SPARQL spec puts \u and \U decoding as part of the input stream handling. See #67.
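
To make the SPARQL reading concrete, here is a rough Python sketch of decoding \u and \U over the whole query text before tokenizing. The function name `decode_codepoint_escapes` and the regular expression are assumptions for illustration, not code from any real parser, and the sketch deliberately ignores the question of how a backslash-escaped backslash in front of `u` should be treated:

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits)
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(query_text: str) -> str:
    """Replace codepoint escapes in the raw query text, rejecting surrogates."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(
                f"surrogate U+{code_point:04X} at offset {match.start()}"
            )
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {match.start()}")
        return chr(code_point)

    return _ESCAPE.sub(replace, query_text)
```

With a pre-pass like this, the query from the test would be rejected before the grammar ever sees it, while an escape such as \u0041 would simply become `A`.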

JavaCC has built-in support for bytes-to-java that includes \u (but not \U) handling.
Essentially, it has its own bytes-to-character decoder. JavaCC has to be configured to use it.

Jena's two parsers (SPARQL strict and normal-with-extensions) configure JavaCC differently.
Normal mode reads without JavaCC decoding, and it allows Unicode escapes like Turtle does, including \U in strings.
SPARQL strict uses JavaCC's handling.

(And Java's character handling of UTF-8 is not strict, as @Tpt says.)

lu-pl · 2025-01-28T14:53:37Z

Very interesting, thank you for the explanation!

So, I am wondering: should a parser syntactically reject the above query, or is the test actually targeting functionality that should be implemented in a pre- or post-processing step, separate from parsing? I think that grammatically the above should be fine; see String in the grammar.

afs · 2025-01-29T10:26:14Z

I checked the spec and ... it's not completely clear. Issue raised: w3c/sparql-query#189

> Should a parser syntactically reject the above query

rdf-test is useful in reflecting community interpretation and agreement.
The community thinks "yes".

'\uD800' is bad in two ways:

- Lone surrogates are not legal.
- Surrogates as pairs are not allowed in UTF-8.
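
The second point can be seen with a strict UTF-8 decoder. In this Python sketch, the byte values are the three-byte encodings of the surrogate pair D83D DE00, used here only as an example; the helper name `decodes_as_utf8` is hypothetical:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+1F600 encoded directly as one four-byte UTF-8 sequence.
direct = "\U0001F600".encode("utf-8")

# The same character written as two separately encoded surrogates
# (U+D83D, U+DE00); strict UTF-8 does not permit this form.
as_surrogate_pair = b"\xed\xa0\xbd\xed\xb8\x80"

print(decodes_as_utf8(direct))
print(decodes_as_utf8(as_surrogate_pair))
```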

afs mentioned this issue Jan 29, 2025

SPARQL String. Unicode escapes exclude surrogates. w3c/sparql-query#190

Open
