syn-invalid-codepoint-escaped-bad-01.rq should pass #167

Open
lu-pl opened this issue Jan 28, 2025 · 4 comments

lu-pl commented Jan 28, 2025

A test for a simple Lark SPARQL 1.1 parser fails for sparql/sparql11/syntax-query/syn-invalid-codepoint-escaped-bad-01.rq because the parser accepts the query instead of rejecting it.

SELECT * WHERE {
	?s <http://a.example/p1> '\uD800'
}

I feel like this should pass actually, since the object is just a string literal.

I tried this query with the GraphDB and Wikidata/Blazegraph SPARQL interfaces and both accept the query.

Tpt (Contributor) commented Jan 28, 2025

Thank you for reporting this! RDF literal values must be valid UTF-8 strings, and \uD800 is a high surrogate, so it is not valid as a single code point in UTF-8. My guess is that GraphDB and Blazegraph do not reject this query because they are written in Java, and Java Strings allow unpaired surrogates.
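For illustration, here is a minimal Python sketch of the same point. Like Java strings, a Python `str` can hold an unpaired surrogate, but the strict UTF-8 encoder refuses to serialize it. The helper name `is_utf8_encodable` is made up for this example and is not part of any parser discussed here.

```python
def is_utf8_encodable(text: str) -> bool:
    """Return True if `text` can be serialized as strict UTF-8."""
    try:
        # The default "strict" error handler raises on surrogate code points.
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(is_utf8_encodable("\ud800"))      # False: lone high surrogate
print(is_utf8_encodable("caf\u00e9"))   # True: ordinary non-ASCII character
```

This suggests why an implementation whose string type tolerates surrogates can accept the query without noticing: the problem only surfaces when the value is actually encoded or explicitly checked.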

afs (Contributor) commented Jan 28, 2025

Jena gets this right (negative syntax) in strict SPARQL 1.1/1.2 parsing; Jena gets it wrong in normal mode, and that's a bug.

To pass the test, a system has to check the string for surrogates and reject them because they are not legal in an XSD string (which is what that literal is). In RDF 1.2 there are "RDF Strings" (no surrogates) for lexical forms which will make it clearer.
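A check of this kind could look roughly like the following Python sketch, run on the lexical form after escapes have been decoded. The function names `find_surrogates` and `check_lexical_form` are hypothetical, and the sketch assumes Python 3.9+ and a string type that keeps surrogates as individual code points (as Python's `str` does).

```python
def find_surrogates(lexical_form: str) -> list[int]:
    """Return the indexes of all surrogate code points (U+D800 to U+DFFF)."""
    return [
        index
        for index, char in enumerate(lexical_form)
        if 0xD800 <= ord(char) <= 0xDFFF
    ]


def check_lexical_form(lexical_form: str) -> str:
    """Return the lexical form unchanged, or raise if it contains a surrogate."""
    positions = find_surrogates(lexical_form)
    if positions:
        code_point = ord(lexical_form[positions[0]])
        raise ValueError(
            f"surrogate U+{code_point:04X} at index {positions[0]} "
            "is not allowed in a string literal"
        )
    return lexical_form
```

With this sketch, `check_lexical_form("\ud800")` raises `ValueError`, while a literal written with a non-BMP character directly (for example `"\U0001F600"`) passes, because it is a single code point rather than a surrogate pair.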

The way \u and \U are handled in SPARQL is different to Turtle, and IMO Turtle is the right way, SPARQL is the wrong way (but it's too late to change it). The SPARQL spec puts \u and \U decoding as part of the input stream handling. See #67.
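To make the "input stream" model concrete, here is a rough Python sketch of a pre-pass that decodes \u and \U over the whole query text before any tokenizing happens. It is a simplification under stated assumptions, not how Jena or JavaCC do it: the name `decode_codepoint_escapes` is invented, and the regex does not try to handle edge cases such as an escaped backslash directly before the `u`.

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits), anywhere in the query text.
_CODEPOINT_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(query: str) -> str:
    """Replace codepoint escapes in the raw query text, rejecting surrogates.

    Hypothetical pre-processing step: it runs before the grammar sees the
    query, so escapes are decoded in every position, not only inside strings.
    """

    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        code_point = int(hex_digits, 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate escape {match.group(0)} is not allowed")
        if code_point > 0x10FFFF:
            raise ValueError(f"escape {match.group(0)} is outside the Unicode range")
        return chr(code_point)

    return _CODEPOINT_ESCAPE.sub(replace, query)
```

Under this model, `decode_codepoint_escapes(r"SELECT * WHERE { ?s ?p '\u0041' }")` yields a query containing `'A'`, and the query from this issue fails in the pre-pass, before the grammar's String rule is ever consulted.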

JavaCC has built-in support for bytes-to-java that includes \u (but not \U) handling.
Essentially, it has its own bytes-to-character decoder. JavaCC has to be configured to use it.

Jena's two parsers (SPARQL strict and normal-with-extensions) configure JavaCC differently.
Normal mode reads without JavaCC decoding, and it allows Unicode escapes like Turtle does, including \U in strings.
SPARQL strict uses JavaCC's handling.

(And Java's character handling of UTF-8 is not strict, as @Tpt says.)

lu-pl (Author) commented Jan 28, 2025

Very interesting, thank you for the explanation!

So, I am wondering: Should a parser syntactically reject the above query or is the test actually targeting functionality that should be implemented in a pre- or post-processing step and apart from parsing? Because I think that grammatically, the above should be fine, see String in the grammar.

afs (Contributor) commented Jan 29, 2025

I checked the spec and ... it's not completely clear. Issue raised: w3c/sparql-query#189

> Should a parser syntactically reject the above query

rdf-test is useful in reflecting community interpretation and agreement.
The community thinks "yes".

'\uD800' is bad in two ways:

  • Lone surrogates are not legal.
  • Surrogates as pairs are not allowed in UTF-8.
