Regular expressions with Unicode #44
Comments
The pattern is but a simple escaped regex. You need not look into the XSD escaping rules, which I do not see mentioned by the SHACL spec. This totally works: `[ sh:pattern "\\S+" ]`
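To make the escaping layer concrete: in Turtle, `\\` inside a quoted string denotes a single backslash, so the literal `"\\S+"` carries the three-character regex `\S+`. TypeScript string literals unescape the same way, which is why the minimal sketch below uses one as a stand-in. The names `patternValue` and `matchesPattern` are illustrative only, and the JavaScript `RegExp` engine here is an assumption for demonstration, not a statement about how any particular SHACL implementation evaluates `sh:pattern`.

```typescript
// The source text "\\S+" unescapes to the three characters \S+,
// just as the Turtle literal "\\S+" does.
const patternValue: string = "\\S+";

// Compile the unescaped value. Matching is unanchored, like SPARQL REGEX.
const nonWhitespace = new RegExp(patternValue);

// Hypothetical helper: true when the value contains at least one
// non-whitespace character.
function matchesPattern(value: string): boolean {
  return nonWhitespace.test(value);
}

console.log(patternValue.length);
console.log(matchesPattern("The Hague"));
console.log(matchesPattern("   "));
```

Because the match is unanchored, a value only fails when it consists entirely of whitespace (or is empty).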
@tpluscode Thanks for your response!
I see the following trail when I look through the standard:
The main discussion, I think, is whether XSD 1.0 or XSD 1.1 should be used. To be honest, I like your regex notation better, since it is a bit simpler :-). However, I can imagine that there is a benefit to following the specification. There may be cases in which a regular expression stored in SHACL can be matched and reused in a SPARQL query. (I'm not sure whether this is a good use case, but what I'm getting at is that using the same regex notation across SHACL, SPARQL and XSD may facilitate crossover use cases.)
I must admit I am a little confused myself, not having dug deep before. You seem correct about how you followed your nose from the SHACL to the XSD specs. Section 7.1 of XPath seems to suggest that XSD 1.1 should be used, does it? That said, the examples in the SHACL spec do use the simple escaping (it's pretty much just the backslash). And FWIW the section for
This is definitely valid SPARQL :) `filter ( regex( ?name, "^\\S+" ) )`
@tpluscode Thanks, XSD 1.1 indeed seems to be the intended standard for regex in SHACL (and SPARQL). I do not have enough knowledge of XSD to determine whether
Maybe Whatever the case may be, some regex strings that seem to be valid in XSD 1.1 do not seem to be supported by this SHACL library. Maybe this is not so bad: the XSD 1.1 standard is sufficiently unreadable to prevent large groups of users from picking up the regex grammar described in it. Maybe the de facto way of writing regex is more popular.
Thanks for maintaining this great library!
Observation
I'm unable to properly validate place names using `sh:pattern`. Place names may include spaces, single quotes, hyphens, and some non-ASCII Unicode characters. Examples of place names that should succeed are `'s-Gravenhage`, `The Hague`, and `Köln`.
If I understand the somewhat cryptic XSD standard (link), then this should be expressible in the following way:
But the following data does not validate:
Since many natural languages include characters that do not occur in simple ASCII ranges like `[A-Za-z]`, and because natural language information is very common in RDF data, support for validating Unicode strings in `sh:pattern` is useful in many cases.
Expected
The ability to use category escapes in `sh:pattern`, specifically for natural language content for which simple ranges are difficult or impossible to express.
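As an illustration of what category escapes buy for place names, here is a minimal sketch using JavaScript's `RegExp` from TypeScript. The pattern `^[\p{L}\p{Zs}'-]+$` is a hypothetical reconstruction, not the pattern from the issue (which is not reproduced above), and the explicit `^`/`$` anchors are an assumption made because SPARQL-style matching is unanchored. One fact worth noting: in JavaScript, `\p{...}` is only interpreted as a Unicode property escape when the `u` flag is set; whether that is related to the behaviour reported here is not established by this thread.

```typescript
// Hypothetical place-name pattern: letters (\p{L}), space separators (\p{Zs}),
// apostrophes and hyphens. Backslashes are doubled, as in a Turtle literal.
const placeNamePattern: string = "^[\\p{L}\\p{Zs}'-]+$";

// With the "u" flag, \p{L} and \p{Zs} are Unicode general-category escapes.
const unicodeAware = new RegExp(placeNamePattern, "u");

// Without the flag, \p is just the letter "p" and the braces are literal
// characters, so the class silently matches something else entirely.
const legacyMode = new RegExp(placeNamePattern);

for (const name of ["'s-Gravenhage", "The Hague", "Köln", "Köln 4711"]) {
  console.log(name, unicodeAware.test(name), legacyMode.test(name));
}
```

The three place names from the issue match in Unicode mode and fail in legacy mode, while `Köln 4711` fails in both because digits are not in the character class.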