How to validate literals based on their datatype IRI? #46

Closed
wouterbeek opened this issue Jan 1, 2021 · 5 comments

Comments

@wouterbeek

wouterbeek commented Jan 1, 2021

I do not understand how literals should be validated based on their datatype IRI. I make the following observations:

  1. For some literals, specifying the datatype IRI with sh:datatype seems to suffice to also check their lexical form. An example of this is xsd:boolean, where the lexical form "-false" is currently not accepted because the minus sign is not part of the syntax for Boolean lexical forms.

  2. For other literals, specifying the datatype IRI with sh:datatype does not seem sufficient, since incorrect lexical forms are still accepted. An example of this is xsd:double, for which "--1.1e0" is accepted, even though the double occurrence of the hyphen is not supported by the floating-point syntax.

  3. At the same time, it is also not clear how regular expressions could be specified manually to compensate for the absence of lexical form validation (see Regular expressions with Unicode #44 for generic issues with the way in which regular expressions are currently supported). For example, specifying the regular expression sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" copied from the XSD standard alongside sh:datatype xsd:double still validates literals like "--1.1e1"^^xsd:double as ok, even though they violate both the datatype IRI and the regular expression specifications.

At the moment it is difficult for me to determine what is intended behavior and what is a bug. It would be great if SHACL could be used to validate literals, but I am not sure whether (1) such validation is indeed intended by the SHACL standard, and whether (2) it is technologically feasible to implement such validation with contemporary technology.

@tpluscode
Collaborator

Could you provide the above cases complete with shapes and sample data?

Also, please check with the SHACL playground to see what the results are there.

@wouterbeek
Author

wouterbeek commented Jan 1, 2021

@tpluscode I have not done anything complicated yet. I think that even the most simple things like the XSD literals do not work. I can still share my files of course :-)

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
[ a <C>;
  <p> "-false"^^xsd:boolean; # This will not validate when `sh:datatype xsd:boolean` is used.
  <r> "--1.1e0"^^xsd:double ]. # This will validate when `sh:datatype xsd:double` is used.

And this is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <p> ],
    [ sh:datatype xsd:double;
      sh:path <r>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ]; # This does not do anything at all IIUC.
  sh:targetClass <C> ].

@wouterbeek
Author

I have added a couple more examples. These are mostly copied and pasted from the XSD standard. I have replaced backslashes with double backslashes, since this seems to be required. Since I do not know the Regex grammar, I do not know whether the Regexes are valid (the library does not give feedback when a Regex cannot be processed).
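The doubled backslashes come from Turtle string escaping: inside a quoted literal, `\\` stands for one backslash, while `\+` and `\.` are not valid Turtle escapes. The regex engine therefore receives single backslashes. A minimal Python sketch of that step, assuming only the `\\` escape needs resolving (the helper name `unescape_backslashes` is made up for illustration; a real Turtle parser does this for all escapes):

```python
import re

# Text exactly as it appears between the quotes in the Turtle file.
turtle_source = r"(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)"

def unescape_backslashes(lexical: str) -> str:
    # Hypothetical helper: resolves only the doubled-backslash escape.
    # A real Turtle parser also handles tab, newline, quote and Unicode escapes.
    return lexical.replace("\\\\", "\\")

# This is the pattern the regex engine actually sees.
regex = unescape_backslashes(turtle_source)
print(regex)  # (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)

# Matched against the whole string, it behaves like the XSD decimal grammar.
print(re.fullmatch(regex, "-02.20") is not None)  # True
print(re.fullmatch(regex, "--1.1") is not None)   # False
```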

This is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <boolean>;
      sh:pattern "false|true|0|1" ],
    [ sh:datatype xsd:date;
      sh:path <date>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:dateTime;
      sh:path <dateTime>;
      sh:pattern """
-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))
(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?""" ],
    [ sh:datatype xsd:decimal;
      sh:path <decimal>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ],
    [ sh:datatype xsd:double;
      sh:path <double>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([Ee](\\+|-)?[0-9]+)? |(\\+|-)?INF|NaN" ],
    [ sh:datatype xsd:duration;
      sh:path <duration>;
      sh:pattern """
-?P( ( ( [0-9]+Y([0-9]+M)?([0-9]+D)?
       | ([0-9]+M)([0-9]+D)?
       | ([0-9]+D)
       )
       (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+(\\.[0-9]+)?S)
          )
       )?
    )
  | (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+(\\.[0-9]+)?S)
       )
    )
  )""" ],
    [ sh:datatype xsd:gMonth;
      sh:path <gMonth>;
      sh:pattern "--(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYear;
      sh:path <gYear>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYearMonth;
      sh:path <gYearMonth>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:string;
      sh:path <string>;
      sh:pattern "\\S" ],
    [ sh:datatype xsd:time;
      sh:path <time>;
      sh:pattern "(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ];
  sh:targetClass <C> ].

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<i>
  a <C>;
  <boolean> false, "0"^^xsd:boolean;
  <date> "-1-01-01"^^xsd:date;
  <dateTime> "-1-01-01T00:00:00-00:00"^^xsd:dateTime;
  <decimal> -01.10, "-02.20"^^xsd:decimal;
  <double> -1.1e+0, "-2.2e+0"^^xsd:double;
  <duration> "-1-01-01T00:00:00-00:00"^^xsd:duration;
  <gMonth> "--01"^^xsd:gMonth;
  <gYear> "-1"^^xsd:gYear, "111111"^^xsd:gYear;
  <gYearMonth> "-1-01Z"^^xsd:gYear, "111111-01Z"^^xsd:gYear;
  <string> "😺", "😺"^^xsd:string;
  <time> "00:00:00-00:00"^^xsd:time.

Since Regex is a crude approach for validating lexical forms, it would be better if lexical forms could also be validated by specifying the datatype IRI (sh:datatype). If that is not feasible, then having proper Regex support would at least allow us to add sh:pattern triples based on the presence of sh:datatype triples.
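One thing worth checking in the shapes file above: the dateTime and duration patterns are written across several lines, and a regex engine treats those line breaks and spaces as literal characters unless it runs in a free-spacing mode. The sketch below illustrates this with Python's `re` module and the date portion of the dateTime pattern. Whether a given SHACL engine offers an equivalent mode (for example through sh:flags "x") is an assumption that would need to be verified; stripping the whitespace from the pattern is the portable option.

```python
import re

# Date portion of the dateTime pattern, with the line breaks it has inside
# the triple-quoted Turtle literal (backslash escapes already resolved).
multiline = """
-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])"""

value = "2021-01-01"

# As written: the pattern demands a newline before the year, so it fails.
as_written = re.fullmatch(multiline, value) is not None

# Free-spacing mode: whitespace in the pattern is ignored.
verbose = re.fullmatch(multiline, value, re.VERBOSE) is not None

# Portable alternative: remove the whitespace before using the pattern.
stripped = re.fullmatch(re.sub(r"\s+", "", multiline), value) is not None

print(as_written, verbose, stripped)  # False True True
```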

@tpluscode
Collaborator

tpluscode commented Jan 4, 2021

After looking at your examples in the SHACL playground and the spec I have a few observations:

  1. Boolean behaves incorrectly: the library evaluates the truthiness of the literal, so 0 becomes false and pretty much anything else becomes true. We probably inherited that issue too.
  2. Did you get those regexes from W3C XML Schema? I think the whitespace is a problem in some of them. For example, the double expression has a space before the |(\\+|-)?INF|NaN alternatives. Remove that space and it will work.
  3. Otherwise, you will need to add the start/end-of-line anchors ^ and $. Without them, you risk matching only a portion of the literal.
  4. Strangely, decimal actually gets validated by the datatype constraint alone.
  5. The regex created by the library probably needs the u flag to handle Unicode correctly.
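Points 2 and 3 above can be checked with a short sketch. It uses Python's `re.search`, which, like the SPARQL REGEX function that sh:pattern is defined in terms of, succeeds when the pattern matches anywhere in the string. The variable names are illustrative only. Note that the anchors have to wrap the whole alternation in a group; otherwise `^` and `$` apply only to the first and last alternatives.

```python
import re

# The double pattern as posted (Turtle escapes resolved), including the
# stray space before the "|".
as_posted = r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)? |(\+|-)?INF|NaN"

# Same pattern with the space removed, still unanchored.
unanchored = as_posted.replace(" |", "|")

# Space removed and the whole alternation anchored.
anchored = "^(" + unanchored + ")$"

def matches(pattern: str, value: str) -> bool:
    # Substring semantics: a match anywhere in the value counts.
    return re.search(pattern, value) is not None

# The stray space makes a valid double fail (a trailing space is required).
print(matches(as_posted, "-2.2e+0"))   # False

# Without anchors, the substring "-1.1e0" is enough for a match.
print(matches(unanchored, "--1.1e0"))  # True

# With anchors, only complete lexical forms are accepted.
print(matches(anchored, "--1.1e0"))    # False
print(matches(anchored, "-2.2e+0"))    # True
print(matches(anchored, "-INF"))       # True
```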

Now, while the spec does not mention checking the lexical correctness of literals, it could be added as an option to the library. What do you think @martinmaillard ?

@martinmaillard
Contributor

This library already uses rdf-validate-datatype to validate the lexical correctness of literals. So if something gets validated wrong, it's probably a bug.
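For illustration, datatype-driven lexical validation can be sketched as a lookup from datatype IRI to a grammar that must match the entire lexical form. This is not how rdf-validate-datatype is implemented; the names `LEXICAL_PATTERNS` and `is_well_formed` are hypothetical, and only three datatypes are covered, with patterns taken from the XSD grammars quoted earlier in this thread.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical table: datatype IRI -> regex for its lexical space.
LEXICAL_PATTERNS = {
    XSD + "boolean": r"true|false|1|0",
    XSD + "decimal": r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)",
    XSD + "double": (
        r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?"
        r"|(\+|-)?INF|NaN"
    ),
}

def is_well_formed(lexical: str, datatype: str) -> bool:
    pattern = LEXICAL_PATTERNS.get(datatype)
    if pattern is None:
        # Unknown datatype: this sketch has nothing to check.
        return True
    # fullmatch requires the pattern to cover the whole lexical form.
    return re.fullmatch(pattern, lexical) is not None

print(is_well_formed("-false", XSD + "boolean"))  # False
print(is_well_formed("0", XSD + "boolean"))       # True
print(is_well_formed("--1.1e0", XSD + "double"))  # False
print(is_well_formed("-2.2e+0", XSD + "double"))  # True
```

With a check of this kind, both "-false"^^xsd:boolean and "--1.1e0"^^xsd:double from the first data file would be reported as ill-formed without any sh:pattern.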


3 participants