mention target and b59-cube #128

Merged
15 commits merged into main from validate-docs on Feb 14, 2024
Conversation

giacomociti (Contributor)

  • mention the new commands in barnard59-cube
  • mention that the SHACL target is added by the validation tool


changeset-bot (bot) commented Dec 5, 2023

⚠️ No Changeset found

Latest commit: 18e82f9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


giacomociti requested a review from tpluscode on December 5, 2023 13:48
tpluscode (Contributor) left a comment


As you mentioned yourself, I would hold this and change the docs to use the b59 pipeline commands.

Comment on lines 59 to 60
The constraint should be a SHACL shape but it's not expected to have any [target declaration](https://www.w3.org/TR/shacl/#targets).
The validation tool takes care of making all the observations a target for the constraint.
Contributor


@giacomociti good point 👍

nit: I'd even suggest:

  • it's expected to not have any [target declaration]

btw: this would also be worth mentioning in https://cube.link/#cubeconstraints - or would there be anything against that?

tpluscode (Contributor) left a comment


In addition to the comments in code, I would also add a new section to show example usage of the commands b59 cube fetch-metadata and b59 cube fetch-observations.
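For instance, something along these lines (just a sketch: the --endpoint/--cube option names and the URLs are assumptions on my part, to be checked against the actual CLI):

# endpoint/cube option names and URLs below are placeholders, not verified against the CLI
npx barnard59 cube fetch-metadata \
    --endpoint https://example.org/query \
    --cube https://example.org/my-cube \
  > metadata.ttl

npx barnard59 cube fetch-observations \
    --endpoint https://example.org/query \
    --cube https://example.org/my-cube \
  > observations.ttl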

documentation/tools.md: outdated review comments (resolved)
@@ -19,63 +19,88 @@ An example Cube is specified in [cube.ttl](cube.ttl). The cube provides a constr
### Validate the cube
Contributor


I would actually propose making this a lvl 2 header and raising subsections with it. I see no reason to nest under Example Cube.

Contributor


💯

giacomociti (author)


Done; also renamed "Validate the cube" to "Validating Cubes".

Rdataflow (Contributor) commented Jan 10, 2024

Hi @giacomociti

Thank you for your important work on an improved validation method. We're glad to share our review with you:

  • fix the new method to run on Windows as well... (the current tool runs perfectly)

    $ cat cube.ttl | npx b59 cube check-observations --constraint constraint.ttl
    (node:3140) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
    (Use `node --trace-deprecation ...` to show where the warning was created)
    error: could not load step b30_sortBySubject {"iri":"http://barnard59.zazuko.com/pipeline/cube-validation/check-observations","stack":"Error: could not load step b30_sortBySubject\n    at file:///C:/temp/validation-test/node_modules/barnard59-core/lib/factory/step.js:23:19\n    at async Pipeline.onInit (file:///C:/temp/validation-test/node_modules/barnard59-core/lib/factory/pipeline.js:46:35)\n    at async Pipeline._init (file:///C:/temp/validation-test/node_modules/barnard59-core/lib/Pipeline.js:65:13)\n    at async file:///C:/temp/validation-test/node_modules/barnard59/runner.js:19:7\n    at async file:///C:/temp/validation-test/node_modules/barnard59/lib/cli/runAction.js:34:51\n    at async default (file:///C:/temp/validation-test/node_modules/barnard59/lib/cli/runAction.js:27:3)\n    at async Command.parseAsync (C:\\temp\\validation-test\\node_modules\\commander\\lib\\command.js:940:5)\n    at async default (file:///C:/temp/validation-test/node_modules/barnard59/lib/cli.js:48:3)\n    at async file:///C:/temp/validation-test/node_modules/barnard59/bin/barnard59.js:99:3\nCaused by: Error [ERR_UNSUPPORTED_ESM_URL_SCHEME]: Only URLs with a scheme in: file, data, and node are supported by the default ESM loader. On Windows, absolute paths must be valid file:// URLs. Received protocol 'c:'\n    at throwIfUnsupportedURLScheme (node:internal/modules/esm/load:239:11)\n    at defaultLoad (node:internal/modules/esm/load:130:3)\n    at ModuleLoader.load (node:internal/modules/esm/loader:409:13)\n    at ModuleLoader.moduleProvider (node:internal/modules/esm/loader:291:56)\n    at new ModuleJob (node:internal/modules/esm/module_job:65:26)\n    at #createModuleJob (node:internal/modules/esm/loader:303:17)\n    at ModuleLoader.getJobFromResolveResult (node:internal/modules/esm/loader:260:34)\n    at ModuleLoader.getModuleJob (node:internal/modules/esm/loader:241:17)\n    at async ModuleLoader.import (node:internal/modules/esm/loader:328:23)","timestamp":"2024-01-10T15:08:35.386Z"}
    PipelineError: could not load step b30_sortBySubject
      at file:///C:/temp/validation-test/node_modules/barnard59-core/lib/factory/step.js:23:19
      at async Pipeline.onInit (file:///C:/temp/validation-test/node_modules/barnard59-core/lib/factory/pipeline.js:46:35)
      at async Pipeline._init (file:///C:/temp/validation-test/node_modules/barnard59-core/lib/Pipeline.js:65:13)
      at async file:///C:/temp/validation-test/node_modules/barnard59/runner.js:19:7
      at async file:///C:/temp/validation-test/node_modules/barnard59/lib/cli/runAction.js:34:51
      at async default (file:///C:/temp/validation-test/node_modules/barnard59/lib/cli/runAction.js:27:3)
      at async Command.parseAsync (C:\temp\validation-test\node_modules\commander\lib\command.js:940:5)
      at async default (file:///C:/temp/validation-test/node_modules/barnard59/lib/cli.js:48:3)
      at async file:///C:/temp/validation-test/node_modules/barnard59/bin/barnard59.js:99:3
    Caused by: Error [ERR_UNSUPPORTED_ESM_URL_SCHEME]: Only URLs with a scheme in: file, data, and node are supported by 
    the default ESM loader. On Windows, absolute paths must be valid file:// URLs. Received protocol 'c:'
      at throwIfUnsupportedURLScheme (node:internal/modules/esm/load:239:11)
      at defaultLoad (node:internal/modules/esm/load:130:3)
      at ModuleLoader.load (node:internal/modules/esm/loader:409:13)
      at ModuleLoader.moduleProvider (node:internal/modules/esm/loader:291:56)
      at new ModuleJob (node:internal/modules/esm/module_job:65:26)
      at #createModuleJob (node:internal/modules/esm/loader:303:17)
      at ModuleLoader.getJobFromResolveResult (node:internal/modules/esm/loader:260:34)
      at ModuleLoader.getModuleJob (node:internal/modules/esm/loader:241:17)
      at async ModuleLoader.import (node:internal/modules/esm/loader:328:23)
    
    edit: finally we can differentiate: check-metadata (runs as expected on Windows); check-observations (doesn't start on Windows due to `ERR_UNSUPPORTED_ESM_URL_SCHEME`); will be fixed with PR rdf-loader-code#34 (fix: windows compatibility to import modules using filename)
  • add the steps for extracting the files cube.ttl and constraint.ttl from LINDAS using b59

  • ensure compatibility with the current toolset and validation workflow

    • cube.ttl containing cube:Cube, cube:ObservationSet and cube:Observation
    • constraint.ttl containing cube:ObservationConstraint
  • add a step for getting a human-readable validation result (the current validation provides one directly)

    cube validation failed 
    Violation of NodeConstraintComponent: "cube:Cube needs at least one cube:ObservationSet" with path <https://cube.link/observationSet> at focus node <https://environment.ld.admin.ch/foen/gefahren-waldbrand-warnung/1> (source: _:b1) 
    Violation of MinCountConstraintComponent: "cube:ObservationSet needs at least one cube:observation" with path <https://cube.link/observation> at focus node <https://environment.ld.admin.ch/foen/gefahren-waldbrand-warnung/1/observation/> (source: _:b3)
    

Once this is fixed, we need to include @kronmar for his review as well.

giacomociti (author)

Hi @Rdataflow, thanks for your feedback. I will ensure the pipelines run on Windows and add a step to summarize the report.
We may need more analysis and discussion about splitting data and metadata into cube and constraint files, trying to balance compatibility with existing workflows against the need to leverage streaming to process big cubes.

giacomociti (author)

@Rdataflow, here's my analysis of your compatibility requirements.

A cube has different kinds of data and metadata. Keeping everything in a single file is manageable only for small cubes.
The validation examples in the spec assume data is split into two files (cube.ttl and constraint.ttl, although there is a hint at using a single file completecube.ttl).

File constraint.ttl has only the constraint. File cube.ttl has all the rest: cube metadata (except the constraint), observation sets with their observations, and also the complete data for the referenced terms (shared dimensions and hierarchies).

To deal with even bigger cubes, we had to split the data further. The initial version of the validation pipelines provides commands to (see the usage sketch after this list):

  • fetch-metadata: fetch cube metadata and constraint
  • fetch-observations: fetch observations only (without links from observation sets or shared-dimension data)
  • check-metadata: check fetched metadata against a profile
  • check-observations: check fetched observations against the constraint in fetched metadata
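As a usage sketch of how these might fit together (the --endpoint and --cube option names and the example URLs are assumptions for illustration; --profile and --constraint are the options shown later in this thread):

# endpoint/cube option names and URLs are placeholders for illustration
npx barnard59 cube fetch-metadata \
    --endpoint https://example.org/query \
    --cube https://example.org/my-cube \
  > metadata.ttl

cat metadata.ttl \
| npx barnard59 cube check-metadata \
    --profile https://cube.link/v0.1.0/shape/standalone-constraint-constraint

npx barnard59 cube fetch-observations \
    --endpoint https://example.org/query \
    --cube https://example.org/my-cube \
| npx barnard59 cube check-observations \
    --constraint metadata.ttl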

the problem

The new pipelines split cube data differently from the spec example: there, cube metadata and observations sit together in cube.ttl, whereas the new pipelines fetch and use them separately.

The main motivation for splitting the data differently is to leverage streaming, allowing the processing of bigger cubes.

To smooth the transition from existing validation tools and workflows, we may evolve the new pipelines to allow more flexibility in their usage: it should be possible to process existing smaller cubes along the lines of the spec examples, and to use the new approach for new or bigger cubes.

the strategy

A good strategy is to consider a sufficiently granular partition of the data and have separate but composable modules for each part.

  • metadata
  • constraint
  • observation-sets
  • observations
  • defined-terms

fetching data

We may have a general fetch command capable of getting any combination of the above parts, along with a few pre-defined sensible combinations.

Skipping over details of CLI syntax, the current fetch-metadata and fetch-observations would correspond respectively to
`fetch -- metadata constraint` and `fetch -- observations`.
To get the contents of files constraint.ttl and cube.ttl like in the spec example, we may respectively use `fetch -- constraint` and `fetch -- metadata observation-sets observations defined-terms`.

To give an idea of the amount of data and the complexity of its retrieval, here is a hint at a possible implementation for each part:

metadata

DESCRIBE <${cube}>

constraint

DESCRIBE ?constraint WHERE {
   <${cube}> cube:observationConstraint ?constraint
}

observation-sets

CONSTRUCT { ?s cube:observation ?o } WHERE {
    <${cube}> cube:observationSet ?s .
    ?s cube:observation ?o .
}

observations

CONSTRUCT { ?s ?p ?o } WHERE {
    <${cube}> cube:observationSet/cube:observation ?s .
    ?s ?p ?o
}

defined-terms

To get data for shared dimensions and hierarchies, a starting point may be something like:

DESCRIBE ?o WHERE {
    <${cube}> cube:observationSet/cube:observation ?s .
    ?s ?p ?o
}

but this may be incomplete and inefficient. Notice that part of this data is also in the constraint (sh:in clauses).

It may be helpful to have cube metadata listing which defined term sets are in use, but this is out of scope at the moment.

Validation

From the spec:

The validation process of the cube can be divided in three different aspects.

  • The cube structure and contents
  • The structure of the observations
  • The integrity of the constraints

For the first aspect (cube structure and contents) there is a simple shapes file (basic-cube-constraint.ttl). Unfortunately, it requires almost all the data (metadata, observation sets, observations) for a trivial validation.

Checking the structure of the observations requires at least the constraint part and the observations.
There's no need for the metadata part (even though we expect it to be small enough) nor for observation-sets.
Depending on the constraint (for example in case of sh:class shapes), defined-terms may be needed.
This is rarely the case (I couldn't find any in LINDAS), and it is challenging for big cubes (there is an attempt to address this but it's better discussed elsewhere).
For small cubes that fit in memory, the solution with the check-observations command is to disable streaming (setting --batchSize 0) and to use as input not only the observations part but also the defined-terms.
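A minimal sketch of that workaround (the file names are placeholders for the fetched parts; --constraint and --batchSize are the options already used in this thread):

# observations.ttl and defined-terms.ttl are placeholder names for the fetched parts
cat observations.ttl defined-terms.ttl \
| npx barnard59 cube check-observations \
    --constraint metadata.ttl \
    --batchSize 0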

To check the integrity of the constraints there are different profiles which usually require constraint and metadata.

Final remarks

We propose a refactoring of the "fetching" part of the new pipelines, allowing more options for which cube parts to retrieve, to better fulfill the needs of the corresponding "checking" commands.
We will revise the CLI syntax to maximize ease of use for common cases, with proper documentation also covering more specific cases.

Another UX improvement to address is the choice of validation report format (human-readable vs. machine-readable).

kronmar (Contributor) commented Jan 11, 2024

Hi @giacomociti

I just wanted to jump in to stress something. Right now, we check our cubes before we upload them. It would be important for us to be able to use this new pipeline in a similar fashion, as a drop-in replacement for the current tools. I don't see that as an issue, just wanted to mention it.

One additional part jumped out at me:

observations

CONSTRUCT { ?s ?p ?o } WHERE {
    <${cube}> cube:observationSet/cube:observation ?s .
    ?s ?p ?o
}

What if the cube:ObservationConstraint has some sh:ignoredProperties? Do you still want to fetch them or not? For the validation process that's irrelevant, as validation will of course still ignore those properties, but it might fetch unneeded data.
I wrote my own fetcher a couple of months ago, and I used this query to achieve that, which I feel is more to the point.

giacomociti (author)

Hi @kronmar, thanks for reaching out. Your query may be a useful improvement, although we also need to include the rdf:type property, because the validation targets the cube:Observation class.

Concerning validation before publishing, that's of course a good idea.
Maybe I'm missing something concerning compatibility with the current toolset and validation workflow, but I think you can still use a single completecube.ttl file or the two files constraint.ttl and cube.ttl:

cat completecube.ttl \
| npx barnard59 cube check-metadata \
    --profile https://cube.link/v0.1.0/shape/standalone-constraint-constraint

cat completecube.ttl \
| npx barnard59 cube check-observations \
    --constraint completecube.ttl

When checking observations, the same file is used twice, both as data file and as shapes file, much like with the existing tool:

./bin/cube-link.js validate completecube.ttl completecube.ttl

If you have separate constraint.ttl and cube.ttl files:

cat constraint.ttl \
| npx barnard59 cube check-metadata \
    --profile https://cube.link/v0.1.0/shape/standalone-constraint-constraint

cat cube.ttl \
| npx barnard59 cube check-observations \
    --constraint constraint.ttl

giacomociti and others added 9 commits January 12, 2024 14:10
Co-authored-by: Tomasz Pluskiewicz <[email protected]>
giacomociti (author)

@Rdataflow I'm on a Windows machine but could not reproduce the issue. Which version of Node are you using?

Btw, I regularly work on Windows, but I use the Windows Subsystem for Linux (WSL). I think this has many advantages for Node applications.

giacomociti (author)

More comments about the query proposed by @kronmar: at first I thought the triples with rdf:type cube:Observation were missing, but I was wrong: they're there.
The only issue I see is that if a cube lacks the constraint, no observations are returned.

kronmar (Contributor) commented Jan 16, 2024

Using WSL, I was able to run the validator.

@Rdataflow I had the same issue you had with punycode. Switching from node-v21.6.0 to node-v20.10.0 (the LTS version) fixed that issue for me.

Rdataflow (Contributor) commented Jan 16, 2024

I confirm it works here on Windows using node-20.10 (LTS) instead of node-21.5 👍

edit: unfortunately only check-metadata works, so to differentiate:

  • check-metadata (runs as expected on Windows)
  • check-observations (doesn't start on Windows due to ERR_UNSUPPORTED_ESM_URL_SCHEME)

kronmar (Contributor) commented Jan 17, 2024

I confirm it works here on Windows using node-20.10 (LTS) instead of node-21.5 👍

edit: unfortunately only check-metadata works, so to differentiate:

  • check-metadata (runs as expected on Windows)
  • check-observations (doesn't start on Windows due to ERR_UNSUPPORTED_ESM_URL_SCHEME)

Without WSL, I can reproduce this issue. Although I had a small issue beforehand, which I could solve: an ERR_MODULE_NOT_FOUND for barnard59-base... not sure if that was an issue on my end.

Rdataflow (Contributor)

@giacomociti thank you for your in-depth response on backward compatibility and the use of streams. This helps a lot in understanding the strategy you followed and why, in consequence, you altered the structure.

While modular design is useful in many cases, for this case I see an important disadvantage:

  • we need to ensure every sh:targetClass is applied to all the relevant data.
    • a task which is already complex today.
    • nit: i.e. currently we don't apply
      sh:targetClass cube:Constraint ;
      as this profile isn't applied to constraint.ttl according to the docs
    • upon dividing cubes into n pieces while having m profiles, this results in m * n possible checks

considering this:

  • how can we keep things as simple as possible (while still streaming the observations, of course)?

we can also drop backwards compatibility and migrate to a new structure in order to reduce complexity, in case this is effectively required to stream observations...

Rdataflow (Contributor) commented Jan 29, 2024

@giacomociti thanks for sharing your update. The general direction looks good, just a few minor remarks:

  • there should always be a SHACL validation report - also on success (I didn't get one on success)

    • cat test/observations/undefinedAllowed.ttl | barnard59 cube check-observations --constraint test/observations/undefinedAllowed.ttl
  • the example cube no longer validates (in contrast to the existing validation tool)

    • cat cube.ttl | barnard59 cube check-observations --constraint constraint.ttl | barnard59 shacl report-summary gets 13 errors (ClassConstraintComponent); see also 4a3d8d9: your existing example cube fails validation, while ./bin/cube-link.js validate cube.ttl constraint.ttl succeeds
  • nit: allow Windows: giacomociti/external-merge-sort#1 (fix: softfail if rmSync fails)

  • curious: is it possible to have multiple --profile options in the same command? i.e. will those profiles be combined?

@giacomociti
Copy link
Contributor Author

hi @Rdataflow, concerning

the example cube no longer validates

the problem is that the sh:class constraint is at odds with streaming: in order to check that an observation value is an instance of a given class, we need to look for data which is probably not within the chunk being validated.

This is a known problem we are trying to document together with a possible workaround (avoiding batching if the cube is small) or to address (see zazuko/barnard59#238) for bigger cubes.

Notice that in LINDAS I could not find any cube with the sh:class constraint, and the example in cube.link should be small enough to fit in a single chunk (with the default batch size of 50). In fact, I can reproduce the error only by explicitly setting a very small batch size (--batchSize 3), so I am curious how you could get the error with the example cube.
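For reference, the reproduction is essentially the command from the spec example with a tiny batch size forced:

# same flags as the check-observations examples above; --batchSize 3 forces tiny chunks
cat cube.ttl \
| npx barnard59 cube check-observations \
    --constraint constraint.ttl \
    --batchSize 3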

giacomociti (author)

@Rdataflow concerning this:

curious: is it possible to have multiple --profile options in the same command? i.e. will those profiles be combined?

Only one --profile argument is allowed, but on the server side a profile can be composed to include others with the code:imports predicate: see for example https://github.com/zazuko/cube-link/blob/4df403bf1e9062b0e2a63bf6b0d287fd029d35e6/validation/profile-opendataswiss.ttl#L18C5-L18C17

Rdataflow (Contributor)

hi @Rdataflow, concerning

the example cube no longer validates

the problem is that the sh:class constraint is at odds with streaming: in order to check that an observation value is an instance of a given class, we need to look for data which is probably not within the chunk being validated.

This is a known problem we are trying to document together with a possible workaround (avoiding batching if the cube is small) or to address (see zazuko/barnard59#238) for bigger cubes.

Notice that in LINDAS I could not find any cube with the sh:class constraint, and the example in cube.link should be small enough to fit in a single chunk (with the default batch size of 50). In fact, I can reproduce the error only by explicitly setting a very small batch size (--batchSize 3), so I am curious how you could get the error with the example cube.

Good to see you document this edge case somewhere, just in case (low effort is perfect, as currently nobody depends on sh:class) 👍

@giacomociti the batch size isn't specified. It reproduces on multiple platforms as well as locally.
repro case: https://github.com/Rdataflow/rdf-cube-schema/actions/runs/7697065092/job/20973256592
Up to you how you fix it; a solution would be to remove sh:class so the example validates properly again.

giacomociti (author)

Is there anything more to fix, or can we go ahead with merging this?

Rdataflow (Contributor)

@giacomociti everything good from my point of view 🎉

giacomociti merged commit 0bc0469 into main on Feb 14, 2024
10 checks passed
giacomociti deleted the validate-docs branch on February 14, 2024 07:32