mention target and b59-cube #128
giacomociti commented on Dec 5, 2023
- mention the new commands in barnard59-cube
- mention the addition of a SHACL target by the validation tool
As you mentioned yourself, I would hold this and change the docs to use the b59 pipeline commands.
documentation/tools.md (outdated)
The constraint should be a SHACL shape but it's not expected to have any [target declaration](https://www.w3.org/TR/shacl/#targets).
The validation tool takes care of making all the observations a target for the constraint.
@giacomociti good point 👍
nit: I'd even suggest:
- it's expected to not have any [target declaration]
btw: this would also be worth mentioning in https://cube.link/#cubeconstraints - or would there be anything against that?
In addition to the comments in code, I would also add a new section to show example usage of the commands `b59 cube fetch-metadata` and `b59 cube fetch-observations`.
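For instance, such a section could show something along these lines (just a sketch: the `--endpoint` and `--cube` option names and the output redirection are assumptions to be checked against the barnard59-cube documentation):

```sh
# Sketch only: fetch the cube metadata (including its constraint) from a
# SPARQL endpoint and store it in a local file. Option names are assumed.
b59 cube fetch-metadata \
  --endpoint https://example.org/query \
  --cube https://example.org/cube/example \
  > metadata.ttl

# Sketch only: fetch the observations of the same cube.
b59 cube fetch-observations \
  --endpoint https://example.org/query \
  --cube https://example.org/cube/example \
  > observations.ttl
```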
documentation/tools.md (outdated)
@@ -19,63 +19,88 @@ An example Cube is specified in [cube.ttl](cube.ttl). The cube provides a constr
### Validate the cube
I would actually propose making this a level 2 header and raising the subsections with it. I see no reason to nest under Example Cube.
💯
Done, also renamed "Validate the cube" to "Validating Cubes".
Hi @giacomociti, thank you for your important work towards an improved validation method. Glad to share our review with you: …

Once this is fixed we need to include @kronmar for his review as well.
Hi @Rdataflow, thanks for your feedback. I will ensure the pipelines run on Windows and add a step to summarize the report.
@Rdataflow, here's my analysis for your compatibility requirements.

A cube has different kinds of data and metadata. Keeping everything in a single file is manageable only for small cubes. File … To deal with even bigger cubes we had to split data even more. The initial version of the validation pipelines provides commands to: …

### the problem

The new pipelines split cube data differently than in the spec example: cube metadata and observations are together in …

The main motivation for splitting data differently is to leverage streaming, allowing the processing of bigger cubes.

To smooth the transition from existing validation tools and workflows, we may evolve the new pipelines to allow more flexibility in their usage: it should be possible to process existing smaller cubes along the lines of the spec examples, and to use the new approach for new or bigger cubes.

### the strategy

A good strategy is to consider a sufficiently granular partition of the data and have separate but composable modules for each part. The parts considered here are: metadata, constraint, observation-sets, observations, and defined-terms.

### fetching data

We may have a general fetch command capable of getting any combination of the above parts, along with a few pre-defined sensible combinations. Skipping over details of CLI syntax, the current …

To get an idea of the amount of data and the complexity of their retrieval, this is a hint at a possible implementation for each part:

#### metadata

```sparql
DESCRIBE <${cube}>
```

#### constraint

```sparql
DESCRIBE ?constraint WHERE {
  <${cube}> cube:observationConstraint ?constraint
}
```

#### observation-sets

```sparql
CONSTRUCT { ?s cube:observation ?o } WHERE {
  <${cube}> cube:observationSet ?s .
  ?s cube:observation ?o .
}
```

#### observations

```sparql
CONSTRUCT { ?s ?p ?o } WHERE {
  <${cube}> cube:observationSet/cube:observation ?s .
  ?s ?p ?o
}
```

#### defined-terms

To get data for shared dimensions and hierarchies, a starting point may be something like:

```sparql
DESCRIBE ?o WHERE {
  <${cube}> cube:observationSet/cube:observation ?s .
  ?s ?p ?o
}
```

but this may be incomplete and inefficient. Notice that part of this data is also in the constraint (…). It may be helpful to have cube metadata listing which defined term sets are in use, but this is out of scope at the moment.

### Validation

From the spec: …

For the first aspect (cube structure and contents) there is a simple shapes file (…). Checking the structure of the observations requires at least the constraint part and the observations. To check the integrity of the constraints there are different profiles, which usually require constraint and metadata.

### Final remarks

We propose a refactoring of the "fetching" part of the new pipelines, allowing for more options about which cube parts to retrieve, to better fulfill the needs of the corresponding "checking" commands. Another UX improvement to address concerns the choice of validation report (human readable vs. machine readable).
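To make the intended composition concrete, a hypothetical end-to-end run could look roughly like this (the `--endpoint`, `--cube`, `--constraint` option names and the use of stdin/stdout streaming are assumptions, not the final CLI design):

```sh
# Hypothetical sketch: fetch the constraint once, then stream observations
# straight from the fetch step into the check step, so that large cubes
# never have to be fully loaded in memory. Option names are assumptions.
b59 cube fetch-metadata \
  --endpoint https://example.org/query \
  --cube https://example.org/cube/example \
  > metadata.ttl

b59 cube fetch-observations \
  --endpoint https://example.org/query \
  --cube https://example.org/cube/example \
  | b59 cube check-observations --constraint metadata.ttl
```

The point is not the exact syntax but that each part (metadata, constraint, observations) can be fetched and checked independently and composed through streams.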
Hi @giacomociti, I just wanted to jump in to maybe stress something. Right now, we check our cubes before we upload them. It would be important for us to be able to use this new pipeline in a similar fashion, as a drop-in replacement for the current tools. I just wanted to stress that part; I don't see it as an issue, just wanted to mention it. One additional part jumped out at me: …

What if the …
Hi @kronmar, thanks for reaching out. Your query may be a useful improvement, although we also need to include the …

Concerning validation before publishing, that's of course a good idea.

When checking observations, the same file is used twice, both as the data file and as the shapes file, much like with the existing tool: …

If you have separate …
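For illustration only, the two cases might look roughly as follows (the `--constraint` option name and reading observations from stdin are assumptions):

```sh
# Sketch: single-file case, as in the spec example. The same file provides
# both the observations (data) and the constraint (shapes).
b59 cube check-observations --constraint cube.ttl < cube.ttl

# Sketch: separate files. The constraint comes from the fetched metadata,
# and the observations are read from their own file.
b59 cube check-observations --constraint metadata.ttl < observations.ttl
```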
Co-authored-by: Tomasz Pluskiewicz <[email protected]>
@Rdataflow I'm on a Windows machine but could not reproduce the issue. Which version of Node are you using? Btw, I regularly work on Windows, but I use the Windows Subsystem for Linux (WSL). I think this has many advantages for Node applications.
More comments about the query proposed by @kronmar: first, I was thinking the triples with …
Using WSL, I was able to run the validator. @Rdataflow, I had the same issue you had with …
I confirm it works here on Windows using node-20.10 (LTS) instead of node-21.5 👍

Edit: unfortunately only …
Without WSL, I can reproduce this issue. Although I had a small issue beforehand, which I could solve: a …
@giacomociti thank you for your in-depth response on backward compatibility and the use of streams. It helps a lot to better understand the strategy you followed and why, in consequence, you altered the structure. While modular design is useful in many cases, for this case I see an important disadvantage: …

Considering this: …

We can also drop backwards compatibility and migrate to a new structure in order to reduce complexity, in case this is effectively required to stream observations...
@giacomociti thanks for sharing your update. The general direction looks good, just a few minor remarks: …
Hi @Rdataflow, concerning …

the problem is that the …

This is a known problem we are trying to document, together with a possible workaround (avoiding batching if the cube is small), or to address (see zazuko/barnard59#238) for bigger cubes. Notice that in LINDAS I could not find any cube with the …
@Rdataflow concerning this: …

The `--profile` argument is only one, but on the server side a profile can be composed including others with the …
Good to see you document this edge case somewhere, just in case (low effort is perfect, as currently nobody depends on sh:Class) 👍

@giacomociti the batch size isn't specified. It reproduces on multiple platforms as well as locally.
Is there anything more to fix, or can we go ahead with merging this?
@giacomociti everything good from my point of view 🎉