-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to retain comments? #2
Comments
Hi, this is something that has been bothering me as well for a long time (much longer than this project exists). The answer to a) is easy: I'd love to have that option. However, the problem here is that the RDF libraries throw away all information about comments in their parsers. I know this to be true for Apache Jena and believe RDF4J works similarly. But even if we were to implement our own RDF/Turtle parser, a regular Jena model does not even has methods to work with the comments. I see only three ways to deal with this:
TBH: I have thought about option 2 more than once, but it's not going to happen soon (from my side) due to limited time. So the answer to b) is unfortunately "very hard". |
That is indeed unfortunate. Thank you for the detailed explanation though! Both non-stoic solutions are way out-of-scope for the project I'm working on, so I cannot really commit to those. If you ever do get started on option 2 though, feel free to ping me and I'll try to help. |
Just FYI: Apparently |
@renefritze How can you determine to WHAT a comment applies? The only format that caters for comments is Relax NG Compact (a XSD schema language). Eg this
is translated to <rng:element name="Subject">
<rng:interleave><rng:element name="Ancestors">
<a:documentation>
There can be one and only one preferred ancester branch, but zero to many non-preferrened ancester branches
</a:documentation>
<rng:interleave><rng:element name="Preferred_Ancestor_Branch"> But RDF doesn't have such features. As a consolation, every comment in an ontology should usefully be expressed as a specific annotation, eg I think you should close this |
Hi @VladimirAlexiev, of course you are correct in that there is no built-in logic in RDF to what a comment applies. But there are ways that can be addressed on the level of the Turtle format, and there are valid use cases for this, which is why this issue is not closed and should be addressed instead. I think there are roughly four types of comments that appear in RDF/Turtle:
By hooking into the parser (or creating a custom parser, see my first comment), you can figure out which type of comment you have and then either reifiy it in the model or provide access to it using a custom API (which I mean with "an own implementation of a Jena model"). Let's say you have this file: # Copyright ACME Corporation
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
#
# Classes
#
# The class of all persons
:Person a rdfs:Class .
#
# Properties
#
# The name of a person
:name a rdf:Property . You could end up with a model with reified comments: @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix formatting-meta: <https://github.com/atextor/turtleformatter/meta#> .
:Person a rdfs:Class ;
formatting-meta:comment "The class of all persons" .
:name a rdf:Property ;
formatting-meta:comment "The name of a person" .
[
a formatting-meta:headerComment "Copyright ACME Corporation"
]
[
a formatting-meta:sectionComment """\nClasses\n\n"""
]
[
a formatting-meta:sectionComment """\nProperties\n\n"""
] What is missing here is the location information about the section comments, which might work like this:
@fkleedorfer has already created the base for this custom-parser approach in #15. One more note on the validity of doing this in the first place vs. just expressing comment information in the RDF graph itself: IMO both techniques have their use and although you'd lose textual comments from an RDF/Turtle document once you put it in a triple store or start processing it strictly on RDF-level, RDF/Turtle files are still often treated like source code, where you add hints and sometimes comment out a section. Last but not least, the comment-based copyright/license header is important for source code license management where many tools just parse comments, but not an RDF document (even though it would be cool to get this info directly from the RDF file using SPDX vocabulary, but I digress). |
I'm no longer working on the project I wanted this for, so feel free to close. |
case in point: a contribution to QUDT with addtional information for reviewers, making it possible to assess correctness: https://github.com/fkleedorfer/qudt-public-repo/blob/ef3e91832691e59e02317ac98c79bdd8eaba2b07/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl#L228 In this case, I want formatting applied and leave the comment as-is, attached to the follwing statement, if reordering must happen (which might well be the case). I see no value in transforming comments to triples. If you wanted a comment triple, you'd write it as a triple. To transform comments to triples is essentially a macro for writing TTL. Possibly that's another interesting project but it's not code formatting. Preserving comments has value for me. Think of unit test input files with information for the maintainer of the file (these triples for testing this, those for testing that; comment something out occasionally while developing the tests - all of these are real examples. |
The custom parser in #15 already gets row/col information for each new Node. What would be missing is an additional parse that only reads comments (with their positions), and then figures out which comments are blocks, and which comments should be associated with which RDF node. All totally doable. |
@fkleedorfer I should have clarified, IMO the only point of having comments reified into triples is to have the information available when writing the model out into a file again; the comment meta information should of course then be converted to proper RDF/Turtle comments again. The reificiation is not necessary just to do this (it could also be kept beside the model), but it would allow for the roundtrip Turtle->parse->write RDF to triple store->retrieve->serialize Turtle. |
If you tackle this, you guys are my heros! You give enough use cases to justify the need. But there be dragons:
|
Yes, this is definitely non-trivial 🐉. IMO this can only be solved in a "best-effort" way, i.e., the solution handles for example the 3-4 well-defined cases and for everything else the result is arbitrarily/randomly/badly formatted. This would still solve >=80% valid cases and would be massively useful, even if it's not a 100% solution. |
agreed - the 80% being that comments are just reproduced in the output as-is if the input was already turtle-formatter formatted (no changes in triple ordering). If there are changes in ordering, comments might not always end up where a human expert would like them, but users really would not expect them to because it's obviously hard in some cases. |
boah... all of you.. I have been dwelling about these issues for so long, and never found anything/anyone even remotely seeing this as an issue ... Side Note: The author of that formatter also maintains the oxrdf/oxigraph rust RDF ecosystem, whichs parsers also don't parse comments, and wrote a completely separate parser just for this tool. Now having found this (thanks to @VladimirAlexiev , thanks you! ), I will hop over and focus on turtle-formatter. To finally get to something substantial: My personal, current favored way to tackle this issue, is also to convert comments to RDF, but to let the user do this manually: If comments are found, issue a warning/error and abort the formatting (this should not be the default for a generic tool like this one, of course). In the warning/error message, give a link to a page that explains/suggests how to convert comments to RDF. <>
a owl:Ontology ;
a spdxt:File ;
schema:comment """
# SPDX-FileCopyrightText: 2024 John Doe <[email protected]>
# SPDX-License-Identifier: GPL-3.0-or-later
""" ;
. because it only does a very simple regex-match of each line for all file-types. Comments to the ontology/file and to subjects are trivial to convert. Section/Group comments can be handled by using Commented-out code can be handled similar to the REUSE example above, adding the comment either to the related subject or to a new blank-node (if the subject it is related to is commented-out itsself, or it is a I also like the Sorry for this long, not very well structured intro. My thoughts and ideas were locked up in my head, with only me working on them for too long, and I have much to learn about possible issues and what the data looks like out there and practicality and so on. I'll hope to be able to do that with you here. Thanks already for all the work you have done! |
Thank you for the kind words @hoijui. |
I really like capturing comments in RDF. Then it's up to the ontology maintainer to dispatch the text to appropriate fields (or delete it). Can you guys recognise commented-out? How? |
@VladimirAlexiev, yes, I absolutely see the advantage of using a proper vocabulary and putting comments in the RDF itself from the start. If we were discussing a "best practices" guide or an ontology governance process I'd be in favor of it (with the exception of the copyright/license block at the top of the file). And this issue is orthogonal to it: if an RDF/Turtle file contains "real" comments, how do we deal with it in order to retain them when the file is formatted. The idea is that when such a model is parsed, the graph will contain the comments reified as triples (using a custom vocabulary), but as soon as it's written to a file, those triples will be written as regular comments again. A user is not supposed to ever see/edit those comment triples. |
Re-serializing comments is possible in turtle and rdf/xml but not in jsonld.
By "user" do you mean the ontology maintainer? Let's have an option to serialize as comments vs as triples. |
Let me try to see if I understood the approaches correctly:
If that is correct, then I think 1 should be the default (however it is implemented internally). Reason: formatting should never alter the data unless the user explicitly requests that. |
To me, there is a 4th option:
|
IMHO this approach belongs in a best practices guide, but is no valid option for a formatting tool: It's not for the tool to decide what authors of valid RDF/Turtle files should do, but "just" work correctly for every valid input. |
Maybe it is a scope issue. One way to look at it is: Then, all functions that do touch content (double formatting, comments transformation), must be done one multiple other tools (a comment triplifyer or a number handler, maybe also a style watchdog). We are packing everything into the turtle-formatter because there is no concept of doing multiple passes over an rdf file. In diffplug/spotless, there is, and it would allow for separating concerns nicely. We could, over time, make that happen. The discussion about comments would not be taking place if rdf parsers retained them in the model. Nobody would expect them to be altered in any way except presentation. |
💯 agree.
turtle-formatter should remain exactly that, a formatter. However, yes, for real-world usage, other tools are required as well. This is another one of the reasons for the planned atextor/owl-cli#20 that will merge turtle-formatter as one tool into a broader scope. Beyond formatting I also envision tools such as code style checks, code smell checks, maybe convenient SHACL validation etc.
You are right; and I still consider the introduction of a Parser with a custom model that retains comments as a valid option (see the second bullet point in my first reply in this issue). |
The warning/error approach explicitly requires a best practice guide, and the message would point to it. This guide would explain how to best (manually) convert comments to triples. I use rust a lot, and there, virtually everyone uses rusts own auto-formatter for each git commit. It is very good, but I have seen it do a lot of mistakes (from a human point of view), all of which have comments as their content. I understand the idea of wanting a tool that just works, but with the sophistication that this tool already has, it is impossible to do without errors. These errors would likely not be noticed any-time soon, unless.. maybe months or years later, someone would read a comment that seems to refer to code, but makes no sense in the place it is, and then what? Most likely the original could not be retrieved, as people tend to use auto-formatters before making a commit (and CIs will likely make that a necessary requirement). Thus it is very important not to make any error, in my eyes. |
Good point.
Also a very good point. The formatter action leading to a problem in these cases is not reproducing comments, though. The root cause would be reordering of statements, which happens, for an unformatted file, once with a big impact, and subsequently only with little impact as changes and additions are made. Compared to not having the option of using comments at all, it still seems the better alternative to me. |
I also think we should have that option, but it should not be the default. By default, it has to be 100% non-destructive. |
Would you a) be willing to have that config option and b) how hard would you think this would be to add?
The text was updated successfully, but these errors were encountered: