-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YAML canonicalization (c14n) #289
Comments
A relevant discussion on one technical question from the linked issue:
|
What does "semantically equivalent" mean in this context? For example, are the following documents semantically equivalent? - some_string
- some_string - &anch some_string
- *anch - some_string
- "some_string" Technically, 1 is different from both 2 and 3, and their representation graphs differ. Still, a lot of apps/frameworks will treat them the same when going all the way to native objects level. For example, 1 and 2 are hard to distinguish if your language has immutable strings, and 1 and 3 will be the same if an unquoted but not special-cased scalar is interpreted as a string (which matches YAML core schema). Is "YAML representation graphs are identical" a useful definition of "semantically equivalent YAML documents" or do you want to go to some higher level?
What step(s) does "parse-serialize" include in terms of the YAML processing pipeline above (spec) - are we speaking about going to representation tree and back?
This requirement seems to only make sense within one programming language as the internal representation will necessarily be different for different languages. Is that expected? Also, would it be okay for c14n to be more restrictive than normal YAML ("no, you cannot canonicalize an arbitrary document without specifying its schema and some information about non-standard tags it uses, if any")? |
I presume that it must refer to the spec's notion of node equality. Under that definition, all three of the the example documents (interpreted under the core schema) are equal to each other. A “canonical presentation” of those documents might look something like this: %YAML 1.2
---
!<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:str> "some_string"
- !<tag:yaml.org,2002:str> "some_string"
...
I think that this would be unnecessary. A “canonical presentation” should present an explicit tag property for each node. Then, there are no non-specific tags to be resolved according to any schema. And it should present the canonical form for each scalar. Thus the “canonical presentation” should have identical semantics under any schema. (This is off the top of my head, so please let me know if I've missed something.) Note that while the example canonical presentation is schema-independent, the three example documents are not. If you had a converter that took in an arbitrary YAML document and converted it to a canonical presentation, then you would have to supply a schema as input — just as you would when loading an arbitrary YAML document. I see two potential issues with aliases. Two documents that differ only in whether certain nodes are aliased have equal representation, so (I presume) we should want them to have the same canonical presentation. This means that we would have to make consistent decisions about when to use aliases when not necessary. The simplest approach would be to never use unnecessary aliases. This seems like the most sensible to me, except that this would make a canonical presentation converter vulnerable to a “billion laughs” attack. A converter could mitigate this by stopping at a certain number of serialization nodes. The next-simplest approach might be to use aliases whenever possible. This might be tricky or expensive to implement. In addition, it might produce surprising results for simple documents. The other issue is node equality for mapping keys. In non-recursive cases, node equality is very simple. In most recursive cases, it's still pretty reasonable. But you can embed arbitrary directed graphs in YAML, so node equality requires solving graph isomorphism, which is NP-complete. Currently, the 1.2.2 spec punts on equality for any recursive nodes (acknowledging that previous spec versions did not fully specify it). Expecting YAML implementations to solve GIP would be a big ask. There are practical algorithms that would probably run well for most documents, but they are not simple, and there's no way around exponential runtimes in the worst case, which would be a security risk. Alternatively, I think that everything would be nice and tractable if we added a restriction that a mapping key may not contain its parent mapping as a descendent. This is obviously not ideal on its face, but it probably would negatively affect very few users and it would allow us to properly specify node equality in a way that libraries could reasonably implement. All this aside, I'm not familiar with YAML-LD, but it might be more practical at this point to define a canonical presentation for YAML-LD documents. If YAML-LD is intended to express the same application semantics as a JSON-based format, then I suspect that YAML-LD canonicalization may be substantially simpler and easier than arbitrary YAML canonicalization. |
So Thinking more about this, I find the current YAML stance on node identity inconsistent. It both declares it unimportant (in terms of equality and in terms of scalar identity - I forgot there was a special case for that in the spec) and treats it as important by providing the way to represent (almost) arbitrary node graphs by identity. IMHO, the two "consistent" stances would be either "node identity is not important, your data structure has to be a tree, aliases are only shortcuts" or "node identity is important, the mapping key equality check should be schema (application) specific". However, that's a question for another day.
This means it's not possible to have a general tool or library to canonicalize any YAML file - you don't know how the tags are resolved in the correct context, so you need to either provide that information out of band or say "that's not supported, just emit the canonical YAML directly, that's the only way". Yes, that's a possible tradeoff, but it should be explicitly acknowledged.
Hmm... looks like this should be enough to eliminate exponential blowup, yes. Quadratic will still be there, I think, but that's less of a problem. |
They are equal according to the spec. There are much thornier cases, such as
I agree that the spec does not adequately nail this stuff down. Identity is mentioned, but it's not referred to anywhere and it's not clear whether applications are intended to take any notice of it. Strictly speaking, whether an implementation constructs equal-but-not-identical nodes to the same native value or different native values is a construction detail outside the scope of the spec (in the same way that e.g. integer precision of native types is), but to stop there would be a cop-out; I think that this is an area in which we can and should provide sensible advice, if not commandments. At the very least, we might provide “rigidly defined areas of doubt and uncertainty”. An interpretation where “aliases are only shortcuts” is not consistent with the representation model. Although an implementation may keep track of aliases behind the scenes, the representation model does not contain alias nodes. If two different collections contain the same child node, then which one of them is the “original” reference and which is the alias is a serialization detail, and it wouldn't survive e.g. reordering keys. “Identity is important” has pitfalls of its own, but it's probably closer to actual implementation behavior. I do think that there is also a self-consistent middle ground where node identity is not important, but a representation is a true graph.
In the same sense, it is not possible to have a general tool or library to load any YAML file. Most YAML documents will be parsed to a serialization that has non-specific tags, which require a schema to resolve. Admittedly, most YAML documents will probably use the core schema, but it was a deliberate design decision to allow other schemas. The ship has sailed on that tradeoff. The key point, I think, is that the fundamental “information content” of a YAML document is its representation graph. Once you have a representation graph, you don't need a schema to construct native types or to compare nodes. I suggest that, rather than thinking of a document as having a canonical presentation, you might think of a representation graph as having a canonical presentation. The loading process for an arbitrary document is schema-dependent, but the dumping process for an arbitrary representation can be schema-agnostic simply by being maximally explicit.
I think that in most cases, it should be linear or basically linear. There are cases where things could still get weird, though: - &a [ [ *a ] ]
- &b [ [ [ *b ] ] ]
- &c [ [ [ [ [ *c ] ] ] ] ]
- { *a, *b, *c } Under one reasonable definition, all three nested sequences should be equal, but proving this could require a number of comparisons equal to the LCM of the depths of the lists. Definitely a question requiring serious thought. |
I say they are all different.
|
I don't think so. Like I explained above, the data structures loaded from those are different. |
I think we agree; I'm hair-splitting in my typical way. Just to be maximally clear/explicit:
The question of whether the three examples are “semantically equivalent” is basically a question about what “semantic equivalence” means. It is likely that some implementations may construct identical nodes into a single native data structure, and non-identical nodes into different native data structures. This is not an unreasonable thing to do. If by saying that X and Y are “semantically equivalent”, we mean that implementations should treat them the same way, then we're saying that X and Y may be equal as representations, but not semantically equivalent. Originally, I thought that representation equality should be a sufficient definition of semantic equivalence for the purposes of this discussion. But having considered the practical concerns of reference identity in construction, I agree that a strong notion of semantic equivalence has to respect identity in some way — that it must be a stronger condition than mere representation equality. That said, I still think that a “canonical presentation” should be understood as pertaining to a representation graph. Perhaps one way of expressing this is that two representation graphs should have the same “canonical presentation” if and only if:
So the “canonical presentation” should not add or remove anchors/aliases (for collection nodes), but should leave the alias ”structure” unchanged. However, it may relabel any set of anchors/aliases, and it may rearrange mapping keys in a way that changes which serialization node is the anchor and which serialization nodes are aliases. So the following two documents (under the core schema) are “semantically equivalent” and should produce the same “canonical presentation”: foo: &a [1, 2, 3]
bar: *a bar: &b [1, 2, 3]
foo: *b But the following document is not equivalent to the above and should produce a different canonical presentation, despite having an equal representation graph: foo: [1, 2, 3]
bar: [1, 2, 3] I don't have a strong opinion as to whether the spec should define a notion of canonical presentation, but the underlying principle would seem to apply to the serialization process in the general case: if identity of collection nodes is semantically significant, then we should require that the serialization process respect this and not arbitrarily add/remove anchors and aliases (to collection nodes). To the extent that this may already be implied by the spec, I would support making it explicit. |
@Thom1729 then we have a different definition for the representation graph. |
If two YAML streams are semantically equal, they can be serialized as the very same string. |
Sorry, I was a bit unclear. The nodes are "only shortcuts" in the meaning that logically they're exactly equivalent to repeating the anchored part of the tree in the place of the alias. Yes, that would be a noticeable limitation compared to the YAML of today.
I thought it was the other way round - you can go from the representation graph to the character stream and back without a schema, but need a schema to map the graph to the proper native structures. Is the representation graph supposed to contain already resolved tags, not yet resolved tags or can that vary based on how you get the graph?
I've been thinking worst cases... and I think that if you have N sequence nodes containing each other in various weird ways and have a mapping containing every one of them as a key (and thus have to compare whether the structure of this "knot" looks the same from any two different starting nodes), that's gonna be more like N^2. |
@perlpunk Sorry, I think I'm describing what I mean confusingly. The representation graphs are “equal” according to the definition in 3.2.1.3. Node Comparison (which is used for mapping key uniqueness), but they are not the same, and the difference is significant for exactly the reasons you mention. It's unfortunate (and confusing) that two different, non-interchangeable representation graphs may be “equal”. The existing definition of “equal” is needed for mapping key uniqueness (and possibly for other purposes), but we might want to rename it to something like “loose equivalence” to make it clear that it does not mean that two nodes are totally interchangeable.
Generally, dumping is a process where the implementation makes arbitrary choices (node style, indentation, anchor names, etc) and loading is a process where the implementation throws those choices away. Composition is a sort of exception where you need to know what schema the document author or serializer had in mind. The parsing process takes a character stream and turns it into a serialization tree. In the serialization tree, some nodes may have the “non-specific tags” The serialization stage is the reverse of this: a serializer may use a schema to turn some specific tags into non-specific tags. But it doesn't have to: it can leave every tag fully specified. If it does, then the resulting serialization could be composed without a schema, because there are no questions for a schema to answer. So you can take any arbitrary representation graph and serialize it without a schema (that is, leaving all tags specific), and present it as a character stream, and then parse it and compose it without a schema (because there are no non-specific tags to resolve). But you can't take an arbitrary text stream and parse/serialize it without a schema, because its serialization tree may contain non-specific tags that need to be resolved using a schema. You never need a schema to construct native types from a representation graph. The constructor sees only the representation graph, with its fully resolved tags; it doesn't know or care what schema was used to compose that representation graph. |
Does this mean that if I load the same document into two different native data structures, from the YAML point of view that looks like building two different representation graphs (with the same node contents, but different tags) and then constructing native types for both of them? |
if I load the same document into two different native data structures
I'm not sure what you mean by this, can you clarify?
…On Thu, Jul 7, 2022, 18:29 Denis Lisov ***@***.***> wrote:
You never need a schema to construct native types from a representation
graph. The constructor sees only the representation graph, with its fully
resolved tags; it doesn't know or care what schema was used to compose that
representation graph.
Does this mean that if I load the same document into two different native
data structures, from the YAML point of view that looks like building two
different representation graphs (with the same node contents, but different
tags) and then constructing native types for both of them?
—
Reply to this email directly, view it on GitHub
<#289 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEBMII3372A37N7277F3XFTVS5K3HANCNFSM524L3YBQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
In Rust with the let just_vectors: Vec<Vec<String>> = yaml_lib_name::deserialize(data)?; // just any list of lists of strings will go
let list_of_triples: Vec<(String, String, String)> = yaml_lib_name::deserialize(data)?; // the nested vecs have to be length 3
let list_of_structs: Vec<MyStructName> = yaml_lib_name::deserialize(data)?; // "tuple struct" serializes as a list too Do these three deserialize calls logically build different representation graphs (with different tags) from the YAML point of view? Coming back to our topic: would canonicalization support for this API be expected to invent some names for these implicit tags and include them in the serialized version when serializing these structures? |
As an author of data structures in YAML, I consider the change from non-aliased to aliased nodes (or back) a significant edit. I'd like to point out, that even (changed) anchor/alias identifiers can convey meaning. Maybe step back a little from the problem and come back to the motivating cause?
This compares to easier code review when the code is always (automatically) formatted the same way. Maybe try another analogy for anchor/aliases: symbolic file links (symlinks)?
But this analogy fails in the aspect that the anchor/alias identifier is neither the symlink name, nor its target name -- both being vertices in the file system graph -- but rather a name for the edge. |
In a word, no. Construction is the process that turns a representation graph into native types. This process varies widely among implementations, partly because different implementation environments have different native types. But even a particular implementation may support different construction behaviors. For example a JavaScript YAML library may allow you to specify whether scalar nodes with the So if I understand the code sample correctly, then each line will cause the YAMl library to parse the raw YAML text into a serialization tree, compose it using whatever schema is the default (presumably the core schema) into a representatinon graph, and then construct the representation graph into the indicated native types for each line. In all three lines, I would expect the parsing and composition to be the same, for the implementation to use the same schema, and for the representation graphs to be fully equivalent. Note that the above is a conceptual description. It is possible that the internal implementation of the library may not strictly follow all of the steps in order. This is fine as long as it ends up with a correct result in the end.
I'm not sure exactly what you mean by “implicit tags”. Suppose we have the following document: - [a, b, c]
- [1, 2, 3] The document has “implicit tags” in the sense that the presentation does not have explicit tag properties in it (like !<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:str> a
- !<tag:yaml.org,2002:str> b
- !<tag:yaml.org,2002:str> c
- !<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:int> 1
- !<tag:yaml.org,2002:int> 2
- !<tag:yaml.org,2002:int> 3 The above is what a “canonical presentation” of that representation graph might look like — which is to say, it is a “canonical presentation” of the first document as interpreted under the core schema. If the first document were interpreted under a different schema, then it might result in a different representation graph, which would have a different “canonical presentation”. But in the “canonical presentation”, all tags are completely explicitly specified, so it would produce the same representation graph under any schema. All of this is separate from the question of what native data structures might be constructed from the representation graph. The representation graph doesn't know anything about Rust's tuples or vectors. All it knows is that the sequence nodes all have the tag |
Yeah, I think we have a consensus on that.
It may be the intent of the author to convey meaning that way, but the YAML data model treats alias/anchor identifiers as a serialization detail. These identifiers are not part of the representation graph and a YAML processor may freely add anchors to nodes, remove anchors from nodes, or rename anchors and aliases, as long as this does not change the structure of the document (including which nodes are aliased to which other nodes).
In a file system, link names are significant, but in YAML, anchor and alias names are not significant (except insofar as they encode the structure of the document). Also, YAML aliases are like hard links, not symlinks — the serialization node with the anchor isn't the “real” one, it's just the one that happens to appear first in the serialization tree. If I had to encode a filesystem in YAML, it might look something like this: aFile: &a !file Hello, World!
aDirectory: !directory
anotherFile: !file Goodbye, World!
hardLinkToAFile: *a
symlinkToAFile: !symlink /aFile The paths Does this make sense? |
Yes, thanks for the clarification. As to the anchor/alias names not being significant, I understand that this is according to the current YAML specification. Coming back onto the original motivation:
If (part of) the documentation / specification of a data structure is transported in the comments and/or anchor names, these details become unchangeable regarding signing and verification[1]. This problem is solvable through signing the textual representation. If only the behavior of the system that picks up the YAML data is significant, the problem is highly dependent on the application(s) -- and cannot be solved at YAML specification level. What exactly shall be solved by c14n on an intermediate level? I think that goal ought to be better defined. Or maybe, there is a spectrum of grades of YAML equality, and we need a multivalued (if not multidimensional) c14n. [1] https://en.wikipedia.org/wiki/Verification_and_validation |
For use cases like cryptographic signing, reliable diff, Verifiable Credentials:
it would be useful to have a defined algorithm for YAML canonicalization/normalization (c14n).
I.e. when c14n is enabled:
c14n is addressed for various information standards:
Links:
The text was updated successfully, but these errors were encountered: