This repository has been archived by the owner on Aug 25, 2022. It is now read-only.

Revision support - Possibly move to rdf_entity 2.x #83

Open
idimopoulos opened this issue Feb 7, 2019 · 5 comments

@idimopoulos
Contributor

After a discussion today with @sandervd and @brummbar, we came up with the following template/suggestion, so that remarks can be added and a complete solution can take shape.

Purpose of the issue: Support revisions

Current state and problematic pieces

A bit of history

The rdf_entity module provides a layer for storing entities directly in the triplestore. The idea behind it is that every property of every field has a predicate URI mapped to it, and this URI is used as the storage identifier in the database. Properties without a mapped URI are not stored and are simply skipped.
The other major factor of the module is that it uses graphs in order to store entities separated by bundle. That means that each bundle has its own graph, rather than each entity type. This approach, while it seemed nice at first, is not ideal, as one cannot enforce that all objects (entities) that have a specific semantic meaning will live under graphs split by their type.

Where the problems start

The last major factor - Triples

The last major factor in our decisions is the triplestore (or quadstore, as we use it) itself. In a triplestore, everything is described using triples. This has advantages and disadvantages, but what really matters for us is that it is less flexible than SQL in the sense that you cannot have more than three "columns" to describe something.
For example, in SQL, each field has a table where each entry stores the entity_id, the revision_id, the delta, the value and any other properties the field requires. A structure is created and one entry is stored for each delta of each revision of each entity.
In SPARQL, however, you need to find a way to do this with triples without breaking the structure of the entity, while still allowing proper querying (so no serialized cheats) and without breaking the ontology (keeping one predicate per property). That means you would have to store something like

entity1 field1 en-uk
entity1 field1 0
entity1 field1 test_value
entity1 field1 en-us
entity1 field1 2
entity1 field1 test_value2

Just by looking at the above, the problem is already visible: if you query for field1 of entity1, you cannot tell which property belongs to which delta. The order in which the data was stored is no help either, as a triplestore neither stores triples nor returns results in insertion order.
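The ambiguity can be sketched in a few lines of Python (the data is the hypothetical example above; a real store would hold URIs):

```python
# Triples modeled as (subject, predicate, object) tuples. A triplestore
# keeps these as an unordered set, so insertion order carries no meaning.
triples = {
    ("entity1", "field1", "en-uk"),
    ("entity1", "field1", "0"),
    ("entity1", "field1", "test_value"),
    ("entity1", "field1", "en-us"),
    ("entity1", "field1", "2"),
    ("entity1", "field1", "test_value2"),
}

# Querying for field1 of entity1 returns a flat bag of objects ...
objects = {o for (s, p, o) in triples if s == "entity1" and p == "field1"}

# ... with no way to group langcode, delta and value back together:
assert objects == {"en-uk", "0", "test_value", "en-us", "2", "test_value2"}
```

Any grouping (per delta, per langcode) is lost the moment the triples are written, which is exactly the problem described above.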

While the delta-specific problem belongs in another issue, it remains relevant to the following sections.

The string ID

One of the things that makes the semantic web so appealing is the identification of its objects (or entities) by a unique URI. For us, that means that, unlike nodes, we are using URIs to identify the entities. This brought up many issues in the past, as many modules did not yet support string IDs, but we got over that. Why is that important here, though?
About a year and a half ago, we also had to somehow support multiple versions in the Joinup project. Naturally, supporting revisions just like core does was one of the ideas. However, there were a few issues:

  • The fact that we use a graph per bundle means that all properties reside under the same graph.
  • The semantic web is a way of describing entities. For the triplestore (or quadstore) that we use, each property can have a triple of properties describing it. As described in the previous section, that means that since all versions of the entity would share the same URI, there would be no way to distinguish which property belongs to which version.
  • Different IDs cannot exist for the same entity, so an entity with ID http://example.com/rdf/1 cannot automatically also have http://example.com/rdf/1/version/2; the implications are multiple. Such a derived ID could collide with the ID of another entity (not a revision), and the queries would be a nightmare if we had to concatenate IDs.

Rdf Draft

That is when the rdf_draft module came into play. The need in Joinup was that only up to two revisions can exist at any given time: a published and an unpublished one. Since a history of changes was not a requirement, the solution came from the graphs themselves.
For each bundle, a second graph was created, separating the two entities and providing a publication status for the entity. For us, those graphs took the form http://joinup.eu/<bundle>/[published|draft].
This already solves many of the needs that might come up, but it also came with some limitations:

  • The fact that we had already split bundles of an entity type into graphs means that we have to take care of many parameters when we perform CRUD operations on entities residing in specific states (i.e. we don't only query entities in specific states).
  • Workarounds also had to be implemented for other cases, like supporting search_api natively.
  • The states of an entity are limited. While the module covered the case where two versions exist at the same time, it took a lot of work to support just one more, and while not the same effort is needed for every subsequent version, it still requires manual development work and quite a good understanding of the overall module to add any more. Even enabling the draft version requires some manual work and understanding.
  • The number of versions is finite. Because we manually add a new graph for each new state we want our entity to have, we can only create a specific number of them, and this number does not scale automatically. Revisions are a no-go for this module.
  • The entity is split. Manual intervention is needed in order to query all versions: all graphs must be added to the query. And that is regardless of whether a field like the moderation status field is used.
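The scaling problem can be sketched in Python (graph URIs follow the pattern above; the bundle names are hypothetical):

```python
BASE = "http://joinup.eu"
BUNDLES = ["collection", "solution"]  # hypothetical bundle names
STATES = ["published", "draft"]       # rdf_draft: one graph per bundle per state

def graph_uri(bundle, state):
    # rdf_draft-style graph, split by bundle and by state.
    return f"{BASE}/{bundle}/{state}"

# To query across all versions, every graph must be listed explicitly
# in the SPARQL query (FROM / FROM NAMED clauses):
graphs = [graph_uri(b, s) for b in BUNDLES for s in STATES]
assert len(graphs) == len(BUNDLES) * len(STATES)

# Adding one more state means one more graph for *every* bundle, which is
# why an open-ended history of revisions cannot scale with this scheme.
```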

Revisions

Supporting revisions involves a few ideas, parameters and a couple of compromises:

  • Since we are going to support such a major feature, it has to be done in a way that is split from the main module, so that the main module can work independently.
  • We are going to try to mimic the node revision system as much as we can, so that we follow best practices and make it more understandable to users.
  • We are still going to split entities under specific graphs; however, we are going to make it simpler.

Drop of the rdf_draft module

The rdf_draft module is a nice implementation but should become a legacy of the past. Apart from the fact that a lot of issues have come up due to the multiple graphs we need to support, it would directly conflict with the implementation of revisions.

Graph structure

Given the problems above, we are going to drop support for a graph per bundle and follow the notion of Drupal 8 entities, where each entity type has a table which stores the base fields of the entity. Unlike Drupal, however, we are going to keep everything within a single graph.
Without addressing revisions yet, that means that every entity type will have only one graph to look into for anything.

Revisions

Revisions, like rdf_draft, will reside in a separate module. That requires that the storage class (or entity class) be overridden by the new module in order to support new methods like ::allRevisions() (corresponding to ::allRevisions() from the NodeStorage class).
However, since we are already trying to split the rdf_entity module, we can use this module to simply include an interface and a revision trait for each entity type that is defined and wants to use the revision system.
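As a rough model of the intended storage contract (a Python sketch only; the real implementation would be a PHP storage handler, and only the ::allRevisions() name comes from the text above):

```python
class RevisionAwareStorage:
    """Toy model of a revision-aware storage handler."""

    def __init__(self):
        # {entity_id: {revision_id: field values}}
        self._revisions = {}

    def save_revision(self, entity_id, revision_id, values):
        self._revisions.setdefault(entity_id, {})[revision_id] = values

    def all_revisions(self, entity_id):
        # Mirrors the ::allRevisions() method the revisions module would
        # add to the storage class: every stored revision ID, oldest first.
        return sorted(self._revisions.get(entity_id, {}))

storage = RevisionAwareStorage()
storage.save_revision("http://example.com/rdf/1", 1, {"label": "v1"})
storage.save_revision("http://example.com/rdf/1", 2, {"label": "v2"})
assert storage.all_revisions("http://example.com/rdf/1") == [1, 2]
```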

How issues will be addressed

During the discussion, we came up with the following structural details to address all of the above issues.

Revision graph

The revision graph will also belong to a specific entity type; it can be defined in the annotation of the entity type or in the mapping entity that we currently support. Each entity type's revision graph will be solely for internal use and should not be exposed if there is an exposed endpoint.
The name of the graph is user defined.

Identification of the entities

The revision graph will be a pool of data from all revisions of the entities. As with nodes, even the current revision will exist in the graph.
Since we define revision graphs as being solely for internal usage, the IDs of the entities can be arbitrary and different from the original IDs. This gives us the ability to create IDs like http://<random alphanumeric string>.com/<entity type>/revision/<revision serial id>. The serial number can be global or per entity. If global, since the triplestore does not have serial numbering, it has to be stored in Drupal or be determined on the fly when a new revision is stored. The latter is the better solution, if only because migrating data will not break the structure.
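The on-the-fly serial numbering could look like this (a minimal Python sketch; the host name and helper names are hypothetical):

```python
def next_revision_serial(existing_serials):
    # Determined on the fly from what is already in the revision graph,
    # so migrated data cannot break the sequence.
    return max(existing_serials, default=0) + 1

def revision_uri(entity_type, serial, host="http://internal.example.com"):
    # Revision graphs are internal-only, so the URI host can be arbitrary.
    return f"{host}/{entity_type}/revision/{serial}"

serial = next_revision_serial([1, 2, 5])  # continues after the highest stored serial
uri = revision_uri("rdf_entity", serial)
# uri == "http://internal.example.com/rdf_entity/revision/6"
```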

Connection with original entity

The idea is that the revision entities will use a property like dcat:isVersionOf to link to the original content. A possible complication here is that the original entity might already have a property mapped to the dcat:isVersionOf predicate, so another predicate would probably be used instead, something like <base_url>/drupalIsVersionOf.
Additionally, the revisions submodule can define, for all entity types that have a revision graph, an additional base field mapped to something like <base_url>/drupalRevisionId, which also maps back to the current revision ID.

Additional properties

Every revision should include the following properties apart from drupalIsVersionOf:

  • Revision ID. This is the serial number that will also be used to construct the revision URL.
  • Revision timestamp. This will be used to determine when the revision was created and the order of the revisions.
  • Revision updated. As with node revisions, this can be used by constraints to determine whether the revision can still be edited or another revision has been updated more recently.
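Put together, the triples stored for a single revision could look as follows (a Python sketch; the drupal* predicate names follow the suggestion above and are not final):

```python
BASE = "http://example.com"  # hypothetical base_url

def revision_triples(revision_uri, original_uri, revision_id, created, changed):
    # One triple per revision property, keeping one predicate per property.
    return [
        (revision_uri, f"{BASE}/drupalIsVersionOf", original_uri),
        (revision_uri, f"{BASE}/drupalRevisionId", revision_id),
        (revision_uri, f"{BASE}/drupalRevisionTimestamp", created),
        (revision_uri, f"{BASE}/drupalRevisionChanged", changed),
    ]

triples = revision_triples(
    "http://internal.example.com/rdf_entity/revision/6",
    "http://example.com/rdf/1",
    6, 1549497600, 1549501200,
)
```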

Drop support of states

rdf_draft enforces the idea of states within the rdf_entity module. However, a revision is not necessarily the entity in a different state, but rather part of its history. Support for states can still be attained, but this will be irrelevant to the rdf_entity structure and only relevant to the corresponding status field.

Conclusion and Compromises

  • While the idea of limiting objects of a certain meaning under a specific graph is already marked as not ideal, we can, on an organizational level, decide to do so. Keeping entities of the same ontology within a certain graph cannot always be guaranteed, but it certainly keeps content a bit more organized and makes it more flexible to query and use.
  • With these changes, we might need a new major version. If so, it might be tough to have an upgrade path from version 1 to version 2, and Joinup might be stuck with version 1 at least for a while.
  • The above description is surely far from what we currently have in Joinup, but it would also solve all the issues we had to work around in the past:
    ◦ All queries can be supported by default.
    ◦ Entities existing in the published graph can simply be moved over to the main entity type graph, and the update path is complete.
    ◦ Enabling revisions only requires copying the version from the published graph over to the revisions graph.
    ◦ Upgrading from rdf_draft only requires creating a new revision in the revisions graph.
    ◦ Federation support is easily achievable by adding a new state, 'federation', to the status field (other statuses being published or unpublished; this is not about the state_machine state field we are using). Entities with the federation status are simply candidates for becoming the new revision of the entity.
@sandervd
Contributor

sandervd commented Feb 7, 2019

Thank you for your thoroughness @idimopoulos !

@Roensby

Roensby commented Jun 24, 2019

Very interesting read. I was considering an alternative approach (which is nowhere near as thought through as yours), involving the use of the Drupal database to store revisions.

Apologies for my poor use of terminology, I'm still learning the language around RDF.

My proposal makes a few assumptions:

  • RDF triples are normally changed (revisioned) with a similar frequency to other entities (such as nodes), not much more often.
  • Revisions are only relevant when looking at a single RDF entity, for the purpose of restoring that revision.
  • To expand on that point, people normally don't want revisions to be queryable via SPARQL; they just want changes to be tracked and revertable (that is my use case).
  • Further, revisions, being a reflection of the entity model, would not track the RDF relations that are available within the triple store.

With regards to implementation, a module derived from the triple storage module and a version of the rdf_draft module could allow the current version of the RDF entity to be written to the triple store, while revisions are written to the Drupal database (for example using the entity model, and only using the triple store when reverting to a previous version).

This solution would not allow for inspecting an old version of the RDF graph, and wouldn't track RDF relations (as far as I understand the implementation in Drupal) but would be relatively simple to implement.

@sandervd
Contributor

Hi @Roensby,
I started a proof of concept for implementing a revision system a while ago; you can find the related code in the refactor-graphloader branch.

This PoC creates a new 'RDF entity' (collection of triples with the same subject) for each revision.
Each revision has a reference to the same 'published' RDF entity, with a fixed predicate.

I had a similar idea as yours at some point, but the necessity for being able to query revisions made me take the other path.

I was thinking of simply serializing the entities on disk as ttl files and not even putting them in the database. Simply giving them incremental filenames should be enough; in the end, if you can't query the data, I didn't see much point in putting it in a DB. You could come up with a way of attaching some metadata, so a DB could make sense for some use cases, I guess.

I'm very curious to learn about your use case though!

@Roensby

Roensby commented Jun 24, 2019

Hi @sandervd, thanks for engaging. I'm going to check out your poc.

A little background for my use case: traditionally, Drupal uses RDF on a per-field basis. For example, a node can have an author field field_author to designate one or more authors (typically as either strings or entity references). The Drupal RDF module can then attach a predicate to the field (such as schema:author). The implication is that the node is the subject and the author string is the object.

In my opinion, this way is old-fashioned because it requires constant changes to the data model (in this case a node), which creates extra work in a decoupled setup where Drupal is just the content layer and something else provides the presentation layer (for example).

Instead, I use a metadata field (e.g. field_metadata) to store entity references to entities that implement RDF triples. This allows me to eliminate speciality fields such as field_author and keep my data model nice and clean. It also helps the other team (in a multi-team context) that works on the presentation layer, because they can treat the metadata field as a transport mechanism for whatever metadata is relevant for their needs.

There's obviously more to it, but treating my metadata this way allows me to integrate Drupal with a triple store in a conceptually simple way.

The rdf_entity module seems like a perfect fit for this use case. However, because my nodes already take care of versioning the most important relation (which triples they reference through the metadata field), I don't particularly need the triple store to remember past relations.

For my use case, it seems more prudent to track the actual content of the RDF triples (such as changes to the author name). For this, Drupal built-in revision support using SQL tables may be sufficient.

Btw, I commented to vocalise my idea with the intention of writing the necessary code myself. I'm still not convinced of the viability, but on the face of it, my proposal seems technically (if not conceptually) much, much simpler than implementing custom revision support on top of a triple store.

I hope it makes sense, and again, thanks for engaging!

@sandervd
Contributor

sandervd commented Jul 2, 2019

Hi @Roensby,

You could also have a look at this project: http://d2rq.org/
