Make projected dataset in DynamicDataset accessible #1557

Closed
Aklakan opened this issue Sep 28, 2022 · 15 comments · Fixed by #1558
Labels
enhancement Incrementally add new feature

Comments

@Aklakan
Contributor

Aklakan commented Sep 28, 2022

Version

4.7.0-SNAPSHOT

Feature

As a follow-up to the performance issue encountered in "Counting all triples performance and named graphs", I created an assembler + syntax transform that pushes FROM (NAMED) into filters over graphs, as illustrated below:

SELECT * FROM <a> FROM <b> { ?s ?p ?o }
becomes
SELECT ?s ?p ?o { GRAPH ?g { ?s ?p ?o } FILTER (?g IN (<a>, <b>)) }

Of course this is a non-standard interpretation because it misses the RDF merge / DISTINCT operations mandated by the SPARQL spec. However, for large datasets, being able to explicitly enable this interpretation via a custom assembler backed by a custom (QueryEngineFactoryFromAsFilter, DatasetGraphFromAsFilter) pair seemed reasonable.
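For illustration, a minimal sketch of such a syntax-level transform might look as follows. The class and method names are made up, it only handles FROM (not FROM NAMED), and it assumes that the query does not already use ?g and that Query.getGraphURIs() exposes a modifiable list:

import java.util.List;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.query.Query;
import org.apache.jena.sparql.core.Var;
import org.apache.jena.sparql.expr.E_OneOf;
import org.apache.jena.sparql.expr.ExprList;
import org.apache.jena.sparql.expr.ExprVar;
import org.apache.jena.sparql.expr.NodeValue;
import org.apache.jena.sparql.syntax.ElementFilter;
import org.apache.jena.sparql.syntax.ElementGroup;
import org.apache.jena.sparql.syntax.ElementNamedGraph;

public class FromAsFilterRewriter {
    /** Turn FROM <a> FROM <b> ... into GRAPH ?g { pattern } FILTER(?g IN (<a>, <b>, ...)). */
    public static Query rewriteFromAsFilter(Query query) {
        List<String> graphUris = query.getGraphURIs();
        if (graphUris.isEmpty())
            return query;
        Query q = query.cloneQuery();
        q.getGraphURIs().clear();                     // assumes the returned list is modifiable
        Var g = Var.alloc("g");                       // assumes ?g is not already used in the query
        ExprList graphExprs = new ExprList();
        graphUris.forEach(uri -> graphExprs.add(NodeValue.makeNode(NodeFactory.createURI(uri))));
        ElementGroup group = new ElementGroup();
        group.addElement(new ElementNamedGraph(g, q.getQueryPattern()));
        group.addElement(new ElementFilter(new E_OneOf(new ExprVar(g), graphExprs)));
        q.setQueryPattern(group);
        return q;
    }
}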

The issue is that Fuseki clears the FROM (NAMED) clauses in SPARQL_QueryDataset,
so my QueryEngineFactoryFromAsFilter never gets a chance to perform the rewriting.

The relevant snippet is:

public Pair<DatasetGraph, Query> decideDatasetDynamic(HttpAction action, Query query, String queryStringLog) {
        DatasetGraph dsg = getDataset(action);
        DatasetDescription dsDesc = SPARQLProtocol.getDatasetDescription(action, query);
        if ( dsDesc != null ) {
            dsg = DynamicDatasets.dynamicDataset(dsDesc, dsg, false);

My impression is that DynamicDatasets.dynamicDataset is NOT needed at this point. As expected, I see that QueryEngineBase and QueryEngineTDB already call DynamicDatasets - so is there a reason to also have this in Fuseki?
I locally changed the method to only push the protocol graphs (if given) into the query (and thus let the QueryEngineFactories handle the rest) and there are no failing tests.

Are you interested in contributing a solution yourself?

Yes

Aklakan added the enhancement (Incrementally add new feature) label Sep 28, 2022
@afs
Member

afs commented Sep 28, 2022

This is not a good idea. There is no guarantee that the downstream query engine supports dynamic datasets. Just because the main implementations do does not make it a contract. For example, union graph handling is different.

Putting all transformation in one place is better. That is what clearing the query is doing - signalling that there is no further work.

The correct way is to add a custom SPARQL_Processor (inherit from SPARQL_QueryDataset) if you want different behaviour.

I don't see why you want to intercept the query - why send the wrong query in the first place?

@afs
Member

afs commented Sep 28, 2022

DynamicDataset work looks reversible.

  1. It is a specific DatasetGraph implementation and you can get the original.
  2. Changes are recorded.
        // Record what we've done.
        // c.f. "ARQConstants.sysDatasetDescription" which is used to pass in a DatasetDescription
        dsg2.getContext().set(ARQConstants.symDatasetDefaultGraphs, defaultGraphs) ;
        dsg2.getContext().set(ARQConstants.symDatasetNamedGraphs,   namedGraphs) ;
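(For illustration, a downstream component could read those recorded values back roughly like this; it assumes both context values are collections of URI Nodes, per the snippet above.)

Context cxt = dsg2.getContext();
Collection<Node> defaultGraphs = cxt.get(ARQConstants.symDatasetDefaultGraphs);
Collection<Node> namedGraphs   = cxt.get(ARQConstants.symDatasetNamedGraphs);
if ( defaultGraphs != null && namedGraphs != null ) {
    DatasetDescription desc = new DatasetDescription();
    defaultGraphs.forEach(n -> desc.addDefaultGraphURI(n.getURI()));
    namedGraphs.forEach(n -> desc.addNamedGraphURI(n.getURI()));
    // ... use desc to reconstruct the original FROM / FROM NAMED clauses
}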

But it is still the case that some day you will want to intercept Fuseki query processing - working with a collection of ARQ tweaks is restricting. Changing things one-by-one is more work overall, and isn't guaranteed to be possible.

Put the extension in and have a single place to alter the system behaviour and make life easier for your group.

@Aklakan
Contributor Author

Aklakan commented Sep 28, 2022

I don't see why you want to intercept the query - why send the wrong query in the first place?

Performance

The performance degradation introduced by using FROM clauses on our large datasets ranges from "noticeably slower" through "unbearably slow" to "doesn't work anymore". For example, the aforementioned counting of triples no longer returned results.

It's certainly an ugly tradeoff - but between "strictly standard-conforming result set but unbearably slow" and "usually the same result set (unless there are duplicates in the unioned graphs) in a reasonable amount of time", the latter is the practically relevant one our partners want to see. In particular, we know that our graphs are disjoint.

Upcoming Use Case: Graph Groups

Furthermore, a general requirement we now have in one of our projects is to allow queries to refer to
"virtual" graph names in the FROM clause which can be transparently remapped to the physical ones.
This is essentially Virtuoso's graph groups, where FROM (NAMED) clauses can
be "macro expanded"; e.g. FROM <x> is conceptually turned into FROM <a> FROM <b> ....

It is very easy to implement these kinds of features at the ARQ level. All that would be needed is a way to pass the original dataset graph and dataset description from Fuseki down to ARQ.

Of course, things would have to be wired up in more sophisticated ways with Fuseki if one wanted to add user-sensitive rules, e.g. when authenticated as user X, graph G maps to (A, B), while as user Y, graph G maps to C. But that's not (yet) what we need.

DynamicDataset work looks reversible.

The wrapped original dataset is not accessible (the "projected" attribute) :( (unless using brittle reflection).
I would also be fine if unwrapping the original Dataset and DatasetDescription via DynamicDatasetGraph were possible.

public static class DynamicDatasetGraph extends DatasetGraphReadOnly implements DatasetGraphWrapperView {
    private final DatasetGraph projected;

    public DynamicDatasetGraph(DatasetGraph viewDSG, DatasetGraph baseDSG) {
        super(viewDSG, baseDSG.getContext().copy());
        this.projected = baseDSG;
    }
}
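(What I'm asking for essentially amounts to a getter along these lines; the method name is only illustrative, not a proposal for the final API:)

/** Expose the original (pre-projection) dataset this view was constructed over. */
public DatasetGraph getOriginal() {
    return projected;
}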

@afs
Member

afs commented Sep 28, 2022

The wrapped original dataset is not accessible

That would be a small change if it does not break AccessCtl_SPARQL_QueryDataset and the code around that area.
Perhaps via a context setting so it can be undone. Haven't investigated.

@Aklakan
Contributor Author

Aklakan commented Sep 28, 2022

Any approach that would allow us in the future to just drop our plugin JAR bundle into a vanilla Fuseki and thereby have these dataset-description query-rewriting features ready for dispatch via assembler configuration would be very much appreciated!

@Aklakan
Contributor Author

Aklakan commented Sep 28, 2022

I just realized that with SERVICE requests one only gets the Op representation, which means the dataset description is lost. Therefore, an unwrappable DynamicDatasetGraph might be the more powerful/flexible approach.

@afs
Member

afs commented Sep 28, 2022

Upcoming Use Case: Graph Groups

Have you considered GRAPH - which is what other systems use? It is also in the algebra.

Now, saying there is a bigger agenda makes it impossible to assess incremental changes, because there is every chance they will become irrelevant and, if across a release, tech debt.

The Jena project has no idea who "your partners" are nor what their agenda is.

There is a public community. That is all. Use cases and requirements become out of date. Adding features to the core system that then don't get used or aren't understood is technical debt. Just keeping the codebase going takes a level of resource.

The performance degradation introduced by using FROM clauses on our large datasets

That depends on the datasets.

What is the real cause?

@Aklakan
Contributor Author

Aklakan commented Sep 28, 2022

My intent in mentioning "graph groups" was not to wave around "here is a bigger agenda" but rather to give another example of a quite well-known feature that could be implemented on top of this, because its essence is also the modification of a query's FROM/FROM NAMED clauses.
The intent was to make the case stronger - not to weaken it. If my statements in that regard create confusion, then please forget I ever mentioned "graph groups".

  • have you considered GRAPH - which is what other systems use? It is also in the algebra.

Maybe this wasn't clear: we want to use GRAPH internally, but we don't have control over the query. It comes with FROM! If it weren't for the FROM, then Fuseki wouldn't intercept it, we could just freely rewrite it to our liking in ARQ, and we'd have saved all this discussion!

@namedgraph
Contributor

Can't you rewrite the query instead of rewriting Fuseki? :)

@Aklakan
Contributor Author

Aklakan commented Sep 28, 2022

If only there were a SPARQL server that supported rewriting queries via custom ARQ-based query engines...

Seriously, I really don't know how to get the point across that with Fuseki it's already possible to rewrite ANY query UNLESS it makes use of FROM (NAMED). I am asking to lift this restriction.
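(For context, this is the kind of ARQ hook I mean - a minimal sketch with the class name and the rewrite helper made up; it rewrites the query and delegates everything else to the standard engine:)

import org.apache.jena.query.Query;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.engine.Plan;
import org.apache.jena.sparql.engine.QueryEngineFactory;
import org.apache.jena.sparql.engine.binding.Binding;
import org.apache.jena.sparql.engine.main.QueryEngineMain;
import org.apache.jena.sparql.util.Context;

public class QueryEngineFactoryFromAsFilter implements QueryEngineFactory {
    private final QueryEngineFactory std = QueryEngineMain.getFactory();

    @Override public boolean accept(Query query, DatasetGraph dsg, Context cxt) {
        // Only claim queries that actually carry a dataset description.
        return !query.getGraphURIs().isEmpty() || !query.getNamedGraphURIs().isEmpty();
    }
    @Override public Plan create(Query query, DatasetGraph dsg, Binding input, Context cxt) {
        Query rewritten = FromAsFilterRewriter.rewriteFromAsFilter(query);   // e.g. the transform sketched earlier
        return std.create(rewritten, dsg, input, cxt);
    }
    // Op-level requests no longer carry FROM clauses, so this factory does not claim them.
    @Override public boolean accept(Op op, DatasetGraph dsg, Context cxt) { return false; }
    @Override public Plan create(Op op, DatasetGraph dsg, Binding input, Context cxt) {
        return std.create(op, dsg, input, cxt);
    }
}
// Registration, e.g. from an assembler or an ARQ init hook:
//   QueryEngineRegistry.addFactory(new QueryEngineFactoryFromAsFilter());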

Aklakan added a commit to Aklakan/jena that referenced this issue Sep 28, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Sep 29, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Sep 29, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Sep 29, 2022
Aklakan added a commit to AKSW/jena that referenced this issue Sep 29, 2022
Altered DynamicDataset for From-As-Filter
@Aklakan
Contributor Author

Aklakan commented Sep 29, 2022

For my remarks on making DynamicDataset unwrappable and a minor issue about the stability of the graph order, see here: #1558 (comment)

@afs
Member

afs commented Sep 29, 2022

The question was why is the wrong query being asked in the first place. You say you have no control over the query. Where is it coming from?

The implication I take from all this is that the motivation is "to be like Virtuoso". That implies all Virtuoso features are replicated - not point-by-point; it'll end up going round in a loop in the next PR.

Virtuoso is RDF 1.0. It has SQL expression evaluation semantics.

possible to rewrite ANY query UNLESS it makes use of FROM (NAMED)

There is no exception. The operation handlers in the server for all and any functionality can be replaced. They are not special; they are the default registrations. Nothing in Fuseki is hardwired except for the dispatch process.

Any query can be transformed in Fuseki. Add a custom operation implementation and handle the incoming request.

SPARQLQueryProcessor provides most of the work if you want to use it; it is not mandated that you do. It has several extension points.

SPARQL_QueryDataset is very short - the only thing it does is decideDataset - so alternative choices can be implemented and still reuse the bulk of the query execution machinery (and note getDataset - the dataset found during routing is available).
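(A rough sketch of that extension point; the class name is invented and the exact decideDataset signature should be checked against the current code:)

public class SPARQL_QueryDatasetCustom extends SPARQL_QueryDataset {
    @Override
    public Pair<DatasetGraph, Query> decideDataset(HttpAction action, Query query, String queryStringLog) {
        // Return the routed dataset unchanged and keep FROM / FROM NAMED in the query,
        // leaving dynamic-dataset handling to a custom ARQ query engine.
        return Pair.create(getDataset(action), query);
    }
}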

Specifically to dynamic datasets: it records the original dataset description so all query information is available.

We still don't know where the costs are coming from - only "our datasets" and "FROM", which is way too far away to point to the code in question. The fact that you are counting (counting is optimized for TDB) matters.

@Aklakan
Contributor Author

Aklakan commented Sep 30, 2022

Specifically to dynamic datasets: it records the original dataset description so all query information is available.

Yes, maybe you haven't seen it, but I changed the PR to only make the private "projected" (original) dataset accessible.
That's the only piece of information I am missing in order not to have to bother you with this issue anymore.
DynamicDataset is also in ARQ - this way Fuseki doesn't have to be touched at all.
I can change the issue/PR titles because they are now misleading.

Any query can be transformed in Fuseki. Add a custom operation implementation and handle the incoming request.

That's technical debt on our side:

  • It's not a good solution because the rewriting is only an ARQ-level operation which shouldn't escalate to or depend on the server. Checking whether a Dataset is dynamic is also independent of Fuseki because, fortunately, DynamicDataset is in ARQ.
    Conversely, an ARQ plugin doesn't need to check whether a Dataset originates specifically from Fuseki.
  • With the way ACL is implemented in Fuseki, our rewriting based on DynamicDataset would (coincidentally) still work. Conversely, we don't want the prospect of potentially having to subclass every possible servlet when the task at hand is unrelated to them.
  • Right now, we have an ARQ plugin bundle which we just copy into the Fuseki extra folder. If we touched the servlets we'd need a Fuseki-plugin bundle (or would we even have to maintain our own Fuseki fork?).

Virtuoso [...]

Let's leave it out because it adds nothing to the discussion.

Where is it coming from?

Apparently I am biased, but among my peers it's a common design pattern to have some form of query (template) catalog that references the target graph with FROM (typically a single one).
For example, there is a process where a query initially uses FROM <latestEvents>, which then gets transformed into FROM <events2022-09-30>.
Furthermore, maintaining multiple versions of the same query (FROM vs GRAPH) is not acceptable because it quickly becomes unmanageable, so there must be automatic transformations wherever possible! Within the processing pipelines, the FROM clause can then be easily remapped to the final locations - much more easily than GRAPH (in the worst case this could even be a regex hack).

And this leads to the issue of slower requests in Fuseki+TDB2 due to DynamicDataset wrapping.

  • One solution for that is to (hard-)code a specific rule that prevents the DynamicDataset wrapping when a query has just one FROM clause against TDB (considering all possible corner cases). If you want to add specific code for that case, then that would help us too:

If I am not mistaken, then under certain conditions FROM and GRAPH are equivalent (w.r.t. a given context; active graph and such), such as:

SELECT (COUNT(*) AS ?c) FROM <x> { ?s ?p ?o }
SELECT (COUNT(*) AS ?c) { GRAPH <x> { ?s ?p ?o } }
  • The other solution - the one that I favor - is to have a general way to rewrite queries with FROMs that originate from Fuseki using ARQ plugins, without having to touch Fuseki. This way I can experiment with FROM-to-GRAPH rewriting independently, without committing to improving the counting issue, and still have the opportunity to provide a solution as an ARQ plugin bundle that works with vanilla Fuseki in a more future-proof way. If it turns out to be of general community interest, then we can still have a discussion about integrating the rewriting at a later stage (or maybe by then you have already solved it). Yes, it would also allow one to experiment with FROM expansions, but that is not a relevant discussion now.

As it stands, it seems to me that making the projected dataset in DynamicDataset accessible would be the best solution.

Aklakan added a commit to Aklakan/jena that referenced this issue Oct 1, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 1, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 1, 2022
Aklakan changed the title from "Remove FROM (NAMED) handling from Fuseki" to "Make projected dataset in DynamicDataset accessible" Oct 1, 2022
@afs
Member

afs commented Oct 1, 2022

Outstanding questions:

  • Is the data in TDB? Or externally stored data?
  • What is the actual slow part?

There are two potential costs in GraphUnionRead:

  • it works in Nodes when there are 2+ graphs
  • it does a general DISTINCT that materializes nodes

When all the graphs are in a single TDB (either TDB1 or TDB2):

SELECT (COUNT(*) AS ?c) FROM <x> FROM <y> { ?s ?p ?o }

can be addressed by pushing the work down into TDB and not using the general-purpose GraphUnionRead. It will get the right answers. This isn't easy - something will need to change because now we might need multiple quad filters in one access. You'll need this anyway later on.

You may wish to review the requirements that Vilnis Termanis raised. Not the same but related - maybe there is a single solution.

Aklakan added a commit to Aklakan/jena that referenced this issue Oct 2, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 2, 2022
@Aklakan
Contributor Author

Aklakan commented Oct 2, 2022

  • Yes, the data is in TDB
  • As you said, if there are 2+ FROM clauses, a general DISTINCT is made, loading everything into memory. Otherwise, I suspect the reason is the overhead of materializing the NodeIds into Nodes. At least, so far I haven't spotted any other suspicious code.

can be addressed by pushing down the work into TDB

Yes, the core of this issue and the PR is about having a general path for pushing the raw FROM clauses and the raw dataset configured in the assembler from Fuseki down to ARQ. I don't think this has to be TDB-specific (e.g. a servlet specifically for TDB).

With the current interfaces, the options for doing so seem to be reasonably limited to the dataset, the query, or the context. And I agree with your argument that components beneath Fuseki should receive a dataset that per se only exposes the right set of quads w.r.t. the protocol and security, as this makes it harder for other components/plugins to accidentally leak information.

A further option I see here is having a flag in the dataset context that gets picked up by Fuseki and disables its dynamic dataset wrapping - thus passing the dataset and query on as-is. But this might again leak data when additional wrappers are involved that unintentionally pass on that context attribute.

The remaining option is to also put the projected dataset into the context, just like the named and default graphs. But I think it makes sense to have DynamicDataset be the central point for accessing this information (in the PR I added getters for it).
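(For completeness, a sketch of what that context-based variant might look like; the symbol is invented here, nothing like it exists in ARQ:)

// Hypothetical symbol - for illustration only, not an existing ARQ constant.
static final Symbol symProjectedDataset = Symbol.create("http://example/ns#projectedDataset");

// Where the DynamicDatasetGraph is built (next to symDatasetDefaultGraphs etc.):
//   dsg2.getContext().set(symProjectedDataset, baseDSG);

// Later, in an ARQ plugin:
//   DatasetGraph original = context.get(symProjectedDataset);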

This isn't easy

Yes, but - as said - I think the first step would be having an easy path from Fuseki to ARQ.

requirements that Vilnis Termanis

Which requirements? It seems you forgot to post a link.
[update]: I suppose it's about #1291.

Aklakan added a commit to Aklakan/jena that referenced this issue Oct 3, 2022
Aklakan added a commit to AKSW/jena that referenced this issue Oct 3, 2022
apacheGH-1557: Added getters to DynamicDataset to make the original construction parameters accessible
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 3, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 3, 2022
Aklakan added a commit to Aklakan/jena that referenced this issue Oct 6, 2022
afs closed this as completed in #1558 Oct 6, 2022
afs added a commit that referenced this issue Oct 6, 2022
GH-1557: Allow access to additional information from DynamicDataset