Make projected dataset in DynamicDataset accessible #1557
Comments
This is not a good idea. There is no guarantee the downstream query engine supports dynamic datasets. Just because the main implementations do is not a contract. For example, union graph handling is different. Putting all transformation in one place is better. That is what clearing the query is doing - signalling that there is no further work. The correct way is to add a custom SPARQL_Processor (inherit from SPARQL_QueryDataset) if you want different behaviour. I don't see why you want to intercept the query - why send the wrong query in the first place?
DynamicDataset work looks reversible.
But it is still the case that some day you will want to intercept Fuseki query processing - working with a collection of ARQ tweaks is restricting. Changing things one-by-one is more work overall, and isn't guaranteed to be possible. Put the extension in and have a single place to alter the system behaviour and make life easier for your group.
### Performance

The performance degradation introduced by using FROM clauses on our large datasets varies from "noticeably slower" via "unbearably slow" to "doesn't work anymore". For example, the aforementioned counting of triples no longer returned results. It's certainly an ugly tradeoff - but between "strictly standard-conforming result set but unbearably slow" and "usually the same result set (unless there are duplicates in the union'd graphs) in a reasonable amount of time", the latter is the practically relevant one our partners want to see. In particular, we know that our graphs are disjoint.

### Upcoming Use Case: Graph Groups

Furthermore, a general requirement we now have in one of our projects is that queries need to be able to refer to graph groups. It is very easy to implement these kinds of features at the ARQ level. All that would be needed is a way to pass the original dataset graph and dataset descriptions from Fuseki down to ARQ. Of course, things would have to be wired up in more sophisticated ways with Fuseki if one wanted to add user-sensitive rules, e.g. when authenticated as user X then graph G maps to (A, B), and as user Y graph G maps to C. But that's not (yet) what we need.
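For illustration, such a graph-group rewrite would expand a group IRI in the dataset clause into its member graphs before execution (the group IRI and member graph IRIs below are invented for the sketch, not part of the PR):

```sparql
# Incoming query, referring to a hypothetical graph group
SELECT * FROM <urn:example:group:G> WHERE { ?s ?p ?o }
```

```sparql
# After expansion, assuming group G maps to graphs A and B
SELECT * FROM <urn:example:graph:A> FROM <urn:example:graph:B> WHERE { ?s ?p ?o }
```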
The wrapped original dataset (the "projected" attribute) is not accessible :( (unless using brittle reflection).

```java
public static class DynamicDatasetGraph extends DatasetGraphReadOnly implements DatasetGraphWrapperView {
    private final DatasetGraph projected;

    public DynamicDatasetGraph(DatasetGraph viewDSG, DatasetGraph baseDSG) {
        super(viewDSG, baseDSG.getContext().copy());
        this.projected = baseDSG;
    }
}
```
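To show why the reflection route is brittle: a minimal, self-contained sketch of reading a private field by name (the `Holder` class and its `projected` field are stand-ins invented here, not Jena's actual classes):

```java
import java.lang.reflect.Field;

public class ReflectionSketch {
    // Stand-in for a wrapper class with an inaccessible field.
    static class Holder {
        private final String projected = "baseDSG";
    }

    public static void main(String[] args) throws Exception {
        Holder holder = new Holder();
        // Look up the private field by its string name - this silently breaks
        // if the field is ever renamed, and can fail outright under strong
        // module encapsulation.
        Field field = Holder.class.getDeclaredField("projected");
        field.setAccessible(true);
        System.out.println(field.get(holder)); // prints "baseDSG"
    }
}
```

A public getter on the wrapper avoids all of this, which is what the PR proposes.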
That would be a small change if it does not break
Any approach that'd allow us in the future to just drop in our plugin-jar-bundle into a vanilla Fuseki and by doing so also have these dataset-description-query-rewriting features ready for dispatch via assembler configuration is very appreciated! |
I just realised that with SERVICE requests one only gets the Op representation, which means the dataset description is lost. Therefore, an unwrappable DynamicDatasetGraph might be the more powerful/flexible approach.
Have you considered GRAPH - which is what other systems use? It is also in the algebra. Now, saying there is a bigger agenda makes it impossible to assess incremental changes, because there is now every chance they will become irrelevant and, if across a release, tech debt. The Jena project has no idea who "your partners" are nor what their agenda is. There is a public community. That is all. Use cases and requirements become out of date. Adding features in the core system that then don't get used or aren't understood is technical debt. Just keeping the codebase going takes a level of resource.
That depends on the datasets. What is the real cause?
My intent in mentioning "graph groups" was not to wave around with "here is a bigger agenda", but rather: here is another example of a quite well-known feature that could be implemented on top of it, because the essence of it is also modifying a query's FROM/FROM NAMED clauses.
Maybe this wasn't clear: we want to use GRAPH internally, but we don't have control over the query. It comes with FROM! If it weren't for the FROM, then Fuseki wouldn't intercept it and we could just freely rewrite it to our liking in ARQ, and we'd have saved all this discussion!
Can't you rewrite the query instead of rewriting Fuseki? :)
If only there were a SPARQL server that supported rewriting queries via custom ARQ-based query engines... Seriously, I really don't know how to get the point across that with Fuseki it's already possible to rewrite ANY query UNLESS it makes use of FROM (NAMED). I am asking to lift this restriction.
…ion parameters accessible
This reverts commit b88b006.
Altered DynamicDataset for From-As-Filter
For my remarks on making
The question was why the wrong query is being asked in the first place. You say you have no control over the query. Where is it coming from? The implication I take from all this is that the motivation is "to be like Virtuoso". That implies all Virtuoso features are replicated, not point-by-point. It'll end up going round in a loop next PR. Virtuoso is RDF 1.0. It has SQL expression evaluation semantics.
There is no exception. The operation handlers in the server for any and all functionality can be replaced. They are not special; they are the default registrations. Nothing in Fuseki is hardwired except for the dispatch process. Any query can be transformed in Fuseki. Add a custom operation implementation and handle the incoming request.
Specifically to dynamic datasets: it records the original dataset description, so all query information is available. We still don't know where the costs are coming from - only "our datasets" and "FROM", which is way too far away to point to the code in question. The fact you are counting (counting is optimized for TDB) matters.
Yes, maybe you haven't seen it, but I changed the PR to only make the private "projected" (original) dataset accessible.
That's technical debt on our side:
Let's leave it out because it adds nothing to the discussion.
Apparently I am biased, but among my peers it's a common design pattern to have some form of query (template) catalog that references the target graph with FROM (typically a single one). And this leads to the issue of slower requests in Fuseki+TDB2 due to the DynamicDataset wrapping.
If I am not mistaken, then under certain conditions FROM and GRAPH are equivalent (w.r.t. a given context; active graph and such), such as this:

```sparql
SELECT (COUNT(*) AS ?c) FROM <x> { ?s ?p ?o }
```

```sparql
SELECT (COUNT(*) AS ?c) { GRAPH <x> { ?s ?p ?o } }
```
As it stands, it seems to me that making the projected dataset in DynamicDataset accessible would be the best solution.
…onstruction parameters accessible
Outstanding questions:
There are two potential costs in
When all the graphs are in a single TDB (either TDB1 or TDB2):
can be addressed by pushing the work down into TDB and not using the general purpose

You may wish to review the requirements that Vilnis Termanis raised. Not the same, but related - maybe there is a single solution.
Yes, the core of this issue and the PR is about having a general path for pushing the raw FROM clauses and the raw dataset configured in the assembler from Fuseki down to ARQ. I don't think this has to be TDB-specific (e.g. a servlet specifically for TDB). With the current interfaces, the options for doing so seem to be reasonably limited to the dataset, the query or the context.

I agree with your argument that components beneath Fuseki should receive a dataset that per se only exposes the right set of quads w.r.t. the protocol and security, as this makes it harder for other components/plugins to accidentally leak information.

Another option I see here is having a flag in the dataset context that gets picked up by Fuseki and disables its dynamic dataset wrapping - thus passing the dataset and query on as-is. But this might again leak data when additional wrappers are involved which incorrectly pass on that context attribute. The remaining option is to also put the projected dataset into the context, just like the named and default graphs. But I think it makes sense to have DynamicDataset as the central point for accessing this information (in the PR I added getters for them).
Yes, but - as said - I think the first step would be having an easy path from Fuseki to ARQ.
apacheGH-1557: Added getters to DynamicDataset to make the original construction parameters accessible
GH-1557: Allow access to additional information from DynamicDataset
Version
4.7.0-SNAPSHOT
Feature
As a follow-up to the performance issue encountered with "Counting all triples performance and named graphs", I created an assembler + syntax transform that pushes FROM (NAMED) into filters over graphs, such as illustrated below:
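A sketch of what such a FROM-as-filter transform could look like (the graph IRIs are invented for illustration; the exact rewrite the PR's transform produces may differ):

```sparql
# Incoming query as received by the server
SELECT (COUNT(*) AS ?c) FROM <urn:example:graph:A> FROM <urn:example:graph:B>
WHERE { ?s ?p ?o }
```

```sparql
# After the (non-standard) rewrite: no dataset merge is built; instead,
# graphs are matched directly and restricted by IRI
SELECT (COUNT(*) AS ?c)
WHERE {
  GRAPH ?g { ?s ?p ?o }
  FILTER(?g IN (<urn:example:graph:A>, <urn:example:graph:B>))
}
```

This skips the RDF merge and duplicate elimination the spec requires, which is exactly the trade-off discussed below.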
Of course this is a non-standard interpretation, because it misses the RDF merge / DISTINCT operations mandated by the SPARQL spec. However, for large datasets, having the possibility to explicitly enable this interpretation via a custom assembler backed by a custom (QueryEngineFactoryFromAsFilter, DatasetGraphFromAsFilter) pair seemed reasonable.

The issue is that Fuseki clears the FROM (NAMED) clauses in SPARQL_QueryDataset.
So my QueryEngineFactoryFromAsFilter never gets a chance to perform the rewriting. The relevant snippet is:

My impression is that DynamicDatasets.dynamicDataset is NOT needed at this place. As I expected, I see that QueryEngineBase and QueryEngineTDB already call DynamicDatasets - so is there a reason to also have this in Fuseki? I locally changed the method to only push the protocol graphs (if given) into the query (and thus let the QueryEngineFactories handle the rest), and there are no failing tests.
Are you interested in contributing a solution yourself?
Yes