Playing with your ROBOKOP instance, I noticed two things and would like to suggest some improvements for increased provenance transparency.
Use `primary_knowledge_source` only ever for the very original source
Let's look at a query like this:
```
MATCH (n:`biolink:Gene` {id: 'NCBIGene:5979'})-[r]-(d:`biolink:Disease` {id: 'DOID:0050430'})
RETURN n,r,d LIMIT 25
```
For the exact same association, you get a wild mix of primary_knowledge_source and aggregator_knowledge_source. I believe, hard as it is, that for aggregation of scientific evidence it is critical that the very original source making a statement is the primary_knowledge_source, and that all downstream sources are only ever mentioned as aggregator_knowledge_source. In particular, "infores:monarchinitiative" should (at least if I am not mistaken, given "infores:hpo-annotations", cc @kevinschaper?) never appear as a primary_knowledge_source. The same goes for pharos.
Neither should, and this is IMO very important, infores:ubergraph. Here it is critical that every single integrated edge gets infores:sourceontology (e.g. infores:uberon) for maximum transparency. Again, it would be great if the edge could somehow point to a version of the knowledge source (asserted_id: [ubergraph2023-01-01, pharos2024-03-4, monarch2024-03-04], etc), but I don't know how that's done in Biolink. This is not just about provenance. It is also about attribution: we want to make sure that when we deliver high-impact KGs like ROBOKOP to the science world, everyone knows that "wow, uberon really made a difference to beef up the context for our node embeddings". This is only possible if we add that info on every single edge.
Either never or always aggregate "aggregator_knowledge_source"
Right now we have a mix of cases, like in the query above. The advantage of "always aggregate" is you can see immediately how well an edge is supported in the graph (how many aggregators have deemed it trustworthy). On the other hand, there is a risk of not being able to adequately integrate association metadata if it diverges across resources. I don't know the right answer to this, but in order to recommend preprocessing for ML tools (should the number of edges between two nodes matter?) I believe this has to be done consistently.
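To make the consistency question concrete, here is a minimal Python sketch of how one might check, per node pair, how many edges exist and whether aggregator information is present on some edges but not others. The edge records, the `EX:` identifiers, the `infores:example-*` names, and the `summarize_pairs` helper are all hypothetical; only the two property names come from the discussion above.

```python
from collections import defaultdict

# Hypothetical edge records: identifiers and sources are made up for illustration.
edges = [
    {"subject": "EX:gene1", "object": "EX:disease1",
     "primary_knowledge_source": "infores:example-source-a",
     "aggregator_knowledge_source": ["infores:example-aggregator"]},
    {"subject": "EX:gene1", "object": "EX:disease1",
     "primary_knowledge_source": "infores:example-source-b",
     "aggregator_knowledge_source": ["infores:example-aggregator"]},
    {"subject": "EX:disease1", "object": "EX:gene1",
     "primary_knowledge_source": "infores:example-source-c",
     "aggregator_knowledge_source": []},
]


def summarize_pairs(edges):
    """Group edges by unordered node pair and summarize their provenance."""
    summary = defaultdict(lambda: {
        "edge_count": 0,
        "primary_sources": set(),
        "aggregators": set(),
        "edges_without_aggregator": 0,
    })
    for edge in edges:
        # Ignore direction, like the undirected pattern in the query above.
        pair = tuple(sorted((edge["subject"], edge["object"])))
        entry = summary[pair]
        entry["edge_count"] += 1
        entry["primary_sources"].add(edge["primary_knowledge_source"])
        aggregators = edge.get("aggregator_knowledge_source") or []
        entry["aggregators"].update(aggregators)
        if not aggregators:
            entry["edges_without_aggregator"] += 1
    return dict(summary)


def has_mixed_aggregation(entry):
    """True if some, but not all, edges of a pair carry an aggregator."""
    return 0 < entry["edges_without_aggregator"] < entry["edge_count"]


pair_summary = summarize_pairs(edges)
mixed_pairs = [pair for pair, entry in pair_summary.items()
               if has_mixed_aggregation(entry)]
```

A report like `mixed_pairs` would tell an ML user whether the raw edge count between two nodes can be used as a support signal or needs normalizing first.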
Otherwise looks great! You have 42 knowledge sources, and all of them appear in the infores registry, which is awesome!
Thanks @matentzn, it's great to have more eyes on this.
For point one, that is certainly the intention. For the query you shared, I'm not sure those are all the exact same association. In fact, the reason there are so many edges that look similar is because they came from different primary knowledge sources, even when they came through the same aggregator. The edge merging algorithm in ORION uses the primary knowledge source as part of the criteria for determining whether two edges are the same, so edges with different primary knowledge sources are always kept separate.
uniprot -> pharos -> robokop
monarchinitiative -> pharos -> robokop
eram -> pharos -> robokop
ctd -> pharos -> robokop
This means that pharos has something about this kind of relationship from all of these underlying sources, as separate database entries. However, it appears we may have an issue with pharos because it is an aggregator of aggregators. Monarchinitiative is used because pharos has "Monarch" as the source database, and unfortunately, I don't think pharos provides the true primary source for those. Same for edges from CTD -> pharos.
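To illustrate why those edges stay separate, here is a minimal Python sketch of a merge step that includes the primary knowledge source in the merge key. This is not ORION's actual code: the `merge_edges` function, the exact key fields, the choice to union aggregator lists, and the sample records are assumptions made for illustration. The only detail taken from the description above is that the primary knowledge source is part of the criteria.

```python
def merge_edges(edges):
    """Merge edges that share subject, predicate, object and primary source.

    Assumption: aggregator lists of merged edges are unioned, keeping order.
    """
    merged = {}
    for edge in edges:
        key = (
            edge["subject"],
            edge["predicate"],
            edge["object"],
            edge["primary_knowledge_source"],
        )
        aggregators = edge.get("aggregator_knowledge_source", [])
        if key not in merged:
            # Copy so the input records are not mutated.
            merged[key] = {**edge, "aggregator_knowledge_source": list(aggregators)}
        else:
            existing = merged[key]["aggregator_knowledge_source"]
            for aggregator in aggregators:
                if aggregator not in existing:
                    existing.append(aggregator)
    return list(merged.values())


# Hypothetical records: identifiers, predicate and sources are made up.
sample_edges = [
    {"subject": "EX:gene1", "predicate": "EX:related_to", "object": "EX:disease1",
     "primary_knowledge_source": "infores:example-source-a",
     "aggregator_knowledge_source": ["infores:example-aggregator-1"]},
    {"subject": "EX:gene1", "predicate": "EX:related_to", "object": "EX:disease1",
     "primary_knowledge_source": "infores:example-source-a",
     "aggregator_knowledge_source": ["infores:example-aggregator-2"]},
    {"subject": "EX:gene1", "predicate": "EX:related_to", "object": "EX:disease1",
     "primary_knowledge_source": "infores:example-source-b",
     "aggregator_knowledge_source": ["infores:example-aggregator-1"]},
]

merged_edges = merge_edges(sample_edges)
```

In this sketch the first two records collapse into one edge with two aggregators, while the third remains a separate edge because its primary source differs.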
We do have pharos as the primary knowledge source for some edges, but only where the real primary source could not be determined or in edge cases we never handled. Those should be considered mistakes or TBD (of course it is still helpful to identify and correct these cases).
For ubergraph, let's loop in @balhoff, but I'm under the impression that many (most?) of the edges from Ubergraph actually are generated by Ubergraph in a way where it should be considered the primary source. It is not simply aggregating knowledge from other sources, but generating edges using techniques like logical entailment.
Re: including versions of sources, that's definitely a good idea. ORION tracks source versions in graph metadata as best as it can (many sources do not provide real version identifiers), but does not include any of that inside the graph.
For your second point, I'm not sure I understand. We ingest some sources that are the primary knowledge source for their data, and some sources that are aggregators already. So to properly track that we necessarily have some edges with an aggregator knowledge source and some without. Maybe I'm misunderstanding what you mean though.
These were my two cents!
cc @marcello-deluca