adjust SmartAPI yaml, x-bte annotation for Biolink/Monarch API migration #774

Closed
colleenXu opened this issue Jan 17, 2024 · 15 comments
Labels: data source, On Test (related changes are deployed to Test server), x-bte

Comments

@colleenXu
Collaborator

colleenXu commented Jan 17, 2024

EDIT: see below for update, actually migrating to v3 https://api-v3.monarchinitiative.org/v3/docs#/

We are using Biolink/Monarch API v1, which will soon be shut down and replaced by v2: http://api-v2.monarchinitiative.org/api.

So we'll want to adjust the SmartAPI yaml using the v2's swagger spec + adjust the x-bte annotation if needed.

What's unclear at the moment:

colleenXu added a commit to NCATS-Tangerine/translator-api-registry that referenced this issue Jan 19, 2024
@colleenXu
Collaborator Author

colleenXu commented Jan 19, 2024

Jackson @tokebe noticed some increased request failures, so I updated the SmartAPI yaml / registration to use the v2 server url (see lab Slack convo). We'll monitor to see if there's any improvement.

  • I checked every x-bte operation and didn't notice any issues with migrating to v2 - so it seems like the endpoints / response-format were the same.
  • However, it wasn't clear to me if there was a speed boost when using v2:
    • when directly querying the APIs, v2 did appear to respond faster than v1, but I found it hard to test (v1 would be faster if v2 was run first? later queries would be similar and fast, maybe due to caching?)
    • my BTE local using v2 seemed to run similarly/slower than BTE-dev (which was using v1 at the time). And both seemed to run significantly slower than the direct queries. So I suspect that the post-processing is taking most of the time for the sub-query.

Potential queries for directly comparing v1 and v2:

@kevinschaper

Hi @colleenXu,

We're shutting down api.monarchinitiative.org, and our new production api is served from api-v3.monarchinitiative.org. As a transition to let people know that api.monarchinitiative.org is going away, we're planning to put a message up on that host but continue to make it available on another hostname - we picked api-v2 for that, but unfortunately it does make total sense that it would appear to be the replacement.

The v3 api format is different. The good news is that we should be better able to address performance problems (within limits). The v3 api is served from the new core graph, which is built on the biolink data model with new ingests.

Side note, I'm actually not seeing any direct gene expression for spinal cord or pancreas in the new graph:

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0001264&direct=true

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0002240&direct=true

I created an issue for specifying the subject/object taxon, and a second issue to look at our gene expression ingests.

@colleenXu
Collaborator Author

colleenXu commented Jan 31, 2024

[EDITED w/ updated info]

Latest info on the Biolink/monarch migration to v3 https://api-v3.monarchinitiative.org/v3/docs#/:

  • begins Feb 7 with shutdown of old v1 service https://api.monarchinitiative.org/api/
  • the API version we currently use (v2) should still be available until March 20
    • it's the same as the http://api-biolink.monarchinitiative.org in that blog post
    • info from Kevin Schaper (Translator Slack link): "Either one is ok, they're just DNS entries for the same VM - I added api-biolink.mi.org because I realized that it made total sense to assume that api-v2 comes after api, and I wanted to avoid that confusion."
  • by March 20, we need to be fully using v3 for all instances

So the next steps are:

  • I get a simple SmartAPI yaml + x-bte annotation done for the new version v3, with like 1-2 operations written -> DONE Jan 31
  • hand it off to Jackson @tokebe with some example raw queries / what data we'd like to pull out of it, so they can start working on the api-response-transform changes -> DONE

@colleenXu
Collaborator Author

colleenXu commented Feb 2, 2024

Notes

On writing SmartAPI yaml

  • using their OpenAPI spec (downloaded 2/20, converted json ➡️ yaml) as a starting point
  • made several changes so SmartAPI editor could validate the yaml
    • downgraded to OpenAPI 3.0.3 (SmartAPI editor doesn't support 3.1)
    • added servers section
    • commented out lines that the editor said had errors:
      • type: 'null'
      • examples
  • commented out the / endpoints (not necessary?)
  • wrote parts of the info section (contact, description, termsOfService url, title) using information in the paper and on the website
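The manual edits above (downgrading the version, adding a servers section, dropping `type: 'null'` and `examples`) could in principle be scripted. The following is only a minimal sketch of that idea, not what was actually done here (the lines were commented out by hand). `downgrade_spec` and the server URL passed to it are hypothetical, and it assumes the spec has already been loaded into a Python dict.

```python
NULL_TYPE = {"type": "null"}  # OpenAPI 3.1 way of marking a schema as nullable


def _strip(node):
    """Return a copy of `node` without `examples` keys or {"type": "null"} entries."""
    if isinstance(node, dict):
        # Caveat: this also drops a schema property that happens to be named "examples".
        return {k: _strip(v) for k, v in node.items() if k != "examples"}
    if isinstance(node, list):
        return [_strip(v) for v in node if v != NULL_TYPE]
    return node


def downgrade_spec(spec, server_url):
    """Hypothetical helper: make a 3.1 spec dict acceptable to a 3.0.3-only editor."""
    out = _strip(spec)
    out["openapi"] = "3.0.3"
    # Only add a servers section when the spec does not already have one.
    out.setdefault("servers", [{"url": server_url}])
    return out
```

Like the manual approach, this loses the nullability information rather than translating it, so it is only suitable as a starting point for validation.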

Querying the v3 API

  • (old note: the association endpoint no longer has a download parameter) setting the download parameter to false often didn't work: I'd still be prompted to download the response as a file. Omitting the download parameter entirely seemed to work best
  • using the association endpoint after feedback from Kevin Schaper (Translator Slack post)
  • these are GET queries, so only 1 input ID at a time (not batch)
  • each query returns only 500 items
    • I encountered error 500 (Internal Server Error) when trying to set the limit parameter to > 500 or to -1 (which worked w/ the old API to return all hits)
    • we'd need code changes to support "scrolling" GET queries to get all the items (involving the offset parameter and total field in the response)
  • useful stuff for finding examples, possible associations:
    • entity/{id}: can see what kinds of things are connected to that input ID (association_counts field)
    • AssociationCategory enum in openapi spec
    • AssociationPredicate enum in openapi spec
    • EntityCategory enum in openapi spec
  • old data/operations that are no longer available, but were in the v1/v2 API (keeping these v1 links as examples, but they're broken now that the v1 API has been shut down)
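The "scrolling" idea mentioned above (advance the offset parameter until the total field is reached) might look roughly like this. It is a sketch under assumptions: the hits are assumed to sit under an `items` key, `fetch_page` is a hypothetical callable that performs one GET and returns the parsed JSON, and `fake_page` only stands in for the real API so the loop can be run offline.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 500  # larger limits returned error 500 in the tests described above


def fetch_all_items(fetch_page: Callable[[int, int], Dict],
                    page_size: int = PAGE_SIZE) -> List[dict]:
    """Collect every hit by advancing `offset` until `total` is reached."""
    items: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        batch = page.get("items", [])  # "items" key is an assumption
        items.extend(batch)
        offset += len(batch)
        # Stop when everything is collected; an empty page guards against a stale total.
        if not batch or offset >= page.get("total", 0):
            return items


def fake_page(offset: int, limit: int) -> Dict:
    """Stand-in for one GET request: 1200 fake hits, served in slices."""
    data = [{"id": i} for i in range(1200)]
    return {"items": data[offset:offset + limit], "total": len(data)}


all_items = fetch_all_items(fake_page)  # three requests: 500 + 500 + 200 hits
```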

(@kevinschaper and any others working on the Monarch API may find this post interesting)

colleenXu referenced this issue in NCATS-Tangerine/translator-api-registry Feb 6, 2024
see https://github.com/biothings/biothings_explorer/issues/774#issuecomment-1923328949 for reasons to use association endpoint
now covering all the operations written for old api (that are still available)
@colleenXu
Collaborator Author

colleenXu commented Feb 6, 2024

Jackson @tokebe:

I changed the x-bte annotation to use the associations endpoint:

So now the post-processing is different, but hopefully simpler...

STILL NEED:

  • publications: same post-processing as before.
Publication info from old comment

B. Publications

For now: within an item/hit, only keep elements in the publications field array that have the prefix PMID. These will be in the format PMID:24468074.

I've noticed other kinds of elements like:

  • OMIM curies
  • orphanet curies

Also, there's a publications_links field but we may need special logic to decide when to use the publications_links.id (for PMID) vs publications_links.url (for other kinds of references?).
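As a rough illustration of the "keep only PMID elements" rule, here is a minimal sketch. `keep_pmids` is a hypothetical name, and the non-PMID curies in the usage line are made-up placeholders rather than values copied from a real response.

```python
def keep_pmids(publications):
    """Keep only the entries of a hit's publications array that look like PMID curies."""
    return [
        pub for pub in (publications or [])  # tolerate a missing/null field
        if isinstance(pub, str) and pub.startswith("PMID:")
    ]


# Placeholder OMIM / orphanet style entries are dropped, the PMID is kept.
pmids = keep_pmids(["PMID:24468074", "OMIM:000000", "ORPHA:0000"])
```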

DON'T NEED:

  • Now input ID should exactly match the subject or object field, so we don't need to check/filter.
    • the input ID is explicitly set as the subject or object in the query parameters (couldn't do that with the entity endpoint)
    • the query parameters are set to direct edges only (direct: true) - so API shouldn't do any ontology-traversal/expansion to the input ID. Example: lots of hits for autosomal dominant cerebellar ataxia if direct: false, but none if direct: true
  • checking that output namespace matches subject_namespace or object_namespace field (depending on direction)
    • now using the query parameters to explicitly set the namespaces of the subject and object field's IDs.
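To make the "explicit subject/object plus direct edges only" setup concrete, here is a sketch of how such a query URL could be assembled. `association_url` is a hypothetical helper. The parameter names (category, subject, object, predicate, direct, format, limit, offset) are the ones visible in the association URLs quoted in this thread; the namespace parameters are left out because their exact names are not given here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-v3.monarchinitiative.org/v3/api/association"


def association_url(category, predicate, subject=None, object_=None,
                    limit=500, offset=0):
    """Build a direct-edges-only association query for one input ID."""
    params = {"category": category}
    if subject is not None:
        params["subject"] = subject
    if object_ is not None:
        params["object"] = object_
    params["predicate"] = predicate
    # direct=true asks the API not to expand the input ID through the ontology.
    params.update({"direct": "true", "format": "json",
                   "limit": limit, "offset": offset})
    # safe=":" keeps curies such as HGNC:11138 readable instead of %3A-escaped.
    return BASE_URL + "?" + urlencode(params, safe=":")


url = association_url("biolink:CausalGeneToDiseaseAssociation", "biolink:causes",
                      subject="HGNC:11138", limit=10)
```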

@colleenXu
Collaborator Author

colleenXu commented Feb 7, 2024

[EDITED to add info on what we learned / addressed while working on the API post-processing]

Update

The basic set of updates is done:

  • SmartAPI yaml w/ x-bte annotation covers all the association-types we covered in the old API that are still available in the new v3 API
  • tested all operations w/ Jackson's updated post-processing (PR, based on my comment above), and all are working as-expected

Working on

Jackson @tokebe and I discussed the following, and they're going to try it out: post-process the primary_knowledge_source and aggregator_knowledge_source response fields into a new, custom field formatted as a TRAPI edge sources array (array of objects). BTE can then ingest it with the same response-mapping key (trapi_sources) that it uses for the Multiomics/Text-Mining APIs.

Example

first hit in https://api-v3.monarchinitiative.org/v3/api/association?category=biolink:CausalGeneToDiseaseAssociation&subject=HGNC:11138&predicate=biolink:causes&direct=true&format=json&limit=10&offset=0

A. "primary_knowledge_source": "infores:omim" (value of this field is always a string: infores curie)
➡️ element for TRAPI sources array

{ 
    "resource_id": "infores:omim", 
    "resource_role": "primary_knowledge_source"
}

B. "aggregator_knowledge_source": ["infores:monarchinitiative", "infores:medgen"]. Value of this field is always an array of string infores-curies, in order from furthest to closest to the primary source. So medgen has omim (the primary source) as its upstream.
➡️ >=1 elements for TRAPI sources array

{ 
    "resource_id": "infores:medgen", 
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": ["infores:omim"]
},
{ 
    "resource_id": "infores:monarchinitiative", 
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": ["infores:medgen"]
},

Putting this together: create a new, custom field with the TRAPI sources array

{
    "sources": [
        { 
            "resource_id": "infores:omim", 
            "resource_role": "primary_knowledge_source"
        },
        { 
            "resource_id": "infores:medgen", 
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:omim"]
        },
        { 
            "resource_id": "infores:monarchinitiative", 
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:medgen"]
        }
    ]
}
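The mapping in A and B can be sketched as a small function. This is an illustration only, not the code in the post-processing PR: `build_trapi_sources` is a hypothetical name, and it relies on the ordering assumption stated above (aggregators listed from furthest to closest to the primary source).

```python
def build_trapi_sources(primary, aggregators):
    """Turn the two provenance fields of one hit into a TRAPI sources array."""
    sources = [{"resource_id": primary,
                "resource_role": "primary_knowledge_source"}]
    upstream = primary
    # Walk from the aggregator closest to the primary source outwards,
    # so each aggregator points at the resource it pulled the data from.
    for aggregator in reversed(aggregators):
        sources.append({
            "resource_id": aggregator,
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": [upstream],
        })
        upstream = aggregator
    return sources


example = build_trapi_sources(
    "infores:omim", ["infores:monarchinitiative", "infores:medgen"])
```

For the hit above, `example` holds the omim, medgen and monarchinitiative objects in that order, matching the combined array.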

implementation notes

  • commented out x-bte operation source field: BTE was ignoring this info because it is using the post-processed sources info instead (from response-mapping trapi_sources)
  • the aggregator knowledge source array is in a meaningful order (ref: Kevin Schaper, Translator Slack link). We're therefore assuming that the array is in order from furthest -> closest to primary source, so we can include upstream-resource-id info in the source objects
    • ex: bte/service provider ➡️ monarchinitiative (1st aggregator entry) ➡️ medgen (2nd aggregator entry) ➡️ omim (primary).
  • found examples of same subject/predicate/object but different provenance (using biogrid vs string) -> our decision is that records/hits should only be merged if they have the exact same provenance. Handled with biothings/api-respone-transform.js@3534b23
Example showing this

Send the following TRAPI query to Monarch API only, through BTE:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["HGNC:7551"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

BTE should make the following requests:

Then bundle these into two Edges: 1 for biogrid and 1 for string

                "313161c093025842c0f60162954b3340": {
                    "predicate": "biolink:interacts_with",
                    "subject": "NCBIGene:4607",
                    "object": "NCBIGene:84676",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:19850579",
                                "PMID:18157088"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }
                    ],
                    "sources": [
                        {
                            "resource_id": "infores:biogrid",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biogrid"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                },
                "7e8fb0a590bff1f4fc71564d36bd2bc5": {
                    "predicate": "biolink:interacts_with",
                    "subject": "NCBIGene:4607",
                    "object": "NCBIGene:84676",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:string",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:string"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                },

A similar example is TTN (HGNC:12403 / NCBIGene:7273)
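The "only merge hits with the exact same provenance" decision can be pictured as grouping on a key that includes the provenance fields. The sketch below is an assumption-laden illustration, not the logic of the referenced commit: `merge_by_provenance` is hypothetical, and the record keys (subject, predicate, object, publications) are assumed field names for a simplified hit.

```python
def merge_by_provenance(records):
    """Merge hits that share subject/predicate/object AND identical provenance."""
    merged = {}
    for rec in records:
        key = (
            rec["subject"], rec["predicate"], rec["object"],
            rec["primary_knowledge_source"],
            tuple(rec.get("aggregator_knowledge_source", [])),
        )
        # First hit for this key becomes the edge; publications are re-collected below.
        edge = merged.setdefault(key, {**rec, "publications": []})
        for pub in rec.get("publications", []):
            if pub not in edge["publications"]:
                edge["publications"].append(pub)
    return list(merged.values())


base = {"subject": "NCBIGene:4607", "predicate": "biolink:interacts_with",
        "object": "NCBIGene:84676",
        "aggregator_knowledge_source": ["infores:monarchinitiative"]}
hits = [
    {**base, "primary_knowledge_source": "infores:biogrid", "publications": ["PMID:19850579"]},
    {**base, "primary_knowledge_source": "infores:biogrid", "publications": ["PMID:18157088"]},
    {**base, "primary_knowledge_source": "infores:string", "publications": []},
]
edges = merge_by_provenance(hits)  # one biogrid edge (two PMIDs) and one string edge
```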

@colleenXu
Collaborator Author

colleenXu commented Feb 8, 2024

Knowledge source infores IDs used by this resource

From Kevin Schaper (Translator Slack link)

  • NOT all of these infores IDs actually exist in the infores registry (v3-monarch-nonexist-infores.txt), or they may exist and not have complete entries/xref wiki pages. For the infores IDs I've seen in the responses, the following have issues:
    • medgen: no xref to wiki page. Okay because it's an aggregator?
    • orphanet: no xref to wiki page. Is a primary source
    • biogrid: no xref to wiki page. Is a primary source
  • currently only have 2 aggregators at most
    • future changes?
      • monarchinitiative/medgen/omim line to monarchinitiative/hpo-annotations/medgen/omim
      • phenio/etc lines to monarchinitiative/phenio/etc (maybe a bug right now?).
current possible knowledge source combos on edges

| aggregator knowledge source(s) | primary knowledge source |
| --- | --- |
| infores:monarchinitiative | infores:agbase |
| infores:monarchinitiative | infores:alzheimers-university-of-toronto |
| infores:monarchinitiative | infores:aruk-ucl |
| infores:monarchinitiative | infores:bgee |
| infores:monarchinitiative | infores:bhf-ucl |
| infores:monarchinitiative | infores:biogrid |
| infores:monarchinitiative | infores:cacao |
| infores:monarchinitiative | infores:cafa |
| infores:monarchinitiative | infores:complexportal |
| infores:monarchinitiative | infores:dflat |
| infores:monarchinitiative | infores:dibu |
| infores:monarchinitiative | infores:dictybase |
| infores:monarchinitiative | infores:disprot |
| infores:monarchinitiative | infores:ensembl |
| infores:monarchinitiative | infores:flybase |
| infores:monarchinitiative | infores:gdb |
| infores:monarchinitiative | infores:go-central |
| infores:monarchinitiative | infores:go-noctua |
| infores:monarchinitiative | infores:goc |
| infores:monarchinitiative | infores:goc-owl |
| infores:monarchinitiative | infores:hgnc |
| infores:monarchinitiative | infores:hgnc-ucl |
| infores:monarchinitiative | infores:hpa |
| infores:monarchinitiative | infores:hpo-annotations |
| infores:monarchinitiative | infores:intact |
| infores:monarchinitiative | infores:interpro |
| infores:monarchinitiative | infores:lifedb |
| infores:monarchinitiative | infores:mgi |
| infores:monarchinitiative | infores:mtbbase |
| infores:monarchinitiative | infores:ntnu-sb |
| infores:monarchinitiative | infores:orphanet |
| infores:monarchinitiative | infores:panther |
| infores:monarchinitiative | infores:parkinsonsuk-ucl |
| infores:monarchinitiative | infores:phi-base |
| infores:monarchinitiative | infores:pinc |
| infores:monarchinitiative | infores:pombase |
| infores:monarchinitiative | infores:reactome |
| infores:monarchinitiative | infores:rgd |
| infores:monarchinitiative | infores:rhea |
| infores:monarchinitiative | infores:rnacentral |
| infores:monarchinitiative | infores:roslin-institute |
| infores:monarchinitiative | infores:sgd |
| infores:monarchinitiative | infores:string |
| infores:monarchinitiative | infores:syngo |
| infores:monarchinitiative | infores:syngo-ucl |
| infores:monarchinitiative | infores:syscilia-ccnet |
| infores:monarchinitiative | infores:uniprot |
| infores:monarchinitiative | infores:wb |
| infores:monarchinitiative | infores:xenbase |
| infores:monarchinitiative | infores:yubiolab |
| infores:monarchinitiative | infores:zfin |
| infores:monarchinitiative, infores:alliancegenome | infores:flybase |
| infores:monarchinitiative, infores:alliancegenome | infores:mgi |
| infores:monarchinitiative, infores:alliancegenome | infores:rgd |
| infores:monarchinitiative, infores:alliancegenome | infores:sgd |
| infores:monarchinitiative, infores:alliancegenome | infores:wormbase |
| infores:monarchinitiative, infores:alliancegenome | infores:zfin |
| infores:monarchinitiative, infores:medgen | infores:omim |
| infores:phenio | infores:HsapDv |
| infores:phenio | infores:bfo |
| infores:phenio | infores:chebi |
| infores:phenio | infores:cl |
| infores:phenio | infores:eco |
| infores:phenio | infores:emapa |
| infores:phenio | infores:envo |
| infores:phenio | infores:fao |
| infores:phenio | infores:fbbt |
| infores:phenio | infores:fma |
| infores:phenio | infores:fypo |
| infores:phenio | infores:go |
| infores:phenio | infores:hp |
| infores:phenio | infores:iao |
| infores:phenio | infores:ma |
| infores:phenio | infores:mondo |
| infores:phenio | infores:mp |
| infores:phenio | infores:mpath |
| infores:phenio | infores:nbo |
| infores:phenio | infores:ncbitaxon |
| infores:phenio | infores:obi |
| infores:phenio | infores:ogms |
| infores:phenio | infores:pato |
| infores:phenio | infores:po |
| infores:phenio | infores:pr |
| infores:phenio | infores:ro |
| infores:phenio | infores:so |
| infores:phenio | infores:uberon |
| infores:phenio | infores:upheno |
| infores:phenio | infores:wbbt |
| infores:phenio | infores:wbphenotype |
| infores:phenio | infores:xpo |
| infores:phenio | infores:zfa |
| infores:phenio | infores:zp |

@colleenXu
Collaborator Author

colleenXu commented Feb 21, 2024

@tokebe

This is now ready for deployment!

  • I've tested that our ingest/post-processing of provenance from the API is working for all operations
  • made some recent adjustments today (2/20) based on the API updates I saw (using subject/object namespace parameters, adding gene <-> anatomy operations). Retested and all is working locally.

PRs for push to Prod:

Once these are fully deployed to Prod, we can update the registered yaml (PR) and start the process of removing the override...

@colleenXu
Collaborator Author

colleenXu commented Feb 21, 2024

Notes

  • Because we are ingesting the provenance info from the external API's responses, we aren't certain of the infores values that will be in the response. This may make it tricky to ensure the infores entries/xref wiki pages are always set up. It also complicates any effort to get allowlist/denylist working
  • we don't have a list of the possible MetaEdges (combos of subject category/subject namespace/predicate/object category/object namespace)

Stuff to follow up on

Short-term

EDIT, DONE: Sierra and Kevin confirmed 2/28 that it's fine to change infores, and we could deprecate biolink-api infores...

  • double-check on the use of infores:monarchinitiative (switching to this for info.x-translator.infores) vs infores:biolink-api (what we were using before).
    • may involve adjusting the xref wiki page, to make more obvious that we are using their non-TRAPI API?
    • BTE is still using this field to set the upstream resource ID for the bte/service-provider source element...
    • We made the switch because when using biolink-api, BTE doesn't generate a source-object for it and the provenance chain would be wonky (probably because of the post-processing to instead use the API response provenance info).
Example of wonky behavior

The edge source info would look like this:

  • service-provider trapi says biolink-api is upstream of it
  • but there's no entry for biolink-api (and...then monarchinitiative should be upstream?)
  • then there's entries for monarchinitiative and its upstream sources (which include the primary). These come from post-processing the raw API response.
                    "sources": [
                        {
                            "resource_id": "infores:hpo-annotations",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:hpo-annotations"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biolink-api"
                            ]
                        }
                    ]

  • Once these are fully deployed to Prod, we can update the registered yaml (PR) and start the process of removing the override...
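The broken provenance chain in the biolink-api example above is the kind of thing a small consistency check could flag: every ID listed in upstream_resource_ids should also appear as a resource_id in the same sources array. A minimal sketch, with `dangling_upstreams` as a hypothetical helper name:

```python
def dangling_upstreams(sources):
    """Return upstream IDs that have no source object of their own."""
    known = {source["resource_id"] for source in sources}
    missing = []
    for source in sources:
        for upstream in source.get("upstream_resource_ids", []):
            if upstream not in known and upstream not in missing:
                missing.append(upstream)
    return missing


# The sources array from the example of the problem, copied as-is.
wonky_sources = [
    {"resource_id": "infores:hpo-annotations",
     "resource_role": "primary_knowledge_source"},
    {"resource_id": "infores:monarchinitiative",
     "resource_role": "aggregator_knowledge_source",
     "upstream_resource_ids": ["infores:hpo-annotations"]},
    {"resource_id": "infores:service-provider-trapi",
     "resource_role": "aggregator_knowledge_source",
     "upstream_resource_ids": ["infores:biolink-api"]},
]
problems = dangling_upstreams(wonky_sources)  # flags infores:biolink-api
```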

Longer-term?

EDIT: moving to separate issues

  • ability to do "scrolling" GET queries to get all the items (involving the offset parameter and total field in the response). Currently, we only get 500 items per input ID/query (noted in previous post "Querying the v3 API")
  • investigate the new query options? subject/object category, taxon, namespace: example
    • look at what adds coverage, is good to have as separate operations (namespace, species context?). For example, is there cell-level/organelle-level gene-expression info?
  • annotating more MetaEdges (not covered by past operations)
click to see MetaEdges

  • Chem to Pathway: unclear how helpful this is, since chemicals seem generic (water, ADP, ATP...). Example. 1 Predicate: participates_in
    • their prefix Reactome differs from what we use (REACT)...so this may require extra post-processing support
      (depends on how helpful setting the subject/object namespace is)
    • unclear if other Pathway namespaces exist
  • Gene to Pathway: previously chose not to annotate because MyGene also covers this info. Also has prefix issue (see Chem to Pathway above). 1 Predicate: participates_in
  • Gene to GO BiologicalProcess (989349 items): previously chose not to annotate because MyGene also covers this info. Each kind has multiple possible predicates, lots of diff primary knowledge sources
    • actively_involved_in (797927)
    • acts_upstream_of_or_within (180729)
    • acts_upstream_of (9327)
    • acts_upstream_of_or_within_positive_effect (507)
    • acts_upstream_of_positive_effect (506)
    • acts_upstream_of_or_within_negative_effect (178)
    • acts_upstream_of_negative_effect (175)
  • Gene to GO MolecularActivity (848151 items): see notes for BiologicalProcess above
    • enables (841330)
    • contributes_to (6821)
  • Gene to GO CellularComponent (745837 items): see notes for BiologicalProcess above
    • located_in (502225)
    • active_in (145515)
    • part_of (94049)
    • colocalizes_with (4048)
  • Gene to Gene ortholog: previously chose not to annotate because MyGene also covers this info. 1 predicate (orthologous_to, 551383 hits). Seems to be 1 primary knowledge source (panther)

colleenXu added the On CI and On Test labels and removed the On CI label on Feb 21, 2024
@colleenXu
Collaborator Author

colleenXu commented Feb 28, 2024

I've confirmed that the changes have been deployed to BTE Prod. So I've:

How I tested

We can tell that BTE is using the new v3 Monarch API by doing a test query for the gene-disease-contributesTo operation - which didn't exist in the old API. If we have edges with the contributes_to predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider), then we know that BTE is using the new SmartAPI yaml and api-response-transform code.

POST to Monarch-API-only, thru BTE: https://bte.transltr.io/v1/smartapi/d22b657426375a5295e7da8a303b9893/query


{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["HGNC:6294", "HGNC:9652"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:contributes_to"]
                }
            }
        }
    }
}

Should get this edge in the response, showing the contributes_to predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider)

                "1ff8a4f5ade3639ebd6b951ac8984627": {
                    "predicate": "biolink:contributes_to",
                    "subject": "NCBIGene:3784",
                    "object": "MONDO:0100316",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:omim",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:medgen",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:omim"
                            ]
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:medgen"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                }
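The check described above (contributes_to predicate plus the omim <- medgen <- monarchinitiative <- service provider chain) could be automated along these lines. This is a sketch that assumes a strictly linear chain with one upstream per source; `provenance_chain` is a hypothetical helper, not part of BTE.

```python
def provenance_chain(sources):
    """Order resource IDs from the most downstream source back to the primary one."""
    by_id = {s["resource_id"]: s for s in sources}
    upstream_ids = {u for s in sources for u in s.get("upstream_resource_ids", [])}
    # The most downstream source is the one that nobody lists as its upstream.
    heads = [s for s in sources if s["resource_id"] not in upstream_ids]
    if len(heads) != 1:
        raise ValueError("expected exactly one downstream-most source")
    chain = [heads[0]["resource_id"]]
    current = heads[0]
    while current.get("upstream_resource_ids"):
        next_id = current["upstream_resource_ids"][0]  # linear-chain assumption
        if next_id not in by_id or next_id in chain:
            raise ValueError(f"broken provenance chain at {next_id}")
        chain.append(next_id)
        current = by_id[next_id]
    return chain


EXPECTED_CHAIN = ["infores:service-provider-trapi", "infores:monarchinitiative",
                  "infores:medgen", "infores:omim"]


def looks_like_v3_edge(edge):
    """True if an edge matches the test described above."""
    return (edge["predicate"] == "biolink:contributes_to"
            and provenance_chain(edge["sources"]) == EXPECTED_CHAIN)
```

Running `looks_like_v3_edge` over the edges of the knowledge graph in the response and finding at least one match would indicate that the new yaml and post-processing are in use.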


BUT before closing this, I'd like to discuss "stuff to follow up on" with Jackson @tokebe first...(open new issues?)

@colleenXu
Collaborator Author

colleenXu commented Feb 29, 2024

Discussed the "stuff to follow up on" with Jackson and Sierra/Kevin (see edited post). I'll open new issues, but we're ready to close this one
