refactor semmeddb SmartAPI annotation to better represent text snippets #833

andrewsu · 2024-07-17T17:25:19Z

TMKP represents their text snippets in a way that the UI is able to display them. In contrast, for SemMedDB, the UI only displays the first sentence of the abstract. More analysis on what BTE is doing is in NCATSTranslator/Feedback#625 (comment), and the TMKP solution is described in NCATSTranslator/Feedback#625 (comment).

rjawesome · 2024-07-22T22:27:20Z

I believe this can be done with a JQ wrap template applied on SemMedDB.

andrewsu · 2024-07-22T22:34:05Z

Great idea, @rjawesome. Though hold off on working on this for a moment. @colleenXu had a chat earlier today, and they are getting some further clarifications on that structure and how the UI consumes it. But the jq templates do seem like a good option when it comes down to implementation!

colleenXu · 2024-08-21T07:14:44Z

@rjawesome @tokebe

For BioThings SEMMEDDB, we want to post-process some of the sub-query response data into a special TRAPI format (sentence/publication info).

Example SEMMEDDB data

https://biothings.ci.transltr.io/semmeddb/association/C0043481-STIMULATES-4780 (where this comes from)

We want to post-process each element in the predication array: keeping the sentence, pmid, predication_id for each element together.

Note:

This association has 9 sentences (predication_count) from 6 publications (pmid_count)
Two of the sentences are duplicates (predication.predication_id 171149564 and 171149565 for pmid 23868099)
Additionally, there are two publications with multiple sentences:
- 25994789
- 23536959

{
  "_id": "C0043481-STIMULATES-4780",
...
  "pmid_count": 6,
...
  "predication": [
    {
      "object_score": 720,
      "object_text": "Nrf2",
      "pmid": 24597671,
      "predication_id": 73680403,
      "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
      "sentence_id": 21797489,
      "subject_score": 1000,
      "subject_text": "Zn"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142544436,
      "sentence": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation.",
      "sentence_id": 263309274,
      "subject_score": 775,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142545021,
      "sentence": "The increase of intracellular free zinc may be one mechanism for Nrf2 activation.",
      "sentence_id": 263310520,
      "subject_score": 802,
      "subject_text": "zinc"
    },
    {
      "object_score": 794,
      "object_text": "Nrf2",
      "pmid": 16723490,
      "predication_id": 166335624,
      "sentence": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD).",
      "sentence_id": 309601386,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073659,
      "sentence": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta.",
      "sentence_id": 312795976,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 861,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073663,
      "sentence": "The aortic protection by zinc against diabetes-induced pathogenic changes is associated with the up-regulation of both MT and Nrf2 expression.",
      "sentence_id": 312795978,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 901,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149564,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149565,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 618,
      "object_text": "Nrf2",
      "pmid": 33198336,
      "predication_id": 190294446,
      "sentence": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations.",
      "sentence_id": 359812408,
      "subject_score": 618,
      "subject_text": "Zn"
    }
  ],
  "predication_count": 9,
...
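The counts in the note above can be double-checked with a small tally. This is only a sketch: the tuples are the `pmid`, `predication_id`, and `sentence_id` values copied from the example record, and the variable names are made up for illustration.

```python
from collections import Counter

# (pmid, predication_id, sentence_id) for each element of the example `predication` array
predications = [
    (24597671, 73680403, 21797489),
    (25994789, 142544436, 263309274),
    (25994789, 142545021, 263310520),
    (16723490, 166335624, 309601386),
    (23536959, 168073659, 312795976),
    (23536959, 168073663, 312795978),
    (23868099, 171149564, 318601901),
    (23868099, 171149565, 318601901),
    (33198336, 190294446, 359812408),
]

# How many elements each publication / sentence contributes
per_pmid = Counter(pmid for pmid, _, _ in predications)
per_sentence = Counter(sentence_id for _, _, sentence_id in predications)

predication_count = len(predications)   # total elements in the array
pmid_count = len(per_pmid)              # unique publications
multi_element_pmids = sorted(p for p, n in per_pmid.items() if n > 1)
duplicate_sentence_ids = sorted(s for s, n in per_sentence.items() if n > 1)
```

Here `multi_element_pmids` covers both the duplicated sentence (pmid 23868099) and the two publications with distinct sentences.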

For testing, this TRAPI query should only return the example data as 1 TRAPI edge

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQuerySubject": {
                    "ids": ["CHEBI:27363"],
                    "categories":["biolink:ChemicalEntity"],
                    "name": "zinc"
                },
                "creativeQueryObject": {
                    "ids": ["NCBIGene:4780"],
                    "categories":["biolink:Gene", "biolink:Protein"],
                    "name": "NFE2L2"
                }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQuerySubject",
                    "object": "creativeQueryObject",
                    "predicates": ["biolink:affects"],
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Then we'll want modified x-bte annotation

I have modifications stored on this branch.

each operation's parameter.fields: changed to grab the whole predication contents and pmid_count. Find-replace predication.pmid,predication.sentence ➡️ predication,pmid_count
response-mapping: adjust to use predication and pmid_count. Will use special key semmeddb_publication_info for predication field to signal special post-processing.
- pmid_count can be handled by existing code. Keep value as int!

Example:

    umls-obj:
      UMLS: object.umls              ## no prefix
      semmeddb_publication_info: predication        ## no prefixes on pmids
      "biolink:evidence_count": pmid_count          ## top-level field, not inside predication
      input_name: subject.name
      output_name: object.name

Then we want to format the SEMMEDDB `predication` data into TRAPI edge-attributes

Requirements:

filter the predication list:
- ONLY keep 1 element/sentence per publication if there's multiple. Okay to just pick the first one right now. This should also remove duplicate sentences (ref: Guthrie comment saying UI only plans to handle 1 sentence/publication).
- limit: only provide data for max of 50 unique sentences/publications. We'll use the pmid_count to record how many unique publications there actually are. (ref: Bill comment that text-mining is increasing to 50 sentences, Guthrie comment that UI doesn't have limit on publications shown)
Make 1 TRAPI edge-attribute for each element (now unique sentences/publications). Each will have the same attribute_type_id: so we may need to modify BTE's code to allow that in this specific case. Each will have a nested structure with sub-attributes.
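A rough sketch of the filtering requirements, assuming each element is a dict shaped like the SEMMEDDB example. The function name `filter_predications` and the constant `MAX_SNIPPETS` are hypothetical, not existing BTE code.

```python
from typing import Any

MAX_SNIPPETS = 50  # cap on unique sentences/publications from the requirements


def filter_predications(
    predications: list[dict[str, Any]], limit: int = MAX_SNIPPETS
) -> list[dict[str, Any]]:
    """Keep the first element seen for each pmid, up to `limit` elements."""
    seen_pmids: set[int] = set()
    kept: list[dict[str, Any]] = []
    for element in predications:
        if len(kept) >= limit:
            break  # cap reached; pmid_count still records the real total
        pmid = element["pmid"]
        if pmid in seen_pmids:
            continue  # later (or duplicate) sentences from the same publication
        seen_pmids.add(pmid)
        kept.append(element)
    return kept
```

Picking the first element per pmid is the "okay for now" choice from the requirements; a different selection rule (e.g. by score) would only change which element is kept for each pmid.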

Example: First element in `predication` array -> 1 TRAPI edge-attribute

The SEMMEDDB data:

    {
      "object_score": 720,
      "object_text": "Nrf2",
      "pmid": 24597671,
      "predication_id": 73680403,
      "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
      "sentence_id": 21797489,
      "subject_score": 1000,
      "subject_text": "Zn"
    },

The TRAPI edge-attribute:

predication_id ➡️ top-level value. Turn it into a string, since it's an ID!
sentence ➡️ sub-attribute biolink:supporting_text value
pmid ➡️ sub-attribute biolink:publications value. Add prefix!

      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "73680403",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:24597671"
          }
        ]
      },
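The three mapping rules above could be sketched like this. The function name `to_edge_attribute` is hypothetical; the actual implementation may well live in a jq template or in the response-transform code instead.

```python
def to_edge_attribute(element: dict) -> dict:
    """Turn one SEMMEDDB predication element into one TRAPI edge-attribute."""
    return {
        "attribute_type_id": "biolink:has_supporting_study_result",
        # predication_id is an ID, so it becomes a string
        "value": str(element["predication_id"]),
        "attributes": [
            {
                "attribute_type_id": "biolink:supporting_text",
                "value": element["sentence"],
            },
            {
                "attribute_type_id": "biolink:publications",
                # pmid is stored without a prefix, so add one here
                "value": f"PMID:{element['pmid']}",
            },
        ],
    }
```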

5 more unique publications in `predication` array -> 5 more TRAPI edge-attributes

Note that I picked the first element/sentence for the 3 cases where there are multiple sentences (PMIDs 23868099, 25994789, 23536959)

      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "142544436",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:25994789"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "166335624",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD)."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:16723490"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "168073659",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:23536959"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "171149564",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:23868099"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "190294446",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:33198336"
          }
        ]
      }

Notes:

The UI uses subattributes biolink:publications, biolink:supporting_text, biolink:subject_location_in_text, biolink:object_location_in_text (ref: Guthrie comment)
Andrew says it's fine to only provide publication ID + sentence/snippet (Not subject/object location, which we don't have. We only have subject/object matching text) (ref: Bill comment)

for https://github.com/biothings/biothings_explorer/issues/833#issuecomment-2301307891

colleenXu · 2024-10-01T06:43:38Z

@rjawesome @tokebe

I notice some edges where the evidence_count (>50) doesn't match the number of text-snippet edge-attributes (29). Maybe it's worth double-checking?

I imagine it could be accurate, if diff records had overlapping sets of publications -> merge into 1 KG edge and add the evidence_counts together. But there also might be some overwriting/loss of data?

Example 1

Send this query through your local instance (semmeddb-only): http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["CHEBI:45713"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "ids": ["NCBIGene:207"],
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

There should be 1 edge in the response. Console logs say there's 3 records involved (merged into 1 edge?).

The evidence count is 58, but there's only 29 text-snippet edge-attributes (32 edge-attributes total). I would have expected the max of 50, if those 58 were 58 unique PMIDs...

Example 2

Send this query through your local instance (semmeddb-only): http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["CHEBI:45713"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "ids": ["NCBIGene:23411"],
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

There should be 1 edge in the response. Console logs say there's 5 records involved (merged into 1 edge?).

The evidence count is 83, but there's only 29 text-snippet edge-attributes (32 edge-attributes total).

rjawesome · 2024-10-01T22:57:51Z

@colleenXu @tokebe
Note: evidence_count is a bit inaccurate. In the first example given, the records have pmid counts 22, 22, 36, but since evidence counts are stored as a set, the duplicate 22 is only included once, giving a total of 22+36=58 instead of 22+22+36=80 (if the 50 cap is removed and overlaps are accounted for, the actual total is 55). The calculation could use an array (to allow duplicate pmid counts), or just be tallied up based on the number of publication sentences in the attributes (with or without the cap).
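The arithmetic for the first example, as a tiny illustration (variable names are made up; this is not BTE's actual code):

```python
# pmid counts of the 3 records merged into 1 edge in the first example
record_pmid_counts = [22, 22, 36]

# Stored as a set: the repeated 22 collapses into a single entry
total_from_set = sum(set(record_pmid_counts))

# Stored as an array: duplicate counts are kept
total_from_array = sum(record_pmid_counts)
```

Neither total accounts for publications shared between records, which is why the real number of unique PMIDs can be lower than both.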

Changes that I pushed (biothings/bte_trapi_query_graph_handler#219):

You were correct that the last record was overriding all the other edge records; I have fixed this
I added some logic to remove duplicate PMIDs when the semmeddb sentences are being merged
I added more logic to enforce the 50 sentence cap after records have been merged (this can be removed if desired)
No changes to the evidence_count logic

tokebe · 2024-10-03T18:36:29Z

@rjawesome Decision from a meeting between myself and @colleenXu: Can you ignore the existing evidence count (@colleenXu will be removing evidence count from the response mappings) and then add special behavior to generate evidence count for Semmeddb? This would just be a straight count of PMIDs after your deduplication code.
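A minimal sketch of that behavior, assuming each record contributes a list of text-snippet edge-attributes shaped like the examples earlier in this issue. The helper names are hypothetical, and whether the 50 cap is applied before or after counting is left out here.

```python
from typing import Optional


def pmid_of(attribute: dict) -> Optional[str]:
    """Return the biolink:publications sub-attribute value, if present."""
    for sub in attribute.get("attributes", []):
        if sub.get("attribute_type_id") == "biolink:publications":
            return sub.get("value")
    return None


def merge_snippet_attributes(records_attributes: list[list[dict]]) -> list[dict]:
    """Merge snippet attributes from several records, keeping 1 per PMID."""
    merged: dict[str, dict] = {}
    for attributes in records_attributes:
        for attribute in attributes:
            pmid = pmid_of(attribute)
            if pmid is not None and pmid not in merged:
                merged[pmid] = attribute  # first snippet seen for this PMID wins
    return list(merged.values())


def evidence_count(records_attributes: list[list[dict]]) -> int:
    """Straight count of unique PMIDs after deduplication."""
    return len(merge_snippet_attributes(records_attributes))
```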

colleenXu · 2024-10-04T18:00:24Z

@rjawesome

I've just pushed updates to the override yaml to remove the biolink:evidence_count response mapping. NCATS-Tangerine/translator-api-registry@f7d558f

If you still want to see the old behavior, you can adjust the override to use the older commit's version.

rjawesome · 2024-10-04T22:27:16Z

@colleenXu @tokebe

Added a new evidence count based on the number of unique PMIDs in the sentence attributes after merging
Fixed the off-by-one error with the 50-publication cap

colleenXu · 2024-10-22T18:55:39Z

Going to revert all PRs related to this due to TRAPI edge-attribute problems found (see notes starting here). Needs a rethink

semmed sentence edge attributes api-respone-transform.js#68 ➡️ Reverted
chore: add override for semmeddb bte-server#45 ➡️ Reverted
properly merge semmedb sentences bte_trapi_query_graph_handler#219 ➡️ Reverted
do not allow edge attribtues with same type id for trapi bte_trapi_query_graph_handler#222 from Duplicated edge-attributes #891 ➡️ Reverted

(Don't revert biothings/bte_trapi_query_graph_handler#220 from #880. It's not closely related, so it's fine to keep.)

colleenXu · 2024-10-22T19:01:04Z

Messy notes:

Automat returns edge-attributes with the same type_id but diff values/info sigh -> currently only taking the first and ignoring the rest. due to Duplicated edge-attributes #891
Two paths of edge-attribute handling
- Plain response-mapping for x-bte stuff (deduplicated later)
- Response-mapping edge-attributes (x-bte) + TRAPI KP edge-attributes. Currently semmeddb text-snippet stuff is here! And changes for its handling are affecting the TRAPI KP edge-attribute handling in undesired ways...
Last record with same hash was source of edge-attributes for TRAPI response (was rewritten each time). -> only with recent semmeddb did it start "combining" aka duplications. And now we are adding a deduplication step based on attribute_type_id which is causing the current situation.
- So previous stuff was also weird (how far before?)
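To make the conflict concrete, here is a hypothetical sketch (not BTE's actual code) of a deduplication step keyed on attribute_type_id. Applied to the semmeddb text-snippet attributes, which all share one type id, it keeps only the first.

```python
def dedupe_by_type_id(attributes: list[dict]) -> list[dict]:
    """Keep only the first edge-attribute seen for each attribute_type_id."""
    seen: set[str] = set()
    kept: list[dict] = []
    for attribute in attributes:
        type_id = attribute["attribute_type_id"]
        if type_id in seen:
            continue  # same type id, different value: dropped
        seen.add(type_id)
        kept.append(attribute)
    return kept


# Three snippet attributes from the example earlier in this issue (values only)
snippets = [
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "73680403"},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "142544436"},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "166335624"},
]
remaining = dedupe_by_type_id(snippets)
```

Only the first snippet survives, so any fix for duplicated KP edge-attributes that works this way would need a carve-out for attributes that are meant to repeat.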

colleenXu · 2024-10-23T18:57:18Z

Requirements note:
Guthrie says UI uses top-level biolink:publications edge attribute (not included in original requirements gathering), but it isn't required.

We may want to rewrite the requirements anyways to make clear what we want the record-merging / evidence_count behavior to be.

rjawesome self-assigned this Jul 22, 2024

colleenXu referenced this issue in NCATS-Tangerine/translator-api-registry Aug 21, 2024

biothings semmeddb: adjust fields, response-mapping for post-processing

5f7ac52

for https://github.com/biothings/biothings_explorer/issues/833#issuecomment-2301307891

rjawesome mentioned this issue Aug 25, 2024

semmed sentence edge attributes biothings/api-respone-transform.js#68

Merged

This was referenced Sep 25, 2024

Semmeddb edge-attributes refactor NCATS-Tangerine/translator-api-registry#158

Closed

chore: add override for semmeddb biothings/bte-server#45

Merged

rjawesome mentioned this issue Oct 1, 2024

properly merge semmedb sentences biothings/bte_trapi_query_graph_handler#219

Merged

tokebe mentioned this issue Oct 3, 2024

Store attribute values as array instead of set #880

Open

colleenXu added the "On CI" label (Related changes are deployed to CI server) Oct 18, 2024

This was referenced Oct 18, 2024

store edge attributes as arrays, convert to set later if needed biothings/bte_trapi_query_graph_handler#220

Merged

Duplicated edge-attributes #891

Closed

colleenXu added the "bug" label (Something isn't working) and removed the "On CI" label (Related changes are deployed to CI server) Oct 23, 2024

colleenXu added the "next phase" label (for future if we're funded) and removed the "bug" label (Something isn't working) Oct 24, 2024

colleenXu unassigned rjawesome Oct 24, 2024
