refactor semmeddb SmartAPI annotation to better represent text snippets #833

Open
andrewsu opened this issue Jul 17, 2024 · 11 comments
Open
Labels
next phase for future if we're funded

Comments

@andrewsu
Member

TMKP represents its text snippets in a way that the UI is able to display. In contrast, for SemMedDB, the UI only displays the first sentence of the abstract. More analysis of what BTE is doing is in NCATSTranslator/Feedback#625 (comment), and the TMKP solution is described in NCATSTranslator/Feedback#625 (comment).

@rjawesome rjawesome self-assigned this Jul 22, 2024
@rjawesome
Contributor

I believe this can be done with a JQ wrap template applied on SemMedDB.

@andrewsu
Member Author

Great idea, @rjawesome. Though hold off on working on this for a moment. @colleenXu had a chat about this earlier today and is getting some further clarifications on that structure and how the UI consumes it. But the jq templates do seem like a good option when it comes down to implementation!

@colleenXu
Collaborator

colleenXu commented Aug 21, 2024

@rjawesome @tokebe

For BioThings SEMMEDDB, we want to post-process some of the sub-query response data into a special TRAPI format (sentence/publication info).

Example SEMMEDDB data

https://biothings.ci.transltr.io/semmeddb/association/C0043481-STIMULATES-4780 (where this comes from)

We want to post-process each element in the predication array: keeping the sentence, pmid, predication_id for each element together.

Note:

  • This association has 9 sentences (predication_count) from 6 publications (pmid_count)
  • Two of the sentences are duplicates (predication.predication_id 171149564 and 171149565 for pmid 23868099)
  • Additionally, there are two publications with multiple sentences:
    • 25994789
    • 23536959
{
  "_id": "C0043481-STIMULATES-4780",
...
  "pmid_count": 6,
...
  "predication": [
    {
      "object_score": 720,
      "object_text": "Nrf2",
      "pmid": 24597671,
      "predication_id": 73680403,
      "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
      "sentence_id": 21797489,
      "subject_score": 1000,
      "subject_text": "Zn"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142544436,
      "sentence": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation.",
      "sentence_id": 263309274,
      "subject_score": 775,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142545021,
      "sentence": "The increase of intracellular free zinc may be one mechanism for Nrf2 activation.",
      "sentence_id": 263310520,
      "subject_score": 802,
      "subject_text": "zinc"
    },
    {
      "object_score": 794,
      "object_text": "Nrf2",
      "pmid": 16723490,
      "predication_id": 166335624,
      "sentence": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD).",
      "sentence_id": 309601386,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073659,
      "sentence": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta.",
      "sentence_id": 312795976,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 861,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073663,
      "sentence": "The aortic protection by zinc against diabetes-induced pathogenic changes is associated with the up-regulation of both MT and Nrf2 expression.",
      "sentence_id": 312795978,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 901,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149564,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149565,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 618,
      "object_text": "Nrf2",
      "pmid": 33198336,
      "predication_id": 190294446,
      "sentence": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations.",
      "sentence_id": 359812408,
      "subject_score": 618,
      "subject_text": "Zn"
    }
  ],
  "predication_count": 9,
...

For testing, this TRAPI query should only return the example data as 1 TRAPI edge

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQuerySubject": {
                    "ids": ["CHEBI:27363"],
                    "categories":["biolink:ChemicalEntity"],
                    "name": "zinc"
                },
                "creativeQueryObject": {
                    "ids": ["NCBIGene:4780"],
                    "categories":["biolink:Gene", "biolink:Protein"],
                    "name": "NFE2L2"
               }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQuerySubject",
                    "object": "creativeQueryObject",
                    "predicates": ["biolink:affects"],
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Then we'll want modified x-bte annotation

I have modifications stored on this branch.

  • each operation's parameter.fields: changed to grab the whole predication contents and pmid_count. Find-replace predication.pmid,predication.sentence ➡️ predication,pmid_count
  • response-mapping: adjust to use predication and pmid_count. Will use special key semmeddb_publication_info for predication field to signal special post-processing.
    • pmid_count can be handled by existing code. Keep value as int!

Example:

    umls-obj:
      UMLS: object.umls              ## no prefix
      semmeddb_publication_info: predication        ## no prefixes on pmids
      "biolink:evidence_count": pmid_count
      input_name: subject.name
      output_name: object.name

Then we want to format the SEMMEDDB predication data into TRAPI edge-attributes

Requirements:

Example: First element in `predication` array -> 1 TRAPI edge-attribute

The SEMMEDDB data:

    {
      "object_score": 720,
      "object_text": "Nrf2",
      "pmid": 24597671,
      "predication_id": 73680403,
      "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
      "sentence_id": 21797489,
      "subject_score": 1000,
      "subject_text": "Zn"
    },

The TRAPI edge-attribute:

  • predication_id ➡️ top-level value. Turn it into a string, since it's an ID!
  • sentence ➡️ sub-attribute biolink:supporting_text value
  • pmid ➡️ sub-attribute biolink:publications value. Add prefix!
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "73680403",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:24597671"
          }
        ]
      },

5 more unique publications in `predication` array -> 5 more TRAPI edge-attributes

Note that I picked the first element/sentence for the 3 cases where there are multiple sentences (PMIDs 23868099, 25994789, 23536959)

      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "142544436",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:25994789"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "166335624",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD)."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:16723490"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "168073659",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:23536959"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "171149564",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:23868099"
          }
        ]
      },
      {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": "190294446",
        "attributes": [
          {
            "attribute_type_id": "biolink:supporting_text",
            "value": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations."
          },
          {
            "attribute_type_id": "biolink:publications",
            "value": "PMID:33198336"
          }
        ]
      }
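
A minimal sketch of the mapping described above, written in Python for illustration only (it is not BTE's code, and the function name `predications_to_attributes` is hypothetical). It assumes each element of the `predication` array has `pmid`, `predication_id`, and `sentence` fields, keeps the first sentence seen for each PMID, stringifies the ID, and adds the `PMID:` prefix.

```python
from typing import Any


def predications_to_attributes(predications: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one TRAPI edge-attribute per unique PMID, keeping the first sentence seen."""
    seen_pmids: set[int] = set()
    attributes: list[dict[str, Any]] = []
    for pred in predications:
        pmid = pred["pmid"]
        if pmid in seen_pmids:
            continue  # only the first sentence per publication is kept
        seen_pmids.add(pmid)
        attributes.append({
            "attribute_type_id": "biolink:has_supporting_study_result",
            "value": str(pred["predication_id"]),  # IDs become strings
            "attributes": [
                {"attribute_type_id": "biolink:supporting_text", "value": pred["sentence"]},
                {"attribute_type_id": "biolink:publications", "value": f"PMID:{pmid}"},
            ],
        })
    return attributes


# First three elements of the example record (other fields omitted).
example = [
    {"pmid": 24597671, "predication_id": 73680403,
     "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function."},
    {"pmid": 25994789, "predication_id": 142544436,
     "sentence": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation."},
    {"pmid": 25994789, "predication_id": 142545021,
     "sentence": "The increase of intracellular free zinc may be one mechanism for Nrf2 activation."},
]

attributes = predications_to_attributes(example)
```

With the three elements above, the second sentence for PMID 25994789 is skipped, so two edge-attributes come out.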


Notes:

  • The UI uses subattributes biolink:publications, biolink:supporting_text, biolink:subject_location_in_text, biolink:object_location_in_text (ref: Guthrie comment)
  • Andrew says it's fine to only provide publication ID + sentence/snippet (Not subject/object location, which we don't have. We only have subject/object matching text) (ref: Bill comment)

@colleenXu
Collaborator

@rjawesome @tokebe

I notice some edges where the evidence_count (>50) doesn't match the number of text-snippet edge-attributes (29). Maybe it's worth double-checking?

I imagine it could be accurate if different records had overlapping sets of publications -> merged into 1 KG edge with the evidence_counts added together. But there also might be some overwriting/loss of data?

Example 1

Send this query through your local instance (semmeddb-only): http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["CHEBI:45713"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "ids": ["NCBIGene:207"],
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

There should be 1 edge in the response. Console logs say there's 3 records involved (merged into 1 edge?).

The evidence count is 58, but there's only 29 text-snippet edge-attributes (32 edge-attributes total). I would have expected the max of 50, if those 58 were 58 unique PMIDs...

Example 2

Send this query through your local instance (semmeddb-only): http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["CHEBI:45713"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "ids": ["NCBIGene:23411"],
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

There should be 1 edge in the response. Console logs say there's 5 records involved (merged into 1 edge?).

The evidence count is 83, but there's only 29 text-snippet edge-attributes (32 edge-attributes total).

@rjawesome
Contributor

rjawesome commented Oct 1, 2024

@colleenXu @tokebe
Note: evidence_count is a bit inaccurate. In the first example given, the records have pmid counts of 22, 22, and 36, but since the evidence counts are stored in a set, the duplicate 22 is dropped, giving a total of 22+36=58 instead of 22+22+36=80 (if the 50 cap is removed and overlaps are accounted for, the actual total is 55). The calculation could use an array (to allow duplicate pmid counts), or just be tallied up based on the number of publication sentences in the attributes (with or without the cap).
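
The arithmetic in the note can be checked with a few lines of Python. This only illustrates the set-versus-array difference using the three per-record counts quoted above; it does not reproduce BTE's code, and the overlap-aware total of 55 cannot be derived from the counts alone.

```python
pmid_counts = [22, 22, 36]  # per-record pmid counts from the first example

set_total = sum(set(pmid_counts))  # the repeated 22 collapses: 22 + 36
list_total = sum(pmid_counts)      # every record counted: 22 + 22 + 36

print(set_total, list_total)
```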

Changes that I pushed (biothings/bte_trapi_query_graph_handler#219):

  • You were correct that the last record was overriding all the other edge records; I have fixed this
  • I added some logic to remove duplicate PMIDs when the semmeddb sentences are being merged
  • I added more logic to enforce the 50 sentence cap after records have been merged (this can be removed if desired)
  • No changes to the evidence_count logic
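
As a rough illustration of the merge behavior listed above (not the code in the linked PR), the Python sketch below concatenates the sentence attributes from several records, drops later attributes whose PMID was already seen, and then applies the cap. The names `merge_snippets`, `pmid_of`, and `snippet` are hypothetical, and skipping attributes that carry no PMID is an assumption of this sketch.

```python
from typing import Any, Optional

MAX_SNIPPETS = 50  # cap on sentence attributes per merged edge


def pmid_of(attribute: dict[str, Any]) -> Optional[str]:
    """Return the PMID CURIE stored in a snippet attribute's sub-attributes, if any."""
    for sub in attribute.get("attributes", []):
        if sub.get("attribute_type_id") == "biolink:publications":
            return sub.get("value")
    return None


def merge_snippets(records: list[list[dict[str, Any]]], cap: int = MAX_SNIPPETS) -> list[dict[str, Any]]:
    """Merge snippet attributes across records, one per PMID, then enforce the cap."""
    merged: list[dict[str, Any]] = []
    seen: set[str] = set()
    for record_attributes in records:
        for attr in record_attributes:
            pmid = pmid_of(attr)
            if pmid is None or pmid in seen:
                continue  # assumption: skip duplicates and attributes without a PMID
            seen.add(pmid)
            merged.append(attr)
    return merged[:cap]  # slicing keeps at most `cap` items


def snippet(pmid: int) -> dict[str, Any]:
    """Hypothetical helper that builds a placeholder snippet attribute for the demo."""
    return {
        "attribute_type_id": "biolink:has_supporting_study_result",
        "value": f"predication-{pmid}",
        "attributes": [
            {"attribute_type_id": "biolink:supporting_text", "value": "placeholder sentence"},
            {"attribute_type_id": "biolink:publications", "value": f"PMID:{pmid}"},
        ],
    }


overlapping = merge_snippets([[snippet(1), snippet(2)], [snippet(2), snippet(3)]])
capped = merge_snippets([[snippet(i) for i in range(60)]])
```

Applying the cap with a slice after deduplication sidesteps counter-based off-by-one mistakes.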

@tokebe
Member

tokebe commented Oct 3, 2024

@rjawesome Decision from a meeting between myself and @colleenXu: Can you ignore the existing evidence count (@colleenXu will be removing evidence count from the response mappings) and then add special behavior to generate evidence count for Semmeddb? This would just be a straight count of PMIDs after your deduplication code.
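
A sketch of what that "straight count" could look like, in Python for illustration only. It assumes the merged edge-attributes use the `biolink:has_supporting_study_result` shape with a `biolink:publications` sub-attribute; the function name `evidence_count_attribute` is hypothetical and this is not BTE's implementation.

```python
from typing import Any


def evidence_count_attribute(attributes: list[dict[str, Any]]) -> dict[str, Any]:
    """Count the unique PMIDs found in snippet attributes and wrap the result."""
    pmids: set[str] = set()
    for attr in attributes:
        if attr.get("attribute_type_id") != "biolink:has_supporting_study_result":
            continue  # ignore non-snippet attributes
        for sub in attr.get("attributes", []):
            if sub.get("attribute_type_id") == "biolink:publications":
                pmids.add(sub["value"])
    # The value stays an int rather than a string.
    return {"attribute_type_id": "biolink:evidence_count", "value": len(pmids)}


demo_attributes = [
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "73680403",
     "attributes": [{"attribute_type_id": "biolink:publications", "value": "PMID:24597671"}]},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "142544436",
     "attributes": [{"attribute_type_id": "biolink:publications", "value": "PMID:25994789"}]},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "142545021",
     "attributes": [{"attribute_type_id": "biolink:publications", "value": "PMID:25994789"}]},
]

count_attr = evidence_count_attribute(demo_attributes)
```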

@colleenXu
Collaborator

@rjawesome

I've just pushed updates to the override yaml to remove the biolink:evidence_count response mapping. NCATS-Tangerine/translator-api-registry@f7d558f

If you still want to see the old behavior, you can adjust the override to use the older commit's version.

@rjawesome
Contributor

rjawesome commented Oct 4, 2024

@colleenXu @tokebe

  • new evidence count added based on number of unique PMIDs in the sentence attributes after merging
  • the off-by-one error with the 50-publication cap has been fixed

@colleenXu
Collaborator

colleenXu commented Oct 22, 2024

@colleenXu
Collaborator

Messy notes:

  • Automat returns edge-attributes with the same type_id but different values/info (sigh) -> we are currently only taking the first and ignoring the rest. See Duplicated edge-attributes #891
  • Two paths of edge-attribute handling
    • Plain response-mapping for x-bte stuff (deduplicated later)
    • Response-mapping edge-attributes (x-bte) + TRAPI KP edge-attributes. Currently semmeddb text-snippet stuff is here! And changes for its handling are affecting the TRAPI KP edge-attribute handling in undesired ways...
  • Last record with same hash was source of edge-attributes for TRAPI response (was rewritten each time). -> only with recent semmeddb did it start "combining" aka duplications. And now we are adding a deduplication step based on attribute_type_id which is causing the current situation.
    • So previous stuff was also weird (how far before?)
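
To make the deduplication concern concrete, here is a small Python illustration (hypothetical function names, not BTE's code and not a proposed fix). It contrasts deduplicating on `attribute_type_id` alone, which keeps only the first attribute of each type, with deduplicating on the whole attribute content, which only drops exact repeats.

```python
import json
from typing import Any


def dedupe_by_type(attributes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the first attribute for each attribute_type_id."""
    seen: set[str] = set()
    kept = []
    for attr in attributes:
        key = attr["attribute_type_id"]
        if key not in seen:
            seen.add(key)
            kept.append(attr)
    return kept


def dedupe_by_content(attributes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop only attributes that are exact duplicates (same type, value, sub-attributes)."""
    seen: set[str] = set()
    kept = []
    for attr in attributes:
        key = json.dumps(attr, sort_keys=True)  # canonical string for comparison
        if key not in seen:
            seen.add(key)
            kept.append(attr)
    return kept


attrs = [
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "73680403"},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "142544436"},
    {"attribute_type_id": "biolink:has_supporting_study_result", "value": "73680403"},
]

by_type = dedupe_by_type(attrs)
by_content = dedupe_by_content(attrs)
```

In this toy input, the type-only key leaves one attribute while the content key leaves two, which mirrors the "only taking the first and ignoring the rest" behavior noted above.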

@colleenXu added the bug label and removed the On CI label Oct 23, 2024
@colleenXu
Collaborator

colleenXu commented Oct 23, 2024

Requirements note:
Guthrie says UI uses top-level biolink:publications edge attribute (not included in original requirements gathering), but it isn't required.


We may want to rewrite the requirements anyways to make clear what we want the record-merging / evidence_count behavior to be.

@colleenXu added the next phase (for future if we're funded) label and removed the bug label Oct 24, 2024