Merge pull request #3442 from flairNLP/incorporate_hunflair2_docs_to_…

…docpage Incorporate hunflair2 docs to docpage
flairNLP · May 24, 2024 · 1037fcc
2 parents 8bcc3d9 + 64afaea
commit 1037fcc
Showing 14 changed files with 750 additions and 137 deletions.
diff --git a/docs/tutorial/index.rst b/docs/tutorial/index.rst
@@ -10,4 +10,5 @@ Tutorials
    intro
    tutorial-basics/index
    tutorial-training/index
-   tutorial-embeddings/index
+   tutorial-embeddings/index
+   tutorial-hunflair2/index
diff --git a/docs/tutorial/tutorial-basics/entity-mention-linking.md b/docs/tutorial/tutorial-basics/entity-mention-linking.md
@@ -1,6 +1,7 @@
 # Using and creating entity mention linker
 
-As of Flair 0.14 we ship the [entity mention linker](#flair.models.EntityMentionLinker) - the core framework behind the [Hunflair BioNEN aproach](https://huggingface.co/hunflair)]. 
+As of Flair 0.14 we ship the [entity mention linker](#flair.models.EntityMentionLinker) - the core framework behind the [Hunflair BioNEN approach](https://huggingface.co/hunflair).
+You can read more in the [Hunflair2 tutorials](project:../tutorial-hunflair2/overview.md).
 
 ## Example 1: Printing Entity linking outputs to console
 
@@ -19,7 +20,7 @@ sentence = Sentence(
     use_tokenizer=SciSpacyTokenizer()
 )
 
-ner_tagger = Classifier.load("hunflair")
+ner_tagger = Classifier.load("hunflair2")
 ner_tagger.predict(sentence)
 
 nen_tagger = EntityMentionLinker.load("disease-linker-no-ab3p")
@@ -31,7 +32,7 @@ for tag in sentence.get_labels():
 
 ```{note}
   Here we use the `disease-linker-no-ab3p` model, as it is the simplest model to run. You might get better results by using `disease-linker` instead,
-  but under the hood ab3p uses an executeable that is only compiled for linux and therefore won't run on every system.
+  but that would require you to install `pyab3p` via `pip install pyab3p`.
   
   Analogously to `disease` there are also linker for `chemical`, `species` and `gene`
   all work with the `{entity_type}-linker` or `{entity_type}-linker-no-ab3p` naming-schema 

diff --git a/docs/tutorial/tutorial-hunflair2/customize-linking.md b/docs/tutorial/tutorial-hunflair2/customize-linking.md
@@ -0,0 +1,146 @@
+# HunFlair2 Tutorial 4: Customizing linking models
+
+In this tutorial you'll find information on how to customize the entity linking models according to your needs.
+As of now, fine-tuning the models is not supported.
+
+## Customize dictionary
+
+All linking models come with a pre-defined pairing of entity type and dictionary,
+e.g. "Disease" mentions are linked by default to the [CTD Diseases](https://ctdbase.org/help/diseaseDetailHelp.jsp).
+You can change the dictionary to which mentions are linked by following the steps below.
+We'll be using the [Human Phenotype Ontology](https://hpo.jax.org/app/) in our example
+(Download the `hp.json` file you find [here](https://hpo.jax.org/app/data/ontology) if you want to follow along).
+
+First, we load a Python dictionary that maps names to concept identifiers from the original data:
+
+```python
+import json
+from collections import defaultdict
+
+with open("hp.json") as fp:
+    data = json.load(fp)
+
+# Keep only ontology classes that carry a label
+nodes = [n for n in data['graphs'][0]['nodes'] if n.get('type') == 'CLASS' and 'lbl' in n]
+hpo = defaultdict(list)
+for node in nodes:
+    concept_id = node['id'].replace('http://purl.obolibrary.org/obo/', '')
+    # OBO Graphs JSON nests synonyms under "meta" -> "synonyms"
+    synonyms = node.get('meta', {}).get('synonyms', [])
+    names = [node['lbl']] + [s['val'] for s in synonyms]
+    for name in names:
+        hpo[name].append(concept_id)
+```
+
+Then we can convert this mapping into an [`InMemoryEntityLinkingDictionary`](#flair.datasets.entity_linking.InMemoryEntityLinkingDictionary) that can be used by our linking model:
+
+```python
+from flair.datasets.entity_linking import (
+    InMemoryEntityLinkingDictionary,
+    EntityCandidate,
+)
+
+database_name = "HPO"
+
+candidates = [
+    EntityCandidate(
+        concept_id=ids[0],
+        concept_name=name,
+        additional_ids=ids[1:],
+        database_name=database_name,
+    )
+    for name, ids in hpo.items()
+]
+
+dictionary = InMemoryEntityLinkingDictionary(
+    candidates=candidates, dataset_name=database_name
+)
+```
+
+To use this dictionary you need to initialize a new linker model with it.
+See the section below for that.
+
+## Custom pre-trained model
+
+You can initialize a new [`EntityMentionLinker`](#flair.models.EntityMentionLinker) with both a custom model and custom dictionary (see section above) like this:
+
+```python
+from flair.models import EntityMentionLinker
+pretrained_model = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
+linker = EntityMentionLinker.build(
+    pretrained_model,
+    dictionary=dictionary,
+    hybrid_search=False,
+    entity_type="disease",
+)
+```
+
+Omitting the `dictionary` parameter will load the default dictionary for the specified `entity_type`.
+
+## Customizing Prediction Labels
+
+In the default setup, all linker models write their predictions to the same annotation category, *link*.
+To record the NEN annotation in separate categories, you can use the `pred_label_type` parameter of the
+[`predict()`](#flair.models.EntityMentionLinker.predict) method:
+
+```python
+gene_linker.predict(sentence, pred_label_type="my-genes")
+disease_linker.predict(sentence, pred_label_type="my-diseases")
+
+print("Diseases:")
+for disease_tag in sentence.get_labels("my-diseases"):
+    print(disease_tag)
+
+print("\nGenes:")
+for gene_tag in sentence.get_labels("my-genes"):
+    print(gene_tag)
+```
+
+This will output:
+
+```
+Diseases:
+Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
+Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203)
+
+Genes:
+Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
+```
+
+Moreover, each linker has a pre-defined configuration specifying for which NER annotations it should compute
+entity links:
+
+```python
+print(gene_linker.entity_label_types)
+print(disease_linker.entity_label_types)
+```
+
+By default all models will use the *ner* annotation category and apply the linking algorithm for annotations
+of the respective entity type:
+
+```python
+{'ner': {'gene'}}
+{'ner': {'disease'}}
+```
+
+You can customize this by using the `entity_label_types` parameter of the [`predict()`](#flair.models.EntityMentionLinker.predict) method:
+
+```python
+sentence = Sentence(
+    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, "
+    "a neurodegenerative disease, which is exacerbated by exposure to high "
+    "levels of mercury in mouse populations."
+)
+
+from flair.models import SequenceTagger
+
+# Use disease ner tagger from HunFlair v1
+hunflair1_tagger = SequenceTagger.load("hunflair-disease")
+hunflair1_tagger.predict(sentence, label_name="my-diseases")
+
+# Use the entity_label_types parameter in predict() to specify the annotation category
+disease_linker.predict(sentence, entity_label_types="my-diseases")
+```
+
+If your texts carry finer-grained NER annotations, you can specify both the
+annotation category and the tag types using a dictionary. For instance:
+
+```python
+gene_linker.predict(sentence, entity_label_types={"ner": {"gene", "protein"}})
+```
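The candidate-building step in `customize-linking.md` above can be tried without downloading `hp.json` or installing Flair. The sketch below is illustrative only: `Candidate` is a hypothetical stand-in for `EntityCandidate`, and the names and identifiers are made-up placeholders, not real HPO entries. It shows how a name shared by several concepts ends up with one primary identifier (`ids[0]`) and the rest in `additional_ids` (`ids[1:]`).

```python
from collections import defaultdict
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for flair's EntityCandidate (illustration only)."""

    concept_id: str
    concept_name: str
    additional_ids: List[str]
    database_name: str


# Made-up (name, identifier) pairs; the identifiers are placeholders.
pairs = [
    ("example phenotype", "HP_EXAMPLE_1"),
    ("shared synonym", "HP_EXAMPLE_1"),
    ("shared synonym", "HP_EXAMPLE_2"),
]

# Same shape as the name -> identifiers mapping built from hp.json.
name_to_ids = defaultdict(list)
for name, concept_id in pairs:
    name_to_ids[name].append(concept_id)

# The first identifier becomes the primary one, the rest are kept as additional ids.
candidates = [
    Candidate(ids[0], name, ids[1:], "HPO")
    for name, ids in name_to_ids.items()
]

for candidate in candidates:
    print(candidate)
```

Which identifier becomes the primary one depends only on the order in which the names were read, so a real dictionary may need an explicit rule for choosing it.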
diff --git a/docs/tutorial/tutorial-hunflair2/index.rst b/docs/tutorial/tutorial-hunflair2/index.rst
@@ -0,0 +1,17 @@
+Tutorial: HunFlair2
+===================
+
+*HunFlair2* is a state-of-the-art named entity tagger and linker for biomedical texts. It comes with
+models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair2*
+builds on pretrained domain-specific language models and outperforms other biomedical
+NER tools on unseen corpora.
+
+.. toctree::
+   :glob:
+   :maxdepth: 1
+
+   overview
+   tagging
+   linking
+   training-ner-models
+   customize-linking
diff --git a/docs/tutorial/tutorial-hunflair2/linking.md b/docs/tutorial/tutorial-hunflair2/linking.md
@@ -0,0 +1,90 @@
+# HunFlair2 - Tutorial 2: Entity Linking
+
+[Part 1](project:./tagging.md) of the tutorial showed how to use our pre-trained *HunFlair2* models to
+tag biomedical entities in your text. However, documents from different biomedical (sub-)fields may use different
+terms to refer to the exact same concept, e.g., “_tumor protein p53_”, “_tumor suppressor p53_”, and “_TRP53_” are all
+valid names for the gene “TP53” ([NCBI Gene:7157](https://www.ncbi.nlm.nih.gov/gene/7157)).
+To integrate and aggregate entity mentions from multiple documents, the entities must be linked
+(normalized) to standardized ontologies or knowledge bases.
+
+## Linking with pre-trained HunFlair2 Models
+
+After adding named entity recognition tags to your sentence, you can link the entities to standard ontologies
+using distinct, type-specific linking models:
+
+```python
+from flair.models import EntityMentionLinker
+from flair.nn import Classifier
+from flair.data import Sentence
+
+sentence = Sentence(
+    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, "
+    "a neurodegenerative disease, which is exacerbated by exposure to high "
+    "levels of mercury in mouse populations."
+)
+
+# Tag named entities in the text
+ner_tagger = Classifier.load("hunflair2")
+ner_tagger.predict(sentence)
+
+# Load disease linker and perform disease linking
+disease_linker = EntityMentionLinker.load("disease-linker")
+disease_linker.predict(sentence)
+
+# Load gene linker and perform gene linking
+gene_linker = EntityMentionLinker.load("gene-linker")
+gene_linker.predict(sentence)
+
+# Load chemical linker and perform chemical linking
+chemical_linker = EntityMentionLinker.load("chemical-linker")
+chemical_linker.predict(sentence)
+
+# Load species linker and perform species linking
+species_linker = EntityMentionLinker.load("species-linker")
+species_linker.predict(sentence)
+```
+
+```{note}
+The ontologies and knowledge bases used are pre-processed the first time the normalisation is executed,
+which might take some time. All further calls reuse this pre-processing and run
+much faster.
+```
+
+After running the code we can inspect and output the linked entities via:
+
+```python
+for tag in sentence.get_labels("link"):
+    print(tag)
+```
+
+This should print:
+
+```
+Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
+Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
+Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203)
+Span[23:24]: "mercury" → MESH:D008628/name=Mercury (220.39199829101562)
+Span[25:26]: "mouse" → 10090/name=Mus musculus (213.6201934814453)
+```
+
+For each entity, the output contains both the NER mention annotation and the ontology identifier to which
+the mention was mapped. It also gives the official name of the entity in the ontology and the similarity score
+between the entity mention and the ontology concept. For instance, the official name for the disease
+"_X-linked adrenoleukodystrophy_" is "Adrenoleukodystrophy". The similarity scores are specific to the entity type,
+ontology and linking model used and can therefore only be compared with scores obtained using the exact same
+setup.
+
+## Overview of pre-trained Entity Linking Models
+
+HunFlair2 comes with the following pre-trained linking models:
+
+| Entity Type | Model Name        | Ontology / Dictionary                                      | Linking Algorithm / Base Model (Data Set)                                               |
+| ----------- | ----------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------- |
+| Chemical    | `chemical-linker` | [CTD Chemicals](https://ctdbase.org/downloads/#allchems)   | [SapBERT (BC5CDR)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-chemical)      |
+| Disease     | `disease-linker`  | [CTD Diseases](https://ctdbase.org/downloads/#alldiseases) | [SapBERT (NCBI Disease)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease) |
+| Gene        | `gene-linker`     | [NCBI Gene (Human)](https://www.ncbi.nlm.nih.gov/gene)     | [SapBERT (BC2GN)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn)                 |
+| Species     | `species-linker`  | [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)     | [SapBERT (UMLS)](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext)  |
+
+For detailed information concerning the different models and their integration please refer to [our paper](https://arxiv.org/abs/2402.12372).
+
+If you wish to customize the models and dictionaries please refer to the [dedicated tutorial](project:./customize-linking.md).
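The motivation given at the start of `linking.md` (different surface forms naming the same concept) can be illustrated with a small aggregation sketch. The (mention, identifier) pairs below are written by hand from the examples in the text and are not produced by a linker; in practice they would presumably be read from the labels returned by `sentence.get_labels("link")`.

```python
from collections import defaultdict

# Hand-written (mention, concept identifier) pairs taken from the tutorial text.
linked_mentions = [
    ("tumor protein p53", "7157"),
    ("tumor suppressor p53", "7157"),
    ("TRP53", "7157"),
    ("ABCD1", "215"),
]

# Group surface forms by the concept they were linked to.
mentions_by_concept = defaultdict(set)
for mention, concept_id in linked_mentions:
    mentions_by_concept[concept_id].add(mention)

for concept_id, mentions in mentions_by_concept.items():
    print(concept_id, sorted(mentions))
```

Grouping by identifier rather than by surface form is what lets mentions from different documents be counted as the same gene.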