Merge pull request #1813 from flairNLP/prepare-biomed-release

Prepare biomed release
flairNLP · Aug 17, 2020 · 1a12954
2 parents c37aa78 + 112a0fe
Showing 16 changed files with 480 additions and 148 deletions.
diff --git a/README.md b/README.md
@@ -14,18 +14,18 @@ Flair is:
 
 * **A powerful NLP library.** Flair allows you to apply our state-of-the-art natural language processing (NLP)
 models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS),
- sense disambiguation and classification.
-
-* **Multilingual.** Thanks to the Flair community, we support a rapidly growing number of languages. We also now include
-'*one model, many languages*' taggers, i.e. single models that predict PoS or NER tags for input text in various languages.
+ sense disambiguation and classification, with support for a rapidly growing number of languages. 
+ 
+* **A biomedical NER library.** Flair has special support for [biomedical data](/resources/docs/HUNFLAIR.md) with 
+state-of-the-art models for biomedical NER and support for over 32 biomedical datasets.
 
 * **A text embedding library.** Flair has simple interfaces that allow you to use and combine different word and 
-document embeddings, including our proposed **[Flair embeddings](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view?usp=sharing)**, BERT embeddings and ELMo embeddings.
+document embeddings, including our proposed **[Flair embeddings](https://www.aclweb.org/anthology/C18-1139/)**, BERT embeddings and ELMo embeddings.
 
 * **A PyTorch NLP framework.** Our framework builds directly on [PyTorch](https://pytorch.org/), making it easy to 
 train your own models and experiment with new approaches using Flair embeddings and classes.
 
-Now at [version 0.5.1](https://github.com/flairNLP/flair/releases)!
+Now at [version 0.6](https://github.com/flairNLP/flair/releases)!
 
 ## Comparison with State-of-the-Art
 
@@ -126,6 +126,9 @@ The tutorials explain how the base NLP classes work, how you can load pre-traine
 text, how you can embed your text with different word or document embeddings, and how you can train your own 
 language models, sequence labeling models, and text classification models. Let us know if anything is unclear.
 
+There is also a dedicated landing page for our **[biomedical NER and datasets](/resources/docs/HUNFLAIR.md)** with 
+installation instructions and tutorials.
+
 There are also good third-party articles and posts that illustrate how to use Flair: 
 * [How to build a text classifier with Flair](https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f)
 * [How to build a microservice with Flair and Flask](https://shekhargulati.com/2019/01/04/building-a-sentiment-analysis-python-microservice-with-flair-and-flask/)

diff --git a/flair/__init__.py b/flair/__init__.py
@@ -24,7 +24,7 @@
 
 import logging.config
 
-__version__ = "0.5.1"
+__version__ = "0.6"
 
 logging.config.dictConfig(
     {

diff --git a/flair/data.py b/flair/data.py
@@ -614,9 +614,7 @@ def get_label_names(self):
             label_names.append(label.value)
         return label_names
 
-    def get_spans(self, label_type: str, min_score=-1) -> List[Span]:
-
-        spans: List[Span] = []
+    def _add_spans_internal(self, spans: List[Span], label_type: str, min_score):
 
         current_span = []
 
@@ -688,6 +686,24 @@ def get_spans(self, label_type: str, min_score=-1) -> List[Span]:
 
         return spans
 
+    def get_spans(self, label_type: Optional[str] = None, min_score=-1) -> List[Span]:
+
+        spans: List[Span] = []
+
+        # if label type is explicitly specified, get spans for this label type
+        if label_type:
+            return self._add_spans_internal(spans, label_type, min_score)
+
+        # else determine all label types in sentence and get all spans
+        label_types = []
+        for token in self:
+            for annotation in token.annotation_layers.keys():
+                if annotation not in label_types: label_types.append(annotation)
+
+        for label_type in label_types:
+            self._add_spans_internal(spans, label_type, min_score)
+        return spans
+
     @property
     def embedding(self):
         return self.get_embedding()
@@ -755,6 +771,8 @@ def to_tagged_string(self, main_tag=None) -> str:
 
                 if token.get_labels(label_type)[0].value == "O":
                     continue
+                if token.get_labels(label_type)[0].value == "_":
+                    continue
 
                 tags.append(token.get_labels(label_type)[0].value)
             all_tags = "<" + "/".join(tags) + ">"
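
Note on the `flair/data.py` change: `get_spans()` can now be called without a `label_type`, in which case it first collects every annotation layer present on the sentence's tokens (in first-seen order) and then gathers spans for each layer in turn. The control flow can be sketched in isolation as follows. This is a minimal illustration, not Flair's implementation: tokens are modeled as plain dicts mapping layer name to tag, and `collect_label_types` and `get_tags` are hypothetical names introduced only for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a token: annotation layer name -> tag value.
Token = Dict[str, str]


def collect_label_types(tokens: List[Token]) -> List[str]:
    """Return the distinct annotation layer names in first-seen order."""
    label_types: List[str] = []
    for token in tokens:
        for layer in token.keys():
            if layer not in label_types:
                label_types.append(layer)
    return label_types


def get_tags(tokens: List[Token], label_type: Optional[str] = None) -> List[Tuple[str, str]]:
    """Toy analogue of get_spans: (layer, tag) pairs for one layer or for all layers."""
    # An explicit label type restricts the result to that layer;
    # otherwise every layer found on the tokens is used.
    layers = [label_type] if label_type else collect_label_types(tokens)
    result: List[Tuple[str, str]] = []
    for layer in layers:
        for token in tokens:
            if layer in token:
                result.append((layer, token[layer]))
    return result


tokens = [{"ner": "B-PER"}, {"ner": "O", "pos": "NN"}, {"pos": "VB"}]
print(collect_label_types(tokens))
print(get_tags(tokens, "pos"))
```

Because results are grouped per layer, the output is ordered by layer first and by token position second, which is consistent with the new expected output in `HUNFLAIR.md` listing both `Disease` spans before the `Gene` span.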

diff --git a/flair/embeddings/token.py b/flair/embeddings/token.py
@@ -117,74 +117,52 @@ def __init__(self, embeddings: str, field: str = None):
         """
         self.embeddings = embeddings
 
-        old_base_path = (
-            "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/"
-        )
-        base_path = (
-            "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.3/"
-        )
-        embeddings_path_v4 = (
-            "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4/"
-        )
+        old_base_path = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/"
+        base_path = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.3/"
+        embeddings_path_v4 = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4/"
         embeddings_path_v4_1 = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/"
+        hu_path: str = "https://flair.informatik.hu-berlin.de/resources/embeddings/"
 
         cache_dir = Path("embeddings")
 
         # GLOVE embeddings
         if embeddings.lower() == "glove" or embeddings.lower() == "en-glove":
             cached_path(f"{old_base_path}glove.gensim.vectors.npy", cache_dir=cache_dir)
-            embeddings = cached_path(
-                f"{old_base_path}glove.gensim", cache_dir=cache_dir
-            )
+            embeddings = cached_path(f"{old_base_path}glove.gensim", cache_dir=cache_dir)
 
         # TURIAN embeddings
         elif embeddings.lower() == "turian" or embeddings.lower() == "en-turian":
-            cached_path(
-                f"{embeddings_path_v4_1}turian.vectors.npy", cache_dir=cache_dir
-            )
-            embeddings = cached_path(
-                f"{embeddings_path_v4_1}turian", cache_dir=cache_dir
-            )
+            cached_path(f"{embeddings_path_v4_1}turian.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{embeddings_path_v4_1}turian", cache_dir=cache_dir)
 
         # KOMNINOS embeddings
         elif embeddings.lower() == "extvec" or embeddings.lower() == "en-extvec":
-            cached_path(
-                f"{old_base_path}extvec.gensim.vectors.npy", cache_dir=cache_dir
-            )
-            embeddings = cached_path(
-                f"{old_base_path}extvec.gensim", cache_dir=cache_dir
-            )
+            cached_path(f"{old_base_path}extvec.gensim.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{old_base_path}extvec.gensim", cache_dir=cache_dir)
+
+        # pubmed embeddings
+        elif embeddings.lower() == "pubmed" or embeddings.lower() == "en-pubmed":
+            cached_path(f"{hu_path}pubmed_pmc_wiki_sg_1M.gensim.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{hu_path}pubmed_pmc_wiki_sg_1M.gensim", cache_dir=cache_dir)
 
         # FT-CRAWL embeddings
         elif embeddings.lower() == "crawl" or embeddings.lower() == "en-crawl":
-            cached_path(
-                f"{base_path}en-fasttext-crawl-300d-1M.vectors.npy", cache_dir=cache_dir
-            )
-            embeddings = cached_path(
-                f"{base_path}en-fasttext-crawl-300d-1M", cache_dir=cache_dir
-            )
+            cached_path(f"{base_path}en-fasttext-crawl-300d-1M.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{base_path}en-fasttext-crawl-300d-1M", cache_dir=cache_dir)
 
         # FT-CRAWL embeddings
         elif (
             embeddings.lower() == "news"
             or embeddings.lower() == "en-news"
             or embeddings.lower() == "en"
         ):
-            cached_path(
-                f"{base_path}en-fasttext-news-300d-1M.vectors.npy", cache_dir=cache_dir
-            )
-            embeddings = cached_path(
-                f"{base_path}en-fasttext-news-300d-1M", cache_dir=cache_dir
-            )
+            cached_path(f"{base_path}en-fasttext-news-300d-1M.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{base_path}en-fasttext-news-300d-1M", cache_dir=cache_dir)
 
         # twitter embeddings
         elif embeddings.lower() == "twitter" or embeddings.lower() == "en-twitter":
-            cached_path(
-                f"{old_base_path}twitter.gensim.vectors.npy", cache_dir=cache_dir
-            )
-            embeddings = cached_path(
-                f"{old_base_path}twitter.gensim", cache_dir=cache_dir
-            )
+            cached_path(f"{old_base_path}twitter.gensim.vectors.npy", cache_dir=cache_dir)
+            embeddings = cached_path(f"{old_base_path}twitter.gensim", cache_dir=cache_dir)
 
         # two-letter language code wiki embeddings
         elif len(embeddings.lower()) == 2:
@@ -540,8 +518,10 @@ def __init__(self,
             "pt-forward": f"{aws_path}/embeddings-v0.4/lm-pt-forward.pt",
             "pt-backward": f"{aws_path}/embeddings-v0.4/lm-pt-backward.pt",
             # Pubmed
-            "pubmed-forward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-fw-lm.pt",
-            "pubmed-backward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-bw-lm.pt",
+            "pubmed-forward": f"{hu_path}/embeddings/pm_pmc-forward/pubmed-forward.pt",
+            "pubmed-backward": f"{hu_path}/embeddings/pm_pmc-backward/pubmed-backward.pt",
+            "pubmed-2015-forward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-fw-lm.pt",
+            "pubmed-2015-backward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-bw-lm.pt",
             # Slovenian
             "sl-forward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-forward-v0.1.pt",
             "sl-backward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-backward-v0.1.pt",

diff --git a/flair/models/sequence_tagger_model.py b/flair/models/sequence_tagger_model.py
@@ -978,6 +978,10 @@ def _fetch_model(model_name) -> str:
             [aws_resource_path_v04, "NER-conll03-english", "en-ner-conll03-v0.4.pt"]
         )
 
+        model_map["ner-pooled"] = "/".join(
+            [hu_path, "NER-conll03-english-pooled", "en-ner-conll03-pooled-v0.5.pt"]
+        )
+
         model_map["ner-fast"] = "/".join(
             [
                 aws_resource_path_v04,
@@ -1321,8 +1325,13 @@ def load(cls, model_names: Union[List[str], str]):
             # if the model uses StackedEmbedding, make a new stack with previous objects
             if type(model.embeddings) == StackedEmbeddings:
 
+                # sort embeddings by key alphabetically
                 new_stack = []
-                for embedding in model.embeddings.embeddings:
+                d = model.embeddings.get_named_embeddings_dict()
+                import collections
+                od = collections.OrderedDict(sorted(d.items()))
+
+                for k, embedding in od.items():
 
                     # check previous embeddings and add if found
                     embedding_found = False
@@ -1361,11 +1370,4 @@ def load(cls, model_names: Union[List[str], str]):
             taggers[model_name] = model
             models.append(model)
 
-        return cls(taggers)
-
-    def get_all_spans(self, sentence: Sentence):
-        spans = []
-        for name in self.name_to_tagger:
-            spans.extend(sentence.get_spans(name))
-
-        return spans
+        return cls(taggers)
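
Note on the `MultiTagger.load` change: the loop now iterates over the named embeddings sorted alphabetically by key rather than in their stored order. The sorting step on its own looks like the sketch below. The dictionary contents are made-up placeholders, and `sort_named_embeddings` is a hypothetical helper name; only the `collections.OrderedDict(sorted(d.items()))` idiom is taken from the diff.

```python
import collections
from typing import Any, Dict


def sort_named_embeddings(named: Dict[str, Any]) -> "collections.OrderedDict[str, Any]":
    """Return the mapping re-ordered alphabetically by key.

    Dictionary keys are unique, so the tuple sort never needs to
    compare the values themselves.
    """
    return collections.OrderedDict(sorted(named.items()))


# Placeholder values standing in for embedding objects.
named = {"glove": "E1", "flair-forward": "E2", "bpe": "E3"}
ordered = sort_named_embeddings(named)
print(list(ordered.keys()))  # ['bpe', 'flair-forward', 'glove']
```

A deterministic iteration order means the rebuilt stack no longer depends on the order in which each model happened to store its embeddings.
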
diff --git a/flair/trainers/trainer.py b/flair/trainers/trainer.py
@@ -84,6 +84,7 @@ def train(
         batch_growth_annealing: bool = False,
         shuffle: bool = True,
         param_selection_mode: bool = False,
+        write_weights: bool = False,
         num_workers: int = 6,
         sampler=None,
         use_amp: bool = False,
@@ -405,7 +406,7 @@ def train(
                         )
                         batch_time = 0
                         iteration = self.epoch * total_number_of_batches + batch_no
-                        if not param_selection_mode:
+                        if not param_selection_mode and write_weights:
                             weight_extractor.extract_weights(
                                 self.model.state_dict(), iteration
                             )
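
Note on the `flair/trainers/trainer.py` change: weight extraction is now opt-in. With the new default `write_weights=False`, weights are no longer extracted during training unless the caller asks for it, and `param_selection_mode=True` still suppresses extraction regardless. The condition can be checked with a toy loop; `run_batches` is a hypothetical function for illustration and does not mirror the trainer's actual structure.

```python
from typing import List


def run_batches(
    n_batches: int,
    param_selection_mode: bool = False,
    write_weights: bool = False,
) -> List[int]:
    """Return the iterations at which weights would have been extracted."""
    extracted: List[int] = []
    for iteration in range(n_batches):
        # Same condition as in the diff: both flags must allow extraction.
        if not param_selection_mode and write_weights:
            extracted.append(iteration)
    return extracted


print(run_batches(3))                      # [] (new default: nothing written)
print(run_batches(3, write_weights=True))  # [0, 1, 2]
```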

diff --git a/resources/docs/HUNFLAIR.md b/resources/docs/HUNFLAIR.md
@@ -1,12 +1,12 @@
 # HunFlair
 
-<i>HunFlair</i> is a state-of-the-art NER tagger for biomedical texts. It comes with 
-models for genes/proteins, chemicals, diseases, species and cell lines. <i>HunFlair</i> 
+*HunFlair* is a state-of-the-art NER tagger for biomedical texts. It comes with 
+models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair* 
 builds on pretrained domain-specific language models and outperforms other biomedical 
 NER tools on unseen corpora. Furthermore, it contains harmonized versions of [31 biomedical 
-NER data sets](HUNFLAIR_CORPORA.md).
-
-
+NER data sets](HUNFLAIR_CORPORA.md) and comes with a Flair language model ("pubmed-X") and
+FastText embeddings ("pubmed") that were trained on roughly 3 million full texts and about
+25 million abstracts from the biomedical domain.
 
 <b>Content:</b> 
 [Quick Start](#quick-start) | 
@@ -17,7 +17,7 @@ NER data sets](HUNFLAIR_CORPORA.md).
 ## Quick Start
 
 #### Requirements and Installation
-<i>HunFlair</i> is based on Flair 0.6+ and Python 3.6+. 
+*HunFlair* is based on Flair 0.6+ and Python 3.6+. 
 If you do not have Python 3.6, install it first. [Here is how for Ubuntu 16.04](https://vsupalov.com/developing-with-python3-6-on-ubuntu-16-04/).
 Then, in your favorite virtual environment, simply do:
 ```
@@ -34,37 +34,40 @@ pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/e
 Let's run named entity recognition (NER) over an example sentence. All you need to do is 
 make a Sentence, load a pre-trained model and use it to predict tags for the sentence:
 ```python
-import flair
+from flair.data import Sentence
+from flair.models import MultiTagger
 from flair.tokenization import SciSpacyTokenizer
 
-sentence = flair.data.Sentence(
-    "Behavioral Abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
-    use_tokenizer=SciSpacyTokenizer()
-)
+# make a sentence and tokenize with SciSpaCy
+sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
+                    use_tokenizer=SciSpacyTokenizer())
 
-tagger = flair.models.MultiTagger.load("hunflair")
+# load biomedical tagger
+tagger = MultiTagger.load("hunflair")
+
+# tag sentence
 tagger.predict(sentence)
 ```
 Done! The Sentence now has entity annotations. Let's print the entities found by the tagger:
 ```python
-for entity in tagger.get_all_spans(sentence):
+for entity in sentence.get_spans():
     print(entity)
 ```
 This should print:
 ~~~
-Span [5]: "Fmr1"   [− Labels: Gene (0.6896)]
-Span [1,2]: "Behavioral Abnormalities"   [− Labels: Disease (0.706)]
-Span [10,11,12]: "Fragile X Syndrome"   [− Labels: Disease (0.9863)]
-Span [7]: "Mouse"   [− Labels: Species (0.9517)]
+Span [1,2]: "Behavioral abnormalities"   [− Labels: Disease (0.6736)]
+Span [10,11,12]: "Fragile X Syndrome"   [− Labels: Disease (0.99)]
+Span [5]: "Fmr1"   [− Labels: Gene (0.838)]
+Span [7]: "Mouse"   [− Labels: Species (0.9979)]
 ~~~
 
 ## Comparison to other biomedical NER tools
 Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets. 
-However, they are applied "in the wild", i.e., to a much larger collection of texts, often varying in 
+However, they are applied "in the wild" to a much larger collection of texts, often varying in 
 topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract 
 vs. full text), which can lead to severe drops in performance.
 
-<i>HunFlair</i> outperforms other biomedical NER tools on corpora not used for training of neither HunFlair
+*HunFlair* outperforms other biomedical NER tools on corpora not used for training of neither *HunFlair*
 or any of the competitor tools.
 
 | Corpus         | Entity Type  | Misc<sup><sub>[1](#f1)</sub></sup>   | SciSpaCy | HUNER | HunFlair | 
@@ -81,20 +84,22 @@ or any of the competitor tools.
 <sub>All results are F1 scores using partial matching of predicted text offsets with the original char offsets 
 of the gold standard data. We allow a shift by max one character.</sub>
 
-<a name="f1">1</a>:  Misc displays the results of multiple taggers: 
+<sub><a name="f1">1</a>:  Misc displays the results of multiple taggers: 
 [tmChem](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/) for Chemical, 
 [GNormPus](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/) for Gene and Species, and 
 [DNorm](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/DNorm.html) for Disease
+</sub>
 
-
-Here's how to [reproduce these numbers](XXX) using Flair. You can also find detailed evaluations and discussions in our paper.
+Here's how to [reproduce these numbers](HUNFLAIR_EXPERIMENTS.md) using Flair. 
+You can find detailed evaluations and discussions in [our paper](http://arxiv.org/abs/XXX).
 
 ## Tutorials
-We provide a set of quick tutorials to get you started with HunFlair:
+We provide a set of quick tutorials to get you started with *HunFlair*:
 * [Tutorial 1: Tagging](HUNFLAIR_TUTORIAL_1_TAGGING.md)
+* [Tutorial 2: Training biomedical NER models](HUNFLAIR_TUTORIAL_2_TRAINING.md)
 
 ## Citing HunFlair
-Please cite the following paper when using HunFlair:
+Please cite the following paper when using *HunFlair*:
 ~~~
 @article{weber2020hunflair,
   author    = {Weber, Leon and S{\"a}nger, Mario and M{\"u}nchmeyer, Jannes and Habibi, Maryam and Leser, Ulf and Akbik, Alan},