Merge pull request #1813 from flairNLP/prepare-biomed-release
Prepare biomed release
alanakbik authored Aug 17, 2020
2 parents c37aa78 + 112a0fe commit 1a12954
Showing 16 changed files with 480 additions and 148 deletions.
15 changes: 9 additions & 6 deletions README.md
@@ -14,18 +14,18 @@ Flair is:

* **A powerful NLP library.** Flair allows you to apply our state-of-the-art natural language processing (NLP)
models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS),
sense disambiguation and classification.

* **Multilingual.** Thanks to the Flair community, we support a rapidly growing number of languages. We also now include
'*one model, many languages*' taggers, i.e. single models that predict PoS or NER tags for input text in various languages.
sense disambiguation and classification, with support for a rapidly growing number of languages.
* **A biomedical NER library.** Flair has special support for [biomedical data](/resources/docs/HUNFLAIR.md) with
state-of-the-art models for biomedical NER and support for over 32 biomedical datasets.

* **A text embedding library.** Flair has simple interfaces that allow you to use and combine different word and
document embeddings, including our proposed **[Flair embeddings](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view?usp=sharing)**, BERT embeddings and ELMo embeddings.
document embeddings, including our proposed **[Flair embeddings](https://www.aclweb.org/anthology/C18-1139/)**, BERT embeddings and ELMo embeddings.

* **A PyTorch NLP framework.** Our framework builds directly on [PyTorch](https://pytorch.org/), making it easy to
train your own models and experiment with new approaches using Flair embeddings and classes.

Now at [version 0.5.1](https://github.com/flairNLP/flair/releases)!
Now at [version 0.6](https://github.com/flairNLP/flair/releases)!

## Comparison with State-of-the-Art

@@ -126,6 +126,9 @@ The tutorials explain how the base NLP classes work, how you can load pre-trained
text, how you can embed your text with different word or document embeddings, and how you can train your own
language models, sequence labeling models, and text classification models. Let us know if anything is unclear.

There is also a dedicated landing page for our **[biomedical NER and datasets](/resources/docs/HUNFLAIR.md)** with
installation instructions and tutorials.

There are also good third-party articles and posts that illustrate how to use Flair:
* [How to build a text classifier with Flair](https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f)
* [How to build a microservice with Flair and Flask](https://shekhargulati.com/2019/01/04/building-a-sentiment-analysis-python-microservice-with-flair-and-flask/)
2 changes: 1 addition & 1 deletion flair/__init__.py
@@ -24,7 +24,7 @@

import logging.config

__version__ = "0.5.1"
__version__ = "0.6"

logging.config.dictConfig(
{
24 changes: 21 additions & 3 deletions flair/data.py
@@ -614,9 +614,7 @@ def get_label_names(self):
label_names.append(label.value)
return label_names

def get_spans(self, label_type: str, min_score=-1) -> List[Span]:

spans: List[Span] = []
def _add_spans_internal(self, spans: List[Span], label_type: str, min_score):

current_span = []

@@ -688,6 +686,24 @@ def get_spans(self, label_type: str, min_score=-1) -> List[Span]:

return spans

    def get_spans(self, label_type: Optional[str] = None, min_score=-1) -> List[Span]:

        spans: List[Span] = []

        # if label type is explicitly specified, get spans for this label type
        if label_type:
            return self._add_spans_internal(spans, label_type, min_score)

        # else determine all label types in sentence and get all spans
        label_types = []
        for token in self:
            for annotation in token.annotation_layers.keys():
                if annotation not in label_types:
                    label_types.append(annotation)

        for label_type in label_types:
            self._add_spans_internal(spans, label_type, min_score)
        return spans

@property
def embedding(self):
return self.get_embedding()
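
A note on the `get_spans` change in this hunk: when no `label_type` is passed, the new code gathers every annotation layer found on the tokens and then collects spans layer by layer. The following is a minimal, self-contained sketch of that lookup order, not Flair's implementation. Tokens are modeled as plain dictionaries, spans are simplified to single tagged tokens, and `collect_label_types` and `get_spans_sketch` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a token: maps an annotation layer name to its tag.
Token = Dict[str, str]


def collect_label_types(tokens: List[Token]) -> List[str]:
    """Return annotation layer names in first-seen order, without duplicates."""
    label_types: List[str] = []
    for token in tokens:
        for layer in token.keys():
            if layer not in label_types:
                label_types.append(layer)
    return label_types


def get_spans_sketch(
    tokens: List[Token], label_type: Optional[str] = None
) -> List[Tuple[str, int, str]]:
    """Collect (layer, token index, tag) triples, skipping the outside tag "O"."""
    # Use the explicit layer if one is given, otherwise every layer that occurs.
    layers = [label_type] if label_type else collect_label_types(tokens)
    spans: List[Tuple[str, int, str]] = []
    for layer in layers:
        for index, token in enumerate(tokens):
            tag = token.get(layer, "O")
            if tag != "O":
                spans.append((layer, index, tag))
    return spans


# Made-up example data, for illustration only.
tokens = [
    {"gene": "O", "disease": "S-Disease"},
    {"gene": "S-Gene"},
    {"species": "S-Species"},
]

print(collect_label_types(tokens))          # ['gene', 'disease', 'species']
print(get_spans_sketch(tokens, "disease"))  # [('disease', 0, 'S-Disease')]
print(get_spans_sketch(tokens))
```

Because spans are gathered one layer at a time, the combined result is grouped by label type rather than sorted by token position, which appears consistent with the order of the example output in `HUNFLAIR.md`.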
@@ -755,6 +771,8 @@ def to_tagged_string(self, main_tag=None) -> str:

if token.get_labels(label_type)[0].value == "O":
continue
if token.get_labels(label_type)[0].value == "_":
continue

tags.append(token.get_labels(label_type)[0].value)
all_tags = "<" + "/".join(tags) + ">"
68 changes: 24 additions & 44 deletions flair/embeddings/token.py
@@ -117,74 +117,52 @@ def __init__(self, embeddings: str, field: str = None):
"""
self.embeddings = embeddings

old_base_path = (
"https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/"
)
base_path = (
"https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.3/"
)
embeddings_path_v4 = (
"https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4/"
)
old_base_path = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/"
base_path = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.3/"
embeddings_path_v4 = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4/"
embeddings_path_v4_1 = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/"
hu_path: str = "https://flair.informatik.hu-berlin.de/resources/embeddings/"

cache_dir = Path("embeddings")

# GLOVE embeddings
if embeddings.lower() == "glove" or embeddings.lower() == "en-glove":
cached_path(f"{old_base_path}glove.gensim.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(
f"{old_base_path}glove.gensim", cache_dir=cache_dir
)
embeddings = cached_path(f"{old_base_path}glove.gensim", cache_dir=cache_dir)

# TURIAN embeddings
elif embeddings.lower() == "turian" or embeddings.lower() == "en-turian":
cached_path(
f"{embeddings_path_v4_1}turian.vectors.npy", cache_dir=cache_dir
)
embeddings = cached_path(
f"{embeddings_path_v4_1}turian", cache_dir=cache_dir
)
cached_path(f"{embeddings_path_v4_1}turian.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{embeddings_path_v4_1}turian", cache_dir=cache_dir)

# KOMNINOS embeddings
elif embeddings.lower() == "extvec" or embeddings.lower() == "en-extvec":
cached_path(
f"{old_base_path}extvec.gensim.vectors.npy", cache_dir=cache_dir
)
embeddings = cached_path(
f"{old_base_path}extvec.gensim", cache_dir=cache_dir
)
cached_path(f"{old_base_path}extvec.gensim.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{old_base_path}extvec.gensim", cache_dir=cache_dir)

# pubmed embeddings
elif embeddings.lower() == "pubmed" or embeddings.lower() == "en-pubmed":
cached_path(f"{hu_path}pubmed_pmc_wiki_sg_1M.gensim.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{hu_path}pubmed_pmc_wiki_sg_1M.gensim", cache_dir=cache_dir)

# FT-CRAWL embeddings
elif embeddings.lower() == "crawl" or embeddings.lower() == "en-crawl":
cached_path(
f"{base_path}en-fasttext-crawl-300d-1M.vectors.npy", cache_dir=cache_dir
)
embeddings = cached_path(
f"{base_path}en-fasttext-crawl-300d-1M", cache_dir=cache_dir
)
cached_path(f"{base_path}en-fasttext-crawl-300d-1M.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{base_path}en-fasttext-crawl-300d-1M", cache_dir=cache_dir)

# FT-NEWS embeddings
elif (
embeddings.lower() == "news"
or embeddings.lower() == "en-news"
or embeddings.lower() == "en"
):
cached_path(
f"{base_path}en-fasttext-news-300d-1M.vectors.npy", cache_dir=cache_dir
)
embeddings = cached_path(
f"{base_path}en-fasttext-news-300d-1M", cache_dir=cache_dir
)
cached_path(f"{base_path}en-fasttext-news-300d-1M.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{base_path}en-fasttext-news-300d-1M", cache_dir=cache_dir)

# twitter embeddings
elif embeddings.lower() == "twitter" or embeddings.lower() == "en-twitter":
cached_path(
f"{old_base_path}twitter.gensim.vectors.npy", cache_dir=cache_dir
)
embeddings = cached_path(
f"{old_base_path}twitter.gensim", cache_dir=cache_dir
)
cached_path(f"{old_base_path}twitter.gensim.vectors.npy", cache_dir=cache_dir)
embeddings = cached_path(f"{old_base_path}twitter.gensim", cache_dir=cache_dir)

# two-letter language code wiki embeddings
elif len(embeddings.lower()) == 2:
@@ -540,8 +518,10 @@ def __init__(self,
"pt-forward": f"{aws_path}/embeddings-v0.4/lm-pt-forward.pt",
"pt-backward": f"{aws_path}/embeddings-v0.4/lm-pt-backward.pt",
# Pubmed
"pubmed-forward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-fw-lm.pt",
"pubmed-backward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-bw-lm.pt",
"pubmed-forward": f"{hu_path}/embeddings/pm_pmc-forward/pubmed-forward.pt",
"pubmed-backward": f"{hu_path}/embeddings/pm_pmc-backward/pubmed-backward.pt",
"pubmed-2015-forward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-fw-lm.pt",
"pubmed-2015-backward": f"{aws_path}/embeddings-v0.4.1/pubmed-2015-bw-lm.pt",
# Slovenian
"sl-forward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-forward-v0.1.pt",
"sl-backward": f"{aws_path}/embeddings-stefan-it/lm-sl-opus-large-backward-v0.1.pt",
20 changes: 11 additions & 9 deletions flair/models/sequence_tagger_model.py
@@ -978,6 +978,10 @@ def _fetch_model(model_name) -> str:
[aws_resource_path_v04, "NER-conll03-english", "en-ner-conll03-v0.4.pt"]
)

model_map["ner-pooled"] = "/".join(
[hu_path, "NER-conll03-english-pooled", "en-ner-conll03-pooled-v0.5.pt"]
)

model_map["ner-fast"] = "/".join(
[
aws_resource_path_v04,
@@ -1321,8 +1325,13 @@ def load(cls, model_names: Union[List[str], str]):
# if the model uses StackedEmbedding, make a new stack with previous objects
if type(model.embeddings) == StackedEmbeddings:

# sort embeddings by key alphabetically
new_stack = []
for embedding in model.embeddings.embeddings:
d = model.embeddings.get_named_embeddings_dict()
import collections
od = collections.OrderedDict(sorted(d.items()))

for k, embedding in od.items():

# check previous embeddings and add if found
embedding_found = False
@@ -1361,11 +1370,4 @@ def load(cls, model_names: Union[List[str], str]):
taggers[model_name] = model
models.append(model)

return cls(taggers)

def get_all_spans(self, sentence: Sentence):
spans = []
for name in self.name_to_tagger:
spans.extend(sentence.get_spans(name))

return spans
return cls(taggers)
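
The `load` hunk in this file now iterates over the stacked embeddings in alphabetical key order instead of list order, so the order no longer depends on how the stack was built. A small sketch of the same idiom follows. The embedding names and the string placeholders standing in for embedding objects are assumptions for illustration, and `sort_by_name` is a hypothetical helper.

```python
from collections import OrderedDict
from typing import Dict


def sort_by_name(named: Dict[str, object]) -> "OrderedDict[str, object]":
    """Return the entries ordered alphabetically by key."""
    # Keys are unique, so sorting the (key, value) pairs never compares values.
    return OrderedDict(sorted(named.items()))


# Two dictionaries with the same entries but different insertion order.
first = {"1-news-forward": "lm-forward", "0-glove": "word-vectors"}
second = {"0-glove": "word-vectors", "1-news-forward": "lm-forward"}

print(list(sort_by_name(first)))   # ['0-glove', '1-news-forward']
print(list(sort_by_name(second)))  # ['0-glove', '1-news-forward']
```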
3 changes: 2 additions & 1 deletion flair/trainers/trainer.py
@@ -84,6 +84,7 @@ def train(
batch_growth_annealing: bool = False,
shuffle: bool = True,
param_selection_mode: bool = False,
write_weights: bool = False,
num_workers: int = 6,
sampler=None,
use_amp: bool = False,
@@ -405,7 +406,7 @@ def train(
)
batch_time = 0
iteration = self.epoch * total_number_of_batches + batch_no
if not param_selection_mode:
if not param_selection_mode and write_weights:
weight_extractor.extract_weights(
self.model.state_dict(), iteration
)
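
This hunk changes the default behaviour: with `write_weights` defaulting to `False`, weights are extracted only when the caller passes `write_weights=True` and `param_selection_mode` is off. The condition can be sketched as a standalone predicate. `should_extract_weights` is a hypothetical helper used here for illustration, not part of Flair.

```python
def should_extract_weights(param_selection_mode: bool, write_weights: bool = False) -> bool:
    """Mirror of the new guard: extract only outside parameter selection and on request."""
    return not param_selection_mode and write_weights


# Enumerate all four flag combinations.
for param_selection_mode in (False, True):
    for write_weights in (False, True):
        print(
            param_selection_mode,
            write_weights,
            should_extract_weights(param_selection_mode, write_weights),
        )
```

Only the combination `param_selection_mode=False, write_weights=True` yields `True`.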
53 changes: 29 additions & 24 deletions resources/docs/HUNFLAIR.md
@@ -1,12 +1,12 @@
# HunFlair

<i>HunFlair</i> is a state-of-the-art NER tagger for biomedical texts. It comes with
models for genes/proteins, chemicals, diseases, species and cell lines. <i>HunFlair</i>
*HunFlair* is a state-of-the-art NER tagger for biomedical texts. It comes with
models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair*
builds on pretrained domain-specific language models and outperforms other biomedical
NER tools on unseen corpora. Furthermore, it contains harmonized versions of [31 biomedical
NER data sets](HUNFLAIR_CORPORA.md).


NER data sets](HUNFLAIR_CORPORA.md) and comes with a Flair language model ("pubmed-X") and
FastText embeddings ("pubmed") that were trained on roughly 3 million full texts and about
25 million abstracts from the biomedical domain.

<b>Content:</b>
[Quick Start](#quick-start) |
@@ -17,7 +17,7 @@ NER data sets](HUNFLAIR_CORPORA.md).
## Quick Start

#### Requirements and Installation
<i>HunFlair</i> is based on Flair 0.6+ and Python 3.6+.
*HunFlair* is based on Flair 0.6+ and Python 3.6+.
If you do not have Python 3.6, install it first. [Here is how for Ubuntu 16.04](https://vsupalov.com/developing-with-python3-6-on-ubuntu-16-04/).
Then, in your favorite virtual environment, simply do:
```
@@ -34,37 +34,40 @@ pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/e
Let's run named entity recognition (NER) over an example sentence. All you need to do is
make a Sentence, load a pre-trained model and use it to predict tags for the sentence:
```python
import flair
from flair.data import Sentence
from flair.models import MultiTagger
from flair.tokenization import SciSpacyTokenizer

sentence = flair.data.Sentence(
"Behavioral Abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
use_tokenizer=SciSpacyTokenizer()
)
# make a sentence and tokenize with SciSpaCy
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
use_tokenizer=SciSpacyTokenizer())

tagger = flair.models.MultiTagger.load("hunflair")
# load biomedical tagger
tagger = MultiTagger.load("hunflair")

# tag sentence
tagger.predict(sentence)
```
Done! The Sentence now has entity annotations. Let's print the entities found by the tagger:
```python
for entity in tagger.get_all_spans(sentence):
for entity in sentence.get_spans():
print(entity)
```
This should print:
~~~
Span [5]: "Fmr1" [− Labels: Gene (0.6896)]
Span [1,2]: "Behavioral Abnormalities" [− Labels: Disease (0.706)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.9863)]
Span [7]: "Mouse" [− Labels: Species (0.9517)]
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
Span [5]: "Fmr1" [− Labels: Gene (0.838)]
Span [7]: "Mouse" [− Labels: Species (0.9979)]
~~~

## Comparison to other biomedical NER tools
Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets.
However, they are applied "in the wild", i.e., to a much larger collection of texts, often varying in
However, they are applied "in the wild" to a much larger collection of texts, often varying in
topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract
vs. full text), which can lead to severe drops in performance.

<i>HunFlair</i> outperforms other biomedical NER tools on corpora not used for training of either HunFlair
*HunFlair* outperforms other biomedical NER tools on corpora not used for training of either *HunFlair*
or any of the competitor tools.

| Corpus | Entity Type | Misc<sup><sub>[1](#f1)</sub></sup> | SciSpaCy | HUNER | HunFlair |
@@ -81,20 +84,22 @@ or any of the competitor tools.
<sub>All results are F1 scores using partial matching of predicted text offsets with the original char offsets
of the gold standard data. We allow a shift by max one character.</sub>

<a name="f1">1</a>: Misc displays the results of multiple taggers:
<sub><a name="f1">1</a>: Misc displays the results of multiple taggers:
[tmChem](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/) for Chemical,
[GNormPlus](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/) for Gene and Species, and
[DNorm](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/DNorm.html) for Disease
</sub>


Here's how to [reproduce these numbers](XXX) using Flair. You can also find detailed evaluations and discussions in our paper.
Here's how to [reproduce these numbers](HUNFLAIR_EXPERIMENTS.md) using Flair.
You can find detailed evaluations and discussions in [our paper](http://arxiv.org/abs/XXX).

## Tutorials
We provide a set of quick tutorials to get you started with HunFlair:
We provide a set of quick tutorials to get you started with *HunFlair*:
* [Tutorial 1: Tagging](HUNFLAIR_TUTORIAL_1_TAGGING.md)
* [Tutorial 2: Training biomedical NER models](HUNFLAIR_TUTORIAL_2_TRAINING.md)

## Citing HunFlair
Please cite the following paper when using HunFlair:
Please cite the following paper when using *HunFlair*:
~~~
@article{weber2020hunflair,
author = {Weber, Leon and S{\"a}nger, Mario and M{\"u}nchmeyer, Jannes and Habibi, Maryam and Leser, Ulf and Akbik, Alan},