Releases: flairNLP/flair
Release 0.14.0
This release adds major new support for biomedical text analytics! It adds improved biomedical NER and a state-of-the-art model for biomedical entity linking. Other new features include (1) support for parameter-efficient fine-tuning and (2) various new datasets, bug fixes and enhancements! We also removed a few dependencies, so Flair should install faster and take up less space!
Biomedical NER and Entity Linking
With Flair 0.14.0, you can now detect and normalize biomedical entities in text.
For example, to analyze the sentence "We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome", use this code snippet:
from flair.models import EntityMentionLinker
from flair.nn import Classifier
from flair.data import Sentence
# A sentence from biomedical literature
sentence = Sentence("We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome.")
# Tag named entities in the text
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)
# Normalize gene names
gene_linker = EntityMentionLinker.load("gene-linker")
gene_linker.predict(sentence)
# Normalize disease names
disease_linker = EntityMentionLinker.load("disease-linker")
disease_linker.predict(sentence)
# Iterate over predicted entities and print
for label in sentence.get_labels():
print(label)
This should print out:
Span[5:6]: "IFNAR2" → Gene (1.0)
Span[5:6]: "IFNAR2" → 3455/name=IFNAR2
Span[7:8]: "POLG" → Gene (1.0)
Span[7:8]: "POLG" → 5428/name=POLG
Span[9:11]: "long-COVID syndrome" → Disease (1.0)
Span[9:11]: "long-COVID syndrome" → MESH:D000094024/name=Post-Acute COVID-19 Syndrome
The printout shows that:
- "IFNAR2" is a gene. Further, it is recognized as gene 3455 ("interferon alpha and beta receptor subunit 2") in the NCBI database.
- "POLG" is a gene. Further, it is recognized as gene 5428 ("DNA polymerase gamma, catalytic subunit") in the NCBI database.
- "long-COVID syndrome" is a disease. Further, it is uniquely linked to "Post-Acute COVID-19 Syndrome" in the MESH database.
Big thanks to @sg-wbi @WangXII @mariosaenger @helpmefindaname for all their work:
- Entity Mention Linker by @helpmefindaname in #3388
- Support for biomedical datasets with multiple entity types by @WangXII in #3387
- Update documentation for Hunflair2 release by @mariosaenger in #3410
- Improve nel tutorial by @helpmefindaname in #3369
- Incorporate hunflair2 docs to docpage by @helpmefindaname in #3442
Parameter-Efficient Fine-Tuning
Flair 0.14.0 also adds support for PEFT.
For instance, to fine-tune a BERT model on the TREC question classification task using LoRA, use the following snippet:
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# Note: you need to install peft to use this feature!
from peft import LoraConfig, TaskType
# Get corpus and make label dictionary
corpus: Corpus = TREC_6()
label_type = "question_class"
label_dict = corpus.make_label_dictionary(label_type=label_type)
# Define embeddings with LoRA fine-tuning
document_embeddings = TransformerDocumentEmbeddings(
"bert-base-uncased",
fine_tune=True,
# set LoRA config
peft_config=LoraConfig(
task_type=TaskType.FEATURE_EXTRACTION,
inference_mode=False,
),
)
# define model
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)
# train model
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune(
"resources/taggers/question-classification-with-transformer",
learning_rate=5.0e-4,
mini_batch_size=4,
max_epochs=1,
)
Big thanks to @janpf for this new feature!
Smaller Library
We've removed dependencies such as gensim from the core package, since they increased the size of the Flair library and caused some compatibility/maintenance issues. This means the core package is now smaller and faster to install.
Install as always with:
pip install flair
For certain features, such as training a model that uses classic word embeddings, you still need gensim. For this use case, install with:
pip install flair[word-embeddings]
Or just install gensim separately.
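For example:
pip install gensim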
Big thanks to @helpmefindaname for this new feature!
- Make gensim optional by @helpmefindaname in #3493
- Update models for v0.14.0 by @alanakbik in #3505
- Relax version constraint for konoha by @himkt in #3394
- Dependencies maintenance updates by @helpmefindaname in #3402
- Make janome optional by @himkt in #3405
- Bump min. version of bpemb by @stefan-it in #3468
Other Improvements
New Features and Improvements
- Speed up euclidean distance calculation by @sheldon-roberts in #3485
- Add DataTriples which act just like DataPairs by @janpf in #3481
- Add random seed parameter to dataset splitting and downsampling for better reproducibility by @MattGPT-ai in #3475
- Allow cpu device even if gpu available by @drbh in #3417
- Add prediction label type for span classifier by @helpmefindaname in #3432
- Character embeddings store their embedding name too by @helpmefindaname in #3477
Bug Fixes
- TextPairRegressor: Fix data point iteration by @ya0guang in #3413
- TextPairRegressor: Fix GPU memory leak by @MattGPT-ai in #3490
- TextRegressor: Fix label_name bug by @sheldon-roberts in #3491
- SequenceTagger: Fix _all_scores_for_token in ViterbiDecoder by @mauryaland in #3455
- SentenceSplitter: Fix linking of sentences by @mariosaenger in #3397
- SentenceSplitter: Fix case where split was performed on special characters by @helpmefindaname in #3404
- Classifier: Fix loading by moving error message to main load function by @alanakbik in #3504
- Trainer: Fix edge case by loading best model at end, even when there is no final evaluation by @helpmefindaname in #3470
- TransformerEmbeddings: Fix special tokens by not replacing replace_additional_special_tokens by @helpmefindaname in #3451
- Unit tests: Fix double data_folder in unit test by @ya0guang in #3412
New Datasets
- Add revision support for all Universal Dependencies datasets by @stefan-it in #3420
- NER_ESTONIAN_NOISY: Support for Estonian NER dataset with noise by @teresaloeffelhardt in #3463
- MASAKHA_POS: Support for two new languages by @stefan-it in #3421
- UD_BAVARIAN_MAIBAAM: Add support for new Bavarian MaiBaam UD by @stefan-it in #3426
Documentation
- Minor readme fixes by @stefan-it in #3424
- Fix typo transformer-embeddings.md by @abhisheklomsh in #3500
- Fix typo in how-model-training-works.md by @abhisheklomsh in #3499
Build Management
- Fix black and ruff by @stefan-it in #3423
- Remove zappr yaml by @helpmefindaname in #3435
- Fix tests package being incorrectly included in builds by @asumagic in #3440
New Contributors
- @ya0guang made their first contribution in #3413
- @drbh made their first contribution in #3417
- @asumagic made their first contribution in #3440
- @MattGPT-ai made their first contribution in #3475
- @janpf made their first contribution in #3481
- @sheldon-roberts made their first contribution in #3485
- @abhisheklomsh made their first contribution in #3500
- @teresaloeffelhardt made their first contribution in #3463
Full Changelog: v0.13.1...v0.14.0
Release 0.13.1
This release adds some bug fixes on top of the 0.13.0 release, as well as a new dataset.
Bug fixes
- fix doc redirect by @helpmefindaname in #3366
- fix awaiting response check by @helpmefindaname in #3371
- fix has unknown label is not always initialized by @helpmefindaname in #3372
- Fix classification report if dataset has no labels by @alanakbik in #3375
- fix flert hidden context breaks reduced vocab by @helpmefindaname in #3370
- update HF cache env variable by @helpmefindaname in #3386
Enhancements
- use batch count instead of total training samples for logging metrics by @helpmefindaname in #3374
New Datasets
New Contributors
Full Changelog: v0.13.0...v0.13.1
Release 0.13.0
This release adds several major new features such as (1) faster and more memory-efficient transformer training, (2) a new plugin system for custom logging and training, (3) new API docs for better documentation - still in beta, and (4) various new models, datasets, bug fixes and enhancements. This release also increases the minimum requirement to Python 3.8!
New Feature: Faster and more memory-efficient transformer training
This release integrates @helpmefindaname's transformer-smaller-training-vocab into the ModelTrainer. This temporarily reduces a transformer's vocabulary to only the tokens in the training dataset, and restores the full vocabulary after training. Depending on the dataset, this can yield huge savings in GPU memory and much faster training.
To use this feature, simply add the flag reduce_transformer_vocab=True to the fine_tune method. For example, to fine-tune a distilbert model on TREC_6, run this code (step 7 has the flag to reduce the vocabulary):
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# 1. get the corpus
corpus: Corpus = TREC_6()
# 2. what label do we want to predict?
label_type = "question_class"
# 3. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)
# 4. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)
# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)
# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)
# 7. fine-tune the model, but **reduce the vocabulary** for faster training
trainer.fine_tune(
"resources/taggers/question-classification-with-transformer",
reduce_transformer_vocab=True, # set this to False for slow version
)
Involved PR: add reduce transformer vocab plugin by @helpmefindaname in #3217
New Feature: Trainer Plugins
A new "Plugin" system was added to the ModelTrainer
, allowing far greater options to customize the training cycle (and slimming down the code of the ModelTrainer somewhat). For instance, it is now possible to customize logging to a far greater degree and integrate third-party logging tools.
For instance, if you want to integrate ClearML logging into the above script, simply instantiate the plugin and attach it to the trainer:
[...]
# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)
# NEW: instantiate a special logger and attach it to the trainer before the training run
ClearmlLoggerPlugin(clearml.Task.init(project_name="test", task_name="test")).attach_to(trainer)
# 7. fine-tune the model, but **reduce the vocabulary** for faster training
trainer.fine_tune(
"resources/taggers/question-classification-with-transformer",
reduce_transformer_vocab=True, # set this to False for slow version
)
Involved PRs:
- Proposal: Pluggable ModelTrainer train function by @plonerma in #3084
- Major refactoring of ModelTrainer by @alanakbik in #3182
- Allow users to use no scheduler and use a custom scheduling plugin by @plonerma in #3200
- Don't pickle classes & plugins in modelcard by @helpmefindaname in #3325
- Clearml logger by @helpmefindaname in #3259
- Add a convenience conversion for flair.device by @alanakbik in #3350
API Docs and other documentation
We are working towards improving our documentation. A first step was the release of our tutorial page. Now, we are adding (in beta) online API docs to make navigating the code and options offered by Flair easier. To enable it, we changed all docstrings to Google docstrings. However, this process is still ongoing, so expect the API docs to improve in coming versions of Flair.
You can find the API docs here: https://flairnlp.github.io/flair/master/api/index.html
Involved PRs:
- Creating a doc page with autodocs by @helpmefindaname in #3273
- Google doc strings by @helpmefindaname in #3164
- Add redirects to old tutorials by @alanakbik in #3211
- Add some more documentation and (rather empty) glossary page by @helpmefindaname in #3339
- Update README.md by @eltociear in #3241
- Fix embedding finetuning tutorial by @helpmefindaname in #3301
- Fix build doc page action trigger by @helpmefindaname in #3319
- Reduce gh-actions diskspace by @helpmefindaname in #3327
- Orange secondary color by @helpmefindaname in #3321
- Bump Flair and Python versions by @alanakbik in #3355
Model Refactorings
In an effort to unify class names, we now offer models that inherit from DefaultClassifier for each label type we predict, i.e.:
- TokenClassifier for predicting Token labels
- TextPairClassifier for predicting TextPair labels
- RelationClassifier for predicting Relation labels
- SpanClassifier for predicting Span labels
- TextClassifier for predicting Sentence labels
An advantage of such a structure is that most functionality (such as new decoders) only needs to be implemented once in DefaultClassifier and is then immediately usable for all model classes.
To enable this, we renamed and extended WordTagger as TokenClassifier, and renamed EntityLinker to SpanClassifier. This is not a breaking change yet, as the old names are still available. But in the future, WordTagger and EntityLinker will be removed.
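For illustration, here is a minimal sketch of setting up the renamed TokenClassifier in place of the old WordTagger. The UD_ENGLISH corpus and the exact constructor argument names are assumptions for this example, so check the API docs for your version:
from flair.datasets import UD_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import TokenClassifier  # formerly WordTagger
# predict universal POS tags as an example of token-level labels
corpus = UD_ENGLISH().downsample(0.01)
label_type = "upos"
label_dict = corpus.make_label_dictionary(label_type=label_type)
embeddings = TransformerWordEmbeddings("distilbert-base-uncased", fine_tune=True)
# TokenClassifier inherits from DefaultClassifier, so shared functionality
# (such as alternative decoders) plugs in the same way as for other model classes
tagger = TokenClassifier(
    embeddings=embeddings,
    label_dictionary=label_dict,
    label_type=label_type,
)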
Involved PRs:
- TokenClassifier model by @alanakbik in #3203
- Rename EntityLinker and remove some legacy embeddings by @alanakbik in #3295
New Models
We also add two new model classes: (1) a TextPairRegressor for regression tasks on pairs of sentences (such as STS-B), and (2) an experimental Label Encoder method for few-shot classification.
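As a rough illustration of the first new class, the sketch below trains a TextPairRegressor on the newly added STS-B corpus. The GLUE_STSB label type and the constructor arguments shown here are assumptions, so adapt them to the actual API:
from flair.datasets import GLUE_STSB
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextPairRegressor
from flair.trainers import ModelTrainer
# STS-B contains sentence pairs annotated with a similarity score
corpus = GLUE_STSB().downsample(0.1)
embeddings = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)
# the label_type value is an assumption; check the corpus for the actual name
regressor = TextPairRegressor(embeddings=embeddings, label_type="similarity")
trainer = ModelTrainer(regressor, corpus)
trainer.fine_tune("resources/taggers/stsb-regression", max_epochs=1)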
Involved PRs:
- Add TextPair regression model by @plonerma in #3202
- Add dual encoder by @whoisjones in #3208
- Adapt LabelVerbalizer so that it also works for non-BIOES span labels by @alanakbik in #3231
New Datasets
- Integrate BigBio NER data sets into HunFlair by @mariosaenger in #3146
- Add datasets STS-B and SST-2 to flair by @plonerma in #3201
- Extend German LER Dataset by @stefan-it in #3288
- Add support for MasakhaPOS Dataset by @stefan-it in #3247
- Gh3275: sample_missing_splits in SST-2 by @plonerma in #3276
- Add German MobIE NER Dataset by @stefan-it in #3351
Build Process
- Use ruff instead of flake8 and isort by @Lingepumpe in #3213
- Update mypy by @Lingepumpe in #3210
- Use poetry instead of pipenv for developer/testing by @Lingepumpe in #3214
- Remove poetry by @helpmefindaname in #3258
Bug Fixes
- Fix serialization of config in transformers by @helpmefindaname in #3178
- Add stacklevel to log_line in order to display correct file and line number (backwards compatible) by @plonerma in #3175
- Fix tars loading by @helpmefindaname in #3212
- Fix best epoch score update by @lephong in #3220
- Fix loading of (not so) old models by @helpmefindaname in #3229
- Fix false warning for "An empty Sentence was created!" by @AbdiHaryadi in #3268
- Fix bug with sentences that do not contain a single valid transformer token by @helpmefindaname in #3230
- Fix loading of old models by @helpmefindaname in #3228
- Fix multiple arguments destination by @helpmefindaname in #3272
- Support transformers 4.31.0 by @helpmefindaname in #3289
- Fix import error by @helpmefindaname in #3336
Enhancements
- Bump min version to 3.8 by @helpmefindaname in #3297
- Use torch native amp by @helpmefindaname in #3128
- Unpin gdown dependency by @helpmefindaname in #3176
- get_spans_from_bio: Start new span for previous S- if class also changed by @Lingepumpe in #3195
- Include flair/py.typed and requirements.txt in source distribution by @dobbersc in #3206
- Better tars inference by @helpmefindaname in #3222
- prevent fasttext embeddings from being stored separately by @helpmefindaname in #3293
- recreate to_dict and add relations by @helpmefindaname in https...
Release 0.12.2
Another follow-up release to 0.12 that fixes several bugs and adds a new multilingual frame tagger. Further, our new documentation website at https://flairnlp.github.io/docs/intro is now online!
New frame tagging model #3172
Adds a new model for detecting PropBank frames. The model is trained using the "FLERT" approach, so it is much stronger than the previous 'frame' model. We also added some training data from the Universal Proposition Bank to improve multilingual frame detection.
Use it like this:
from flair.nn import Classifier
from flair.data import Sentence
# load the large frame model
model = Classifier.load('frame-large')
# English sentence with the verb "return" in two different senses
sentence = Sentence("Dirk returned to Berlin to return his hat.")
model.predict(sentence)
print(sentence)
# German sentence with the verb "trug" in two different senses
sentence_de = Sentence("Dirk trug einen Koffer und trug einen Hut.")
model.predict(sentence_de)
print(sentence_de)
This should print:
Sentence[9]: "Dirk returned to Berlin to return his hat." → ["returned"/return.01, "return"/return.02]
Sentence[9]: "Dirk trug einen Koffer und trug einen Hut." → ["trug"/carry.01, "trug"/wear.01]
The printout tells us that the verbs in both sentences are correctly disambiguated.
Documentation
- adds a pointer to the new Flair documentation website at https://flairnlp.github.io/docs/intro
- adds a night mode Flair logo #3145
Enhancements / New Features
- more consistent behavior of context dropout and FLERT token #3168
- setting device through environment variable #3148 (thanks @HallerPatrick)
- modify Sentence.to_original_text() to take into account Sentence.start_position for whitespace calculation #3150 (thanks @mauryaland)
- gather dev and test labels if the dataset is available #3162 (thanks @helpmefindaname)
Bug fixes
- fix bugs caused by wrong data point equality and caching #3157
- fix transformer smaller training vocab #3155 (thanks @helpmefindaname)
- update scispacy version #3144 (thanks @mariosaenger)
- unpin huggingface-hub #3149 (thanks @marctorsoc)
Release 0.12.1
This is a quick follow-up release to 0.12 that fixes a few small bugs and includes an improved version of our Zelda entity linker.
New Entity Linking model
We include a new version of our Zelda entity linker with improved predictions. Try it as follows:
from flair.nn import Classifier
from flair.data import Sentence
# load the model
tagger = Classifier.load('linker')
# make a sentence
sentence = Sentence('Kirk and Spock met on the Enterprise.')
# predict NER tags
tagger.predict(sentence)
# print predicted entities
for label in sentence.get_labels():
print(label)
This should print:
Span[0:1]: "Kirk" → James_T._Kirk (0.9969)
Span[2:3]: "Spock" → Spock (0.9971)
Span[6:7]: "Enterprise" → USS_Enterprise_(NCC-1701-D) (0.975)
This correctly indicates that the span "Kirk" points to "James_T._Kirk". As the prediction for the string "Enterprise" shows, the model is still beta and will be further improved with future releases.
Bug fixes
Release 0.12
Release 0.12 is out! This release greatly simplifies model usage for our users, includes our first entity linking model, adds support for the Ukrainian language, adds easy-to-use multitask learning, and many more features, improvements and bug fixes!
New Features
Simplify Flair model usage #3067
You can now load any Flair model through its parent class. Since most models inherit from Classifier
, you can load and run multiple different models with exactly the same code. So, to run three different taggers for sentiment, entities and frames, do:
from flair.data import Sentence
from flair.nn import Classifier
# load three taggers to tag entities, frames and sentiment
tagger_1 = Classifier.load('ner')
tagger_2 = Classifier.load('frame')
tagger_3 = Classifier.load('sentiment')
# example sentence
sentence = Sentence('Dirk celebrated in Essen')
# predict with all three models
tagger_1.predict(sentence)
tagger_2.predict(sentence)
tagger_3.predict(sentence)
# print all predictions
for label in sentence.get_labels():
print(label)
With this change, users no longer need to know which model classes implement which model. For more advanced users who do know this, the regular way for loading a model still works:
sentiment_tagger = TextClassifier.load('sentiment')
Entity Linking (BETA)
As of Flair 0.12 we ship an experimental entity linker trained on the Zelda dataset. The linker not only tags entities, but also attempts to link each entity to the corresponding Wikipedia URL if one exists.
To illustrate, let's use a short example text with two mentions of "Barcelona". The first refers to the football club "FC Barcelona", the second to the city "Barcelona".
from flair.nn import Classifier
from flair.data import Sentence
# load the model
tagger = Classifier.load('linker')
# make a sentence
sentence = Sentence('Bayern played against Barcelona. The match took place in Barcelona.')
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence)
This should print:
Sentence[12]: "Bayern played against Barcelona. The match took place in Barcelona." → ["Bayern"/FC_Bayern_Munich, "Barcelona"/FC_Barcelona, "Barcelona"/Barcelona]
As we can see, the linker can resolve what the two mentions of "Barcelona" refer to:
- the first mention "Barcelona" is linked to "FC_Barcelona"
- the second mention "Barcelona" is linked to "Barcelona"
Additionally, the mention "Bayern" is linked to "FC_Bayern_Munich", telling us that here the football club is meant.
Entity linking support includes:
- Support for the ZELDA candidate lists #3108 #3111
- Support for the ZELDA training and evaluation dataset #3088
Support for Ukrainian language #3026
This version adds support for Ukrainian taggers, embeddings and datasets. For instance, to do NER and POS tagging of a Ukrainian sentence, do:
# Load Ukrainian NER and POS taggers
from flair.models import SequenceTagger
ner_tagger = SequenceTagger.load('ner-ukrainian')
pos_tagger = SequenceTagger.load('pos-ukrainian')
# Tag a sentence
from flair.data import Sentence
sentence = Sentence("Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди.")
ner_tagger.predict(sentence)
pos_tagger.predict(sentence)
print(sentence)
# "Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди." →
# ["Сьогодні"/ADV, "в"/ADP, "Знам’янці"/LOC, "Знам’янці"/PROPN, "проживають"/VERB, "нащадки"/NOUN, "поета"/NOUN, "—"/PUNCT, "родина"/NOUN, "Шкоди"/PERS, "Шкоди"/PROPN, "."/PUNCT]
Multitask Learning (#2910 #3085 #3101)
We add support for multitask learning in Flair (closes #2508 and closes #1260) with hopefully a simple syntax to define multiple tasks that share parts of the model.
The most common part to share is the transformer, which you might want to fine-tune across several tasks. Instantiate a transformer embedding and pass it to two separate models that you instantiate as before:
# --- Embeddings that are shared by both models --- #
shared_embedding = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)
# --- Task 1: Sentiment Analysis (5-class) --- #
corpus_1 = SENTEVAL_SST_GRANULAR()
model_1 = TextClassifier(shared_embedding,
label_dictionary=corpus_1.make_label_dictionary("class"),
label_type="class")
# -- Task 2: Binary Sentiment Analysis on Customer Reviews -- #
corpus_2 = SENTEVAL_CR()
model_2 = TextClassifier(shared_embedding,
label_dictionary=corpus_2.make_label_dictionary("sentiment"),
label_type="sentiment",
)
# -- Define mapping (which tagger should train on which model) -- #
multitask_model, multicorpus = make_multitask_model_and_corpus(
[
(model_1, corpus_1),
(model_2, corpus_2),
]
)
# -- Create model trainer and train -- #
trainer = ModelTrainer(multitask_model, multicorpus)
trainer.fine_tune(f"resources/taggers/multitask_test")
The mapping part here defines which tagger should be trained on which corpus. By calling make_multitask_model_and_corpus
with a mapping, you get a corpus and model object that you can train as before.
Explicit context boundaries in Transformer embeddings #3073 #3078
We improve our FLERT model by now explicitly marking up context boundaries using a new [FLERT]
special token in our transformer embeddings. Our experiments show that the context marker leads to improved NER results:
Transformer | Context-Marker | CoNLL-03 Test F1
---|---|---
bert-base-uncased | none | 91.52 ± 0.16
bert-base-uncased | [SEP] | 91.38 ± 0.18
bert-base-uncased | [FLERT] | 91.56 ± 0.17
xlm-roberta-large | none | 93.73 ± 0.2
xlm-roberta-large | [SEP] | 93.76 ± 0.13
xlm-roberta-large | [FLERT] | 93.92 ± 0.14
In the table, none is the approach used in previous Flair versions. [SEP]
means using the standard separator symbol as context delimiter. [FLERT]
means using a new dedicated special token.
As [FLERT]
performs best in our experiments, the [FLERT]
context marker is now activated by default.
More details: Assume the current sentence is "Peter Blackburn", the previous sentence ends with "to boycott British lamb .", and the next sentence starts with "BRUSSELS 1996-08-22 The European Commission".
In this case,
- if use_context_separator=False, the embedding is produced from this string: "to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission"
- if use_context_separator=True, the embedding is produced from this string: "to boycott British lamb . [FLERT] Peter Blackburn [FLERT] BRUSSELS 1996-08-22 The European Commission"
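As a quick sketch of how this looks in code (defaults may differ slightly between versions, so treat the values here as illustrative), both the context length and the boundary marker are controlled on the transformer embeddings:
from flair.embeddings import TransformerWordEmbeddings
# use_context enables FLERT-style cross-sentence context (an integer sets the context length)
# use_context_separator controls whether the [FLERT] boundary marker is inserted
embeddings = TransformerWordEmbeddings(
    "xlm-roberta-large",
    use_context=64,
    use_context_separator=True,  # set to False to embed context without boundary markers
)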
Integrate transformer-smaller-training-vocab #3066
We integrate the transformer-smaller-training-vocab library into the ModelTrainer. With it, you can reduce the size of transformer models when training and evaluating models on specific datasets. This leads to faster training times and a smaller memory footprint. Documentation on this new feature will be added soon!
Masked Relation Classifier #2748 #2993 with various Encoding Strategies #3023 (BETA)
We now include BETA support for a new type of relation extraction model that leads to much higher accuracies than our vanilla relation extraction, but increases computational costs. Documentation for this will be added as we iterate on the model.
ONNX compatible models #2640 #2643 #3041 #3075
This release continues the journey on making our models more ONNX compatible.
Other features
- Add push to Hub functionalities #2897
- Add layoutlm and layoutxlm support and the SROIE dataset #2980
- Convenience method for learning rate factor #2888 #2893
New Datasets
- Add fewnerd corpus #3103
- Add support for NERMuD 2023 Dataset #3087
- Adds ZELDA Entity Linking dataset #3088
- Added Ukrainian NER and UD datasets #3069
- Add support MasakhaNER v2 dataset #3013
- Add support for MultiCoNerV2 #3006
- Add support for new ICDAR Europeana NER Dataset #2911
- datasets: add support for HIPE-2022 #2735 #2827 #2805
Major refactorings
- Unify loss reduction by making sure that all losses are summed over all points, instead of averaged #2933 #2910
- Python 3.7 #2769
- Flatten DefaultClassifier interface #2978
- Restructure Tokenizer and Splitter modules #3002
- Refactor Token and Sentence Positional Properties #3001
- Serialization of embeddings #3011
Various Improvements
Enhancements
- add functionality for using proxies #3082
- add option not to shuffle the first epoch #3076
- improved Tars Context #3063
- release optimizer memory and fix legacy tokenization #3043
- add time elapsed to training printout #2983
- separate between token-lengths and sub-token lengths #2990
- small speed optimizations #2975
- change output of .text to original string #2974
- remove BAD_EPOCHS printout for most schedulers #2970
- warn if resuming with too low max_epochs & 'additional_epochs' parameter #2895
- embeddings: add support for T5 encoder models #2896
- add py.typed file for PEP-561 compatibility #2858
- tars classifier always predict something on single label #2838
- make add_unk optional and don't use it for ner #2839
- add deprecation warning for SentenceDataset rename #2819
- more precise type hint for eval_on_train_fraction #2811
- better handling for consecutive whitespaces in Sentence #2721 (already in flair 0.11.3)
- remove unnecessary more-itertools pin #2730 (already in flair 0.11.3)
- add exclude_labels parameter to trainer.train #2724 ...
Release 0.11
Release 0.11 is taking us ever closer to that 1.0 release! This release makes large internal refactorings and code quality / efficiency improvements to prepare Flair 1.0. We also add new features such as text clustering, a regular expression tagger, more dataset manipulation options, and some preview features like a prototype decoder.
New Features
Regular Expression Tagger (#2533)
You can now do sequence labeling in Flair with regular expressions! Simply define a RegexpTagger and add some regular expressions, like in the example below:
from flair.data import Sentence
from flair.models import RegexpTagger
# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')
# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))
# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))
# tag sentence
tagger.predict(sentence)
# check out matches
for entity in sentence.get_labels():
print(entity)
Clustering with Flair (#2573 #2619)
Flair now supports clustering by way of sklearn. Embed your sentences with a pre-trained embedding like below, then cluster them with any algorithm. Check the example below where we use sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:
from sklearn.cluster import KMeans
from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel
embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)
clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)
# fit the model on a corpus
clustering_model.fit(corpus)
# save the model
clustering_model.save(model_file="clustering_model.pt")
# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")
# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
# predict for sentence
model.predict(sentence)
# print sentence with prediction
print(sentence)
Dataset Manipulations
You can now change label names, ignore labels and add custom preprocessing when loading a dataset.
For instance, the standard WNUT_17 dataset comes with 7 NER labels:
corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))
which prints:
Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work
With the following code, you rename some labels ('person' is renamed to 'PER'), merge 2 labels into 1 ('group' and 'corporation' are merged into 'ORG'), and ignore 2 other labels ('creative-work' and 'product' are ignored):
corpus = WNUT_17(in_memory=False, label_name_map={
'person': 'PER',
'location': 'LOC',
'group': 'ORG',
'corporation': 'ORG',
'product': 'O',
'creative-work': 'O', # by renaming to 'O' this tag gets ignored
})
which prints:
Dictionary with 4 tags: <unk>, PER, LOC, ORG
You can manipulate the data even more with custom preprocessing functions. See the example in #2708.
Other New Features and Data Sets
- A new WordTagger class for simple word-level predictions (#2607)
- Classic WordEmbeddings can now be fine-tuned in Flair (#2491) by setting fine_tune=True. Also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
- Add NER_MULTI_CONER Dataset (#2507)
- Add support for HIPE 2022 (#2675)
- Allow trainer to work with multiple learning rates (#2641)
- Update hyperparameter tuning (#2633)
Preview Features
Some preview features in beta stage, use at your own risk.
Prototypical networks in Flair (#2627)
Prototype networks learn prototypes for each target class. For each data point to be classified, the network predicts a vector in class-prototype space, which is then compared to all class prototypes. The prediction is then the closest class prototype. See the paper Prototypical Networks for Few-shot Learning for more info.
@plonerma implemented a custom decoder that can be added to any Flair model that inherits from DefaultClassifier (i.e. nearly all Flair models). For instance, use this script:
from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer
# what tag do we want to predict?
tag_type = 'frame'
# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)
# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
fine_tune=True,
layers='-1')
# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
embeddings_size=embeddings.embedding_length,
distance_function='euclidean',
normal_distributed_initial_prototypes=True,
)
# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings,
tag_dictionary,
tag_type,
decoder=decoder)
# initialize trainer
trainer = ModelTrainer(tagger, corpus)
# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')
Other Beta features
- Dependency Parsing in Flair (#2486 #2579)
- Lemmatization in Flair (#2531)
- Initial implementation of JsonCorpora and Datasets (#2653)
Major Refactorings
With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.
Major refactoring of Label Logic in Flair (#2607 #2609 #2645)
The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of label. The Sentence object will now be automatically aware of all labels added to it.
To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.
Before:
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")
# create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])
# make a Span-label
span_label = SpanLabel(span=span, value='University')
# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)
Now:
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")
# directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")
So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.
Refactoring of printouts (#2704)
We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.
Unified classes to reduce redundancy
Next to too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:
- Unify Corpora (#2607): Unifies several corpora into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNNLuCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets.
- Unify Transformer Embeddings (#2558, #2584, #2586): There was too much redundancy and inconsistency between the two Transformer-based embeddings classes TransformerWordEmbedding and TransformerDocumentEmbedding. Thanks to @helpmefindaname, they now both inherit from the same base object and share all features.
- Unify Tokenizers (#2607): The Tokenizer classes no longer return lists of Token, but rather lists of strings that the Sentence object converts to tokens, centralizing the offset and whitespace_after detection in one place (see the sketch below).
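To illustrate the new tokenizer behavior, here is a minimal sketch (SpaceTokenizer is used purely as an example):
from flair.data import Sentence
from flair.tokenization import SpaceTokenizer
# tokenizers now return plain strings
tokenizer = SpaceTokenizer()
print(tokenizer.tokenize("Humboldt Universität zu Berlin"))
# the Sentence object turns these strings into Token objects and computes offsets
sentence = Sentence("Humboldt Universität zu Berlin", use_tokenizer=tokenizer)
print(sentence)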
Simplifications to DefaultClassifier
The DefaultClassifier is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular:
- forward_pass simplified to return 3 instead of 4 arguments
- forward_pass returns embeddings instead of logits, allowing us to easily switch out the decoder (see the Beta feature on Prototype Networks below)
- removed the unintuitive spawn logic we no longer need due to Label refactoring
- unify dropouts across all classes (#2669)
Sequence tagger refactoring (#2361 #2550, #2561,#2564, #2585, #2565)
Major refactoring of SequenceTagger for better modularity and cod...
Release 0.10
This release adds several new features such as in-built "model cards" for all Flair models, the first pre-trained models for Relation Extraction, better support for fine-tuning and a refactoring of the model training methods for more flexibility. It also fixes a number of critical bugs that were introduced by the refactorings in Flair 0.9.
Model Trainer Enhancements
Breaking change: We changed the ModelTrainer such that you no longer pass the optimizer during initialization. Rather, it is now passed as a parameter of the train or fine_tune method.
Old syntax:
# 1. initialize trainer with AdamW optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)
# 2. run training with small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
learning_rate=5.0e-5,
mini_batch_size=4,
)
New syntax (optimizer is parameter of train method):
# 1. initialize trainer
trainer = ModelTrainer(classifier, corpus)
# 2. run training with AdamW, small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
learning_rate=5.0e-5,
mini_batch_size=4,
optimizer=torch.optim.AdamW,
)
Convenience function for fine-tuning (#2439)
Adds a fine_tune routine that sets default parameters used for fine-tuning (AdamW optimizer, small learning rate, few epochs, cyclic learning rate scheduling, etc.). Uses the new linear scheduler with warmup (#2415).
New syntax with fine_tune method:
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# 1. get the corpus
corpus: Corpus = TREC_6()
# 2. what label do we want to predict?
label_type = 'question_class'
# 3. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)
# 4. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)
# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)
# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)
# 7. run training with fine-tuning
trainer.fine_tune('resources/taggers/question-classification-with-transformer',
learning_rate=5.0e-5,
mini_batch_size=4,
)
Model Cards (#2457)
When you train any Flair model, a "model card" will now automatically be saved that stores all training parameters and versions used to train this model. Later when you load a Flair model, you can print the model card and understand how the model was trained.
The following example trains a small POS-tagger and prints the model card in the end:
# initialize corpus and make label dictionary for POS tags
corpus = UD_ENGLISH().downsample(0.01)
tag_type = "pos"
tag_dictionary = corpus.make_label_dictionary(tag_type)
# simple sequence tagger
tagger = SequenceTagger(hidden_size=256,
embeddings=WordEmbeddings("glove"),
tag_dictionary=tag_dictionary,
tag_type=tag_type)
# initialize model trainer and experiment path
trainer = ModelTrainer(tagger, corpus)
path = f'resources/taggers/model-card'
# train for a few epochs
trainer.train(path,
max_epochs=20,
)
# load best model and print "model card"
trained_model = SequenceTagger.load(path + '/best-model.pt')
trained_model.print_model_card()
This should print a model card like:
------------------------------------
--------- Flair Model Card ---------
------------------------------------
- this Flair model was trained with:
-- Flair version 0.9
-- PyTorch version 1.7.1
-- Transformers version 4.8.1
------------------------------------
------- Training Parameters: -------
------------------------------------
-- base_path = resources/taggers/model-card
-- learning_rate = 0.1
-- mini_batch_size = 32
-- mini_batch_chunk_size = None
-- max_epochs = 20
-- train_with_dev = False
-- train_with_test = False
[... shortened ...]
------------------------------------
Resume training any model (#2457)
Previously, we distinguished between checkpoints and model files. Now all models can function as checkpoints, meaning you can load them and continue training them. Say you want to load the model above (trained to epoch 20) and continue training it to epoch 25. Do it like this:
# resume training best model, but this time until epoch 25
trainer.resume(trained_model,
base_path=path + '-resume',
max_epochs=25,
)
Pass optimizer and scheduler instance
You can also now pass an initialized optimizer and scheduler to the train and fine_tune methods.
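A minimal sketch of what this can look like, assuming classifier and corpus are set up as in the snippets above (the scheduler arguments are illustrative only):
import torch
from torch.optim.lr_scheduler import OneCycleLR
from flair.trainers import ModelTrainer
trainer = ModelTrainer(classifier, corpus)
# create the instances yourself instead of passing classes
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5.0e-5)
scheduler = OneCycleLR(optimizer, max_lr=5.0e-5, total_steps=1000)
trainer.train('resources/taggers/custom-optimizer-and-scheduler',
              mini_batch_size=4,
              optimizer=optimizer,
              scheduler=scheduler,
              )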
Multi-Label Predictions and Confidence Threshold in TARS models (#2430)
Adding the possibility to set confidence thresholds on multi-label prediction in TARS, and setting whether a problem is single-label or multi-label:
from flair.models import TARSClassifier
from flair.data import Sentence
# 1. Load our pre-trained TARS model for English
tars: TARSClassifier = TARSClassifier.load('tars-base')
# switch to a multi-label task (emotion detection)
tars.switch_to_task('GO_EMOTIONS')
# sentence with two emotions
sentence = Sentence("I am happy and sad")
# predict normally
tars.predict(sentence)
print(sentence)
# predict with lower label threshold (you can set this to 0. to get all labels)
tars.predict(sentence, label_threshold=0.01)
print(sentence)
# predict and enforce a single-label prediction
tars.predict(sentence, label_threshold=0.01, multi_label=False)
print(sentence)
Relation Extraction ( #2471 #2492)
We refactored the RelationExtractor for more options, hopefully better code clarity and small speed improvements.
We also added two new relation extraction models, trained over a modified version of TACRED: relations and relations-fast. To use these models, you also need an entity tagger. The tagger identifies entities, then the relation extractor predicts relations between them.
For instance use this code:
from flair.data import Sentence
from flair.models import RelationExtractor, SequenceTagger
# 1. make example sentence
sentence = Sentence("George was born in Washington")
# 2. load entity tagger and predict entities
tagger = SequenceTagger.load('ner-fast')
tagger.predict(sentence)
# check which entities have been found in the sentence
entities = sentence.get_labels('ner')
for entity in entities:
print(entity)
# 3. load relation extractor
extractor: RelationExtractor = RelationExtractor.load('relations-fast')
# predict relations
extractor.predict(sentence)
# check which relations have been found
relations = sentence.get_labels('relation')
for relation in relations:
print(relation)
Embeddings
- Refactoring of WordEmbeddings to avoid gensim version issues and enable further fine-tuning of pre-trained embeddings (#2491)
- Refactoring of OneHotEmbeddings to fix errors caused by some corpora and enable "stable embeddings" (#2490 )
Other Enhancements and Bug Fixes
- Compatibility with gensim 4 and Python 3.9 (#2496)
- Fix TransformerWordEmbeddings if model_max_length not set in Tokenizer (#2502)
- Fix TransformerWordEmbeddings handling of lang ids (#2417)
- Fix attention mask for special Transformer architectures (#2485)
- Fix regression model (#2424)
- Fix problems caused by refactoring of Dictionary (#2429 #2435 #2453)
- Fix infinite loop in Span::to_original_text (#2462)
- Fix result object in ModelTrainer (#2519)
- Fix bug in wsd_ufsac corpus (#2521)
- Fix bugs in TARS and simple sequence tagger (#2468)
- Add Amharic FLAIR EMBEDDING model (#2494)
- Add MultiCoNer Dataset (#2507)
- Add Korean Flair Tutorials (#2516 #2517)
- Remove hyperparameter features (#2518)
- Make it optional to create logfiles and loss files (#2421)
- Small simplification of TransformerWordEmbeddings (#2425)
Release 0.9
With release 0.9 we are refactoring Flair for simplicity and speed, to make Flair faster and more easily scale to new NLP tasks. The first new tasks included in this release are Relation Extraction (RE), support for GLUE benchmark tasks and Entity Linking - all in beta for early adopters! We're working towards a Flair 1.0 release that will span the whole suite of standard NLP tasks. Also included is a new approach for Zero-Shot Sequence Labeling based on TARS! This release also includes a wealth of new datasets for all these tasks and tons of other new features and bug fixes.
Zero-Shot Sequence Labeling with TARS (#2260)
We extend the TARS zero-shot learning approach to sequence labeling and ship a pre-trained model for English NER. Try defining some classes and see if the model can find them:
# 1. Load zero-shot NER tagger
tars = TARSTagger.load('tars-ner')
# 2. Prepare some test sentences
sentences = [
Sentence("The Humboldt University of Berlin is situated near the Spree in Berlin, Germany"),
Sentence("Bayern Munich played against Real Madrid"),
Sentence("I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne"),
Sentence("Game of Thrones is my favorite series"),
]
# 3. Define some classes of named entities such as "soccer teams", "TV shows" and "rivers"
labels = ["Soccer Team", "University", "Vehicle", "River", "City", "Country", "Person", "Movie", "TV Show"]
tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')
# 4. Predict for these classes and print results
for sentence in sentences:
tars.predict(sentence)
print(sentence.to_tagged_string("ner"))
This should print:
The Humboldt <B-University> University <I-University> of <I-University> Berlin <E-University> is situated near the Spree <S-River> in Berlin <S-City> , Germany <S-Country>
Bayern <B-Soccer Team> Munich <E-Soccer Team> played against Real <B-Soccer Team> Madrid <E-Soccer Team>
I flew with an Airbus <B-Vehicle> A380 <E-Vehicle> to Peru <S-City> to pick up my Porsche <B-Vehicle> Cayenne <E-Vehicle>
Game <B-TV Show> of <I-TV Show> Thrones <E-TV Show> is my favorite series
So in these examples, we are finding entity classes such as "TV show" (Game of Thrones), "vehicle" (Airbus A380 and Porsche Cayenne), "soccer team" (Bayern Munich and Real Madrid) and "river" (Spree), even though the model was never explicitly trained for this. Note that this is ongoing research and the examples are a bit cherry-picked. We expect the zero-shot model to improve quite a bit until the next release.
New NLP Tasks and Datasets
We prototypically now support new tasks such as GLUE benchmark, Relation Extraction and Entity Linking. With this, we ship the datasets and model classes you need to train your own models. But we are still tweaking both methods, meaning that we don't ship any pre-trained models as-of-yet.
GLUE Benchmark (#2149 #2363)
A standard benchmark to evaluate progress in language understanding, mostly consisting of single and pairwise sentence classification tasks.
New datasets in Flair:
- 'GLUE_COLA' - The Corpus of Linguistic Acceptability from GLUE benchmark
- 'GLUE_MNLI' - The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark
- 'GLUE_RTE' - The RTE task from the GLUE benchmark
- 'GLUE_QNLI' - The Stanford Question Answering Dataset formatted as NLI task from the GLUE benchmark
- 'GLUE_WNLI' - The Winograd Schema Challenge formatted as NLI task from the GLUE benchmark
- 'GLUE_MRPC' - The MRPC task from GLUE benchmark
- 'GLUE_QQP' - The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent
Initialize datasets like so:
from flair.datasets import GLUE_QNLI
# load corpus
corpus = GLUE_QNLI()
# print corpus
print(corpus)
# print first sentence-pair of training data split
print(corpus.train[0])
# print all labels in corpus
print(corpus.make_label_dictionary("entailment"))
Relation Extraction (#2333 #2352)
Relation extraction classifies if and which relationship holds between two entities in a text.
Model class: RelationExtractor
Datasets in Flair:
- 'RE_ENGLISH_CONLL04' - the CoNLL-04 Relation Extraction dataset (#2333)
- 'RE_ENGLISH_SEMEVAL2010' - the SemEval-2010 Task 8 dataset on Multi-Way Classification of Semantic Relations Between Pairs of Nominals (#2333)
- 'RE_ENGLISH_TACRED' - the TAC Relation Extraction Dataset (https://nlp.stanford.edu/projects/tacred/) with 41 relations (download required) (#2333)
- 'RE_ENGLISH_DRUGPROT' - the DrugProt corpus from Biocreative VII Track 1 on drug and chemical-protein interactions (#2340 #2352)
Initialize datasets like so:
# initialize CoNLL 04 corpus for Relation extraction
corpus = RE_ENGLISH_CONLL04()
print(corpus)
# print first sentence of training split with annotations
sentence = corpus.train[0]
print(sentence)
# print label dictionary
label_dict = corpus.make_label_dictionary("relation")
print(label_dict)
Entity Linking (#2375)
Entity Linking goes one step further than NER and uniquely links entities to knowledge bases such as Wikipedia.
Model class: EntityLinker
Datasets in Flair:
- 'NEL_ENGLISH_AIDA' - the AIDA CoNLL-YAGO Entity Linking corpus on the CoNLL-03 dataset for English
- 'NEL_ENGLISH_AQUAINT' - the Aquaint Entity Linking corpus introduced in Milne and Witten (2008)
- 'NEL_ENGLISH_IITB' - the IITB Entity Linking corpus introduced in Sayali et al. (2009)
- 'NEL_ENGLISH_REDDIT' - the Reddit Entity Linking corpus introduced in Botzer et al. (2021) (only gold annotations)
- 'NEL_ENGLISH_TWEEKI' - the Tweeki Entity Linking corpus introduced in Harandizadeh and Singh (2020)
- 'NEL_GERMAN_HIPE' - the HIPE Entity Linking corpus for historical German as a sentence-segmented version
from flair.datasets import NEL_ENGLISH_REDDIT
# load corpus
corpus = NEL_ENGLISH_REDDIT()
# print corpus
print(corpus)
# print a sentence of training data split
print(corpus.train[3])
New NER Datasets
- 'NER_ARABIC_ANER' - Arabic Named Entity Recognition Corpus 4-class NER (#2188)
- 'NER_ARABIC_AQMAR' - American and Qatari Modeling of Arabic 4-class NER (modified) (#2188)
- 'NER_ENGLISH_PERSON' - NER for person names (#2271)
- 'NER_ENGLISH_WEBPAGES' - 4-class NER on web pages from Ratinov and Roth (2009) (#2232 )
- 'NER_GERMAN_POLITICS' - NEMGP corpus for German politics (#2341)
- 'NER_JAPANESE' - Japanese NER dataset automatically generated from Wikipedia (#2154)
- 'NER_MASAKHANE' - MasakhaNER: Named Entity Recognition for African Languages corpora (#2212, #2214, #2227, #2229, #2230, #2231, #2222, #2234, #2242, #2243)
Other datasets
- 'YAHOO_ANSWERS' - The 10 largest main categories from the Yahoo! Answers (#2198)
- Various Universal Dependencies datasets (#2211, #2216, #2219, #2221, #2244, #2245, #2246, #2247, #2223, #2248, #2235, #2236, #2239, #2226)
New Functionality
Support for Arabic NER (#2188)
Flair now supports NER and POS tagging for Arabic. To tag an Arabic sentence, just load the appropriate model:
# load model
tagger = SequenceTagger.load('ar-ner')
# make Arabic sentence
sentence = Sentence("احب برلين")
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
for entity in sentence.get_labels('ner'):
print(entity)
This should print:
LOC [برلين (2)] (0.9803)
More flexibility on main metric (#2161)
When training models, you can now choose any standard evaluation metric for model selection (previously it was fixed to micro F1). When calling the trainer, simply pass the desired metric as main_evaluation_metric
like so:
trainer.train('resources/taggers/your_model',
learning_rate=0.1,
mini_batch_size=32,
max_epochs=10,
main_evaluation_metric=("macro avg", 'f1-score'),
)
In this example, we now use macro F1 instead of the default micro F1.
Add handling for mapping labels to 'O' #2254
In ColumnDataset, labels can be remapped to other labels. But sometimes you may not wish to use all label types in a given dataset. You can now remap them to 'O' and thereby exclude them.
For instance, to load CoNLL-03 without MISC, do:
corpus = CONLL_03(
label_name_map={'MISC': 'O'}
)
print(corpus.make_label_dictionary('ner'))
print(corpus.train[0].to_tagged_string('ner'))
Other
Release 0.8
Release 0.8 adds major new features to Flair, including our best named entity recognition (NER) models yet and the ability to host, share and test Flair models on the HuggingFace model hub! In addition, there is a host of improvements, new features and new datasets to check out!
FLERT (#2031 #2032 #2104)
This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in our recent paper. This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.
Pre-trained FLERT models (#2130)
We add 5 new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). Load for instance with:
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("ner-large")
# make example sentence
sentence = Sentence("George Washington went to Washington")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
print(entity)
If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub, like here.
Contextualized Sentences
In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:
- Sentence objects now have next_sentence() and previous_sentence() methods that are set automatically if loaded through ColumnCorpus. This is a pointer system to navigate through sentences in a corpus:
# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)
# get a sentence
sentence = corpus.test[123]
print(sentence)
# get the previous sentence
print(sentence.previous_sentence())
# get the sentence after that
print(sentence.next_sentence())
# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())
This allows dynamic computation of contexts in the embedding classes.
- Sentence objects now have the is_document_boundary field, which is set through the ColumnCorpus. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object.
Refactored TransformerWordEmbeddings (breaking)
TransformerWordEmbeddings was refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: pooling_operation is now subtoken_pooling (to make clear that we pool subtokens), use_scalar_mean is now layer_mean (we only do a simple layer mean) and use_context can now optionally take an integer to indicate the length of the context. Default arguments are also changed.
For instance, to create embeddings with a document-level context of 64 subtokens, init like this:
embeddings = TransformerWordEmbeddings(
model='bert-base-uncased',
layers="-1",
subtoken_pooling="first",
fine_tune=True,
use_context=64,
)
Train your Own FLERT Models
You can train a FLERT-model like this:
import torch
from flair.data import Sentence
from flair.datasets import CONLL_03, WNUT_17
from flair.embeddings import TransformerWordEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
corpus = CONLL_03()
use_context = 64
hf_model = 'xlm-roberta-large'
embeddings = TransformerWordEmbeddings(
model=hf_model,
layers="-1",
subtoken_pooling="first",
fine_tune=True,
use_context=use_context,
)
tag_dictionary = corpus.make_tag_dictionary('ner')
# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type='ner',
use_crf=False,
use_rnn=False,
reproject_embeddings=False,
)
# train with XLM parameters (AdamW, 20 epochs, small LR)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
from torch.optim.lr_scheduler import OneCycleLR
context_string = '+context' if use_context else ''
trainer.train(f"resources/flert",
learning_rate=5.0e-6,
mini_batch_size=4,
mini_batch_chunk_size=1,
max_epochs=20,
scheduler=OneCycleLR,
embeddings_storage_mode='none',
weight_decay=0.,
)
We recommend training FLERT this way if accuracy is by far the most important feature you need. FLERT is quite slow since it works on the document-level.
HuggingFace model hub integration (#2040 #2108 #2115)
We now host Flair sequence tagging models on the HF model hub (thanks for all the support @huggingface!).
Overview of all models. There is a dedicated 'Flair' tag on the hub, so to get a list of all Flair models, check here.
The hub allows all users to upload and share their own models. Even better, you can enable the Inference API and so test all models online without downloading and running them. For instance, you can test our new very powerful English 18-class NER model here.
To load any sequence tagger on the model hub, use the string identifier when instantiating a model. For instance, to load our English ontonotes model with the id "flair/ner-english-ontonotes-large", do
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")
# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
print(entity)
Other New Features
New Task: Recognizing Textual Entailment (#2123)
Thanks to @marcelmmm we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.
For instance, if you want to train an RTE task of the GLUE benchmark use this script:
import torch
from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings
# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()
# 2. make the tag dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()
# 3. initialize text pair tagger
from flair.models import TextPairClassifier
tagger = TextPairClassifier(
document_embeddings=TransformerDocumentEmbeddings(),
label_dictionary=label_dictionary,
)
# 4. train trainer with AdamW
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
# 5. run training
trainer.train('resources/taggers/glue-rte-english',
learning_rate=2e-5,
mini_batch_chunk_size=2, # this can be removed if you have a big GPU
train_with_dev=True,
max_epochs=3)
Add possibility to specify empty label name to CSV corpora (#2068)
Some CSV classification datasets contain a value that means "no class". We now extend the CSVClassificationDataset so that it is possible to specify which value should be skipped using the no_class_label argument.
For instance:
# load corpus
corpus = CSVClassificationCorpus(
data_folder='resources/tasks/code/',
train_file='java_io.csv',
skip_header=True,
column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
no_class_label='NONE',
)
This causes all entries of NONE in one of the label columns to be skipped.
More options for splits in corpora and training (#2034)
For various reasons, we might want to have a Corpus that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.
We add several ways of doing so.
- If a dataset has predefined splits, like most NLP datasets, you can pass the arguments train_with_test and train_with_dev to the ModelTrainer. This causes the trainer to train over all three splits (and do no evaluation):
trainer.train(f"path/to/your/folder",
learning_rate=0.1,
mini_batch_size=16,
train_with_dev=True,
train_with_test=True,
)
- You can also now create a Corpus with fewer splits without having all three splits automatically sampled. Pass sample_missing_splits=False as argument to do this. For instance, to load the SemCor WSD corpus only as training data, do:
semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)
Add TFIDF Embeddings (#2086)
We added some old-school embeddings (thanks @yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.
To initialize these embeddings, you must pass the train split of your training corpus, i.e.
embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)
This triggers the process where the most common words are used ...