Release 0.12
Release 0.12 is out! This release greatly simplifies model usage for our users, includes our first entity linking model, adds support for the Ukrainian language, adds easy-to-use multitask learning, and many more features, improvements and bug fixes!
New Features
Simplify Flair model usage #3067
You can now load any Flair model through its parent class. Since most models inherit from Classifier, you can load and run multiple different models with exactly the same code. So, to run three different taggers for sentiment, entities and frames, do:
from flair.data import Sentence
from flair.nn import Classifier
# load three taggers to tag entities, frames and sentiment
tagger_1 = Classifier.load('ner')
tagger_2 = Classifier.load('frame')
tagger_3 = Classifier.load('sentiment')
# example sentence
sentence = Sentence('Dirk celebrated in Essen')
# predict with all three models
tagger_1.predict(sentence)
tagger_2.predict(sentence)
tagger_3.predict(sentence)
# print all predictions
for label in sentence.get_labels():
    print(label)
With this change, users no longer need to know which model classes implement which model. For more advanced users who do know this, the regular way for loading a model still works:
from flair.models import TextClassifier
sentiment_tagger = TextClassifier.load('sentiment')
Entity Linking (BETA)
As of Flair 0.12, we ship an experimental entity linker trained on the ZELDA dataset. The linker not only tags entities, but also attempts to link each entity to the corresponding Wikipedia URL if one exists.
To illustrate, let's use a short example text with two mentions of "Barcelona". The first refers to the football club "FC Barcelona", the second to the city "Barcelona".
from flair.nn import Classifier
from flair.data import Sentence
# load the model
tagger = Classifier.load('linker')
# make a sentence
sentence = Sentence('Bayern played against Barcelona. The match took place in Barcelona.')
# predict entities and their links
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence)
This should print:
Sentence[12]: "Bayern played against Barcelona. The match took place in Barcelona." → ["Bayern"/FC_Bayern_Munich, "Barcelona"/FC_Barcelona, "Barcelona"/Barcelona]
As we can see, the linker can resolve what the two mentions of "Barcelona" refer to:
- the first mention "Barcelona" is linked to "FC_Barcelona"
- the second mention "Barcelona" is linked to "Barcelona"
Additionally, the mention "Bayern" is linked to "FC_Bayern_Munich", telling us that here the football club is meant.
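The label values in the output above look like Wikipedia page titles. As a minimal sketch, assuming these titles refer to English Wikipedia, they could be turned into full URLs as shown below. The helper wikipedia_url and the base URL are illustrative assumptions and not part of Flair.

```python
from urllib.parse import quote

# Assumption: linker label values are English Wikipedia page titles.
WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_url(page_title: str) -> str:
    """Build a Wikipedia URL from a page title such as 'FC_Barcelona'.

    Hypothetical helper for illustration; not a Flair function.
    """
    # Percent-encode unsafe characters but keep common title punctuation.
    return WIKIPEDIA_BASE + quote(page_title, safe="_()',:")


# Label values copied from the example output above
for title in ["FC_Bayern_Munich", "FC_Barcelona", "Barcelona"]:
    print(wikipedia_url(title))
```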
Entity linking support includes:
- Support for the ZELDA candidate lists #3108 #3111
- Support for the ZELDA training and evaluation dataset #3088
Support for Ukrainian language #3026
This version adds support for Ukrainian taggers, embeddings and datasets. For instance, to do NER and POS tagging of a Ukrainian sentence, do:
# Load Ukrainian NER and POS taggers
from flair.models import SequenceTagger
ner_tagger = SequenceTagger.load('ner-ukrainian')
pos_tagger = SequenceTagger.load('pos-ukrainian')
# Tag a sentence
from flair.data import Sentence
sentence = Sentence("Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди.")
ner_tagger.predict(sentence)
pos_tagger.predict(sentence)
print(sentence)
# "Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди." →
# ["Сьогодні"/ADV, "в"/ADP, "Знам’янці"/LOC, "Знам’янці"/PROPN, "проживають"/VERB, "нащадки"/NOUN, "поета"/NOUN, "—"/PUNCT, "родина"/NOUN, "Шкоди"/PERS, "Шкоди"/PROPN, "."/PUNCT]
Multitask Learning (#2910 #3085 #3101)
We add support for multitask learning in Flair (closes #2508 and #1260), with a simple syntax for defining multiple tasks that share parts of the model.
The most common part to share is the transformer, which you might want to fine-tune across several tasks. Instantiate a transformer embedding and pass it to two separate models that you instantiate as before:
from flair.datasets import SENTEVAL_CR, SENTEVAL_SST_GRANULAR
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.nn.multitask import make_multitask_model_and_corpus
from flair.trainers import ModelTrainer
# --- Embeddings that are shared by both models --- #
shared_embedding = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)
# --- Task 1: Sentiment Analysis (5-class) --- #
corpus_1 = SENTEVAL_SST_GRANULAR()
model_1 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_1.make_label_dictionary("class"),
                         label_type="class")
# -- Task 2: Binary Sentiment Analysis on Customer Reviews -- #
corpus_2 = SENTEVAL_CR()
model_2 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_2.make_label_dictionary("sentiment"),
                         label_type="sentiment",
                         )
# -- Define mapping (which model should train on which corpus) -- #
multitask_model, multicorpus = make_multitask_model_and_corpus(
    [
        (model_1, corpus_1),
        (model_2, corpus_2),
    ]
)
# -- Create model trainer and train -- #
trainer = ModelTrainer(multitask_model, multicorpus)
trainer.fine_tune("resources/taggers/multitask_test")
The mapping defines which model is trained on which corpus. By calling make_multitask_model_and_corpus with a mapping, you get a corpus and model object that you can train as before.
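To make the idea of a model-to-corpus mapping with a shared component more concrete, here is a toy sketch in plain Python. It is not Flair's implementation: SharedEncoder, ToyTaskModel, make_toy_multitask and multitask_loss are hypothetical names, and the "embedding" is just a word count. It only shows how data points can be tagged with a task id, routed to the matching model, and how both models call the same encoder object.

```python
from typing import Dict, List, Tuple


class SharedEncoder:
    """Hypothetical stand-in for a shared transformer embedding."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> float:
        # Toy "embedding": the number of whitespace-separated words
        self.calls += 1
        return float(len(text.split()))


class ToyTaskModel:
    """Hypothetical task head that reuses the shared encoder."""

    def __init__(self, encoder: SharedEncoder, weight: float) -> None:
        self.encoder = encoder
        self.weight = weight

    def loss(self, text: str, target: float) -> float:
        # Squared error between the toy prediction and the target
        return (self.weight * self.encoder.embed(text) - target) ** 2


def make_toy_multitask(pairs: List[Tuple[ToyTaskModel, List[Tuple[str, float]]]]):
    """Turn (model, corpus) pairs into a task-id lookup and one merged corpus."""
    models: Dict[str, ToyTaskModel] = {}
    corpus: List[Tuple[str, str, float]] = []
    for index, (model, data) in enumerate(pairs):
        task_id = f"Task_{index}"
        models[task_id] = model
        corpus.extend((task_id, text, target) for text, target in data)
    return models, corpus


def multitask_loss(models: Dict[str, ToyTaskModel], batch) -> float:
    # Route every data point to the model of its task and sum the losses
    return sum(models[task_id].loss(text, target) for task_id, text, target in batch)


encoder = SharedEncoder()
model_1 = ToyTaskModel(encoder, weight=1.0)
model_2 = ToyTaskModel(encoder, weight=0.5)
models, corpus = make_toy_multitask([
    (model_1, [("a great movie", 3.0)]),
    (model_2, [("bad", 0.0), ("very good product", 1.0)]),
])
total = multitask_loss(models, corpus)
print(total, encoder.calls)
```

Because both toy models hold a reference to the same encoder object, any update to the encoder would affect both tasks, which is the point of sharing the transformer in the real setup.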
Explicit context boundaries in Transformer embeddings #3073 #3078
We improve our FLERT model by explicitly marking context boundaries with a new [FLERT] special token in our transformer embeddings. Our experiments show that the context marker leads to improved NER results:
Transformer | Context-Marker | CoNLL-03 Test F1 |
---|---|---|
bert-base-uncased | none | 91.52 +- 0.16 |
bert-base-uncased | [SEP] | 91.38 +- 0.18 |
bert-base-uncased | [FLERT] | 91.56 +- 0.17 |
xlm-roberta-large | none | 93.73 +- 0.2 |
xlm-roberta-large | [SEP] | 93.76 +- 0.13 |
xlm-roberta-large | [FLERT] | 93.92 +- 0.14 |
In the table, none is the approach used in previous Flair versions, [SEP] means using the standard separator symbol as context delimiter, and [FLERT] means using a new dedicated special token.
As [FLERT] performs best in our experiments, the [FLERT] context marker is now activated by default.
More details: Assume the current sentence is "Peter Blackburn", the previous sentence ends with "to boycott British lamb .", and the next sentence starts with "BRUSSELS 1996-08-22 The European Commission".
In this case,
- if use_context_separator=False, the embedding is produced from this string: to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission
- if use_context_separator=True, the embedding is produced from this string: to boycott British lamb . [FLERT] Peter Blackburn [FLERT] BRUSSELS 1996-08-22 The European Commission
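The two cases can be reproduced with a few lines of plain Python. This is only a sketch at the string level: the function name build_context_string is hypothetical, and Flair itself assembles the context from tokens inside its transformer embeddings rather than by joining strings.

```python
from typing import List

FLERT_TOKEN = "[FLERT]"


def build_context_string(
    left_context: str,
    sentence: str,
    right_context: str,
    use_context_separator: bool = True,
) -> str:
    """Join a sentence with its surrounding context.

    Hypothetical helper for illustration. With use_context_separator=True,
    the sentence is wrapped in [FLERT] markers on both sides.
    """
    parts: List[str] = [left_context]
    if use_context_separator:
        parts += [FLERT_TOKEN, sentence, FLERT_TOKEN]
    else:
        parts.append(sentence)
    parts.append(right_context)
    return " ".join(parts)


left = "to boycott British lamb ."
current = "Peter Blackburn"
right = "BRUSSELS 1996-08-22 The European Commission"

print(build_context_string(left, current, right, use_context_separator=False))
print(build_context_string(left, current, right, use_context_separator=True))
```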
Integrate transformer-smaller-training-vocab #3066
We integrate the transformer-smaller-training-vocab library into the ModelTrainer. With it, you can reduce the size of transformer models when training and evaluating models on specific datasets. This leads to faster training times and a smaller memory footprint. Documentation on this new feature will be added soon!
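As a rough sketch of the general idea, assuming the size reduction comes from keeping only the vocabulary entries that actually occur in the dataset (as the library name suggests), the toy example below shrinks an embedding matrix and remaps token ids. The function reduce_vocab and the data are hypothetical; this is not the library's API, and the real integration works on actual transformer models and tokenizers.

```python
from typing import Dict, List, Tuple


def reduce_vocab(
    embedding_matrix: List[List[float]],
    corpus_token_ids: List[List[int]],
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only embedding rows for token ids that occur in the corpus.

    Hypothetical helper for illustration. Returns the reduced matrix and a
    mapping from old token ids to new, contiguous ids.
    """
    used_ids = sorted({token_id for seq in corpus_token_ids for token_id in seq})
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    reduced_matrix = [embedding_matrix[old] for old in used_ids]
    return reduced_matrix, old_to_new


# Toy "model": 10 vocabulary entries with 2-dimensional embeddings
full_matrix = [[float(i), float(i) * 2] for i in range(10)]
# Toy dataset: token-id sequences that use only ids 1, 4 and 7
corpus = [[1, 4, 4, 7], [7, 1]]

reduced, mapping = reduce_vocab(full_matrix, corpus)
remapped = [[mapping[token_id] for token_id in seq] for seq in corpus]

print(len(full_matrix), "->", len(reduced))
print(remapped)
```

In this toy setting the embedding matrix shrinks from 10 rows to 3, which is where the memory and speed savings would come from on a real vocabulary.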
Masked Relation Classifier #2748 #2993 with various Encoding Strategies #3023 (BETA)
We now include BETA support for a new type of relation extraction model that leads to much higher accuracies than our vanilla relation extraction, but increases computational costs. Documentation for this will be added as we iterate on the model.
ONNX compatible models #2640 #2643 #3041 #3075
This release continues the work toward making our models ONNX-compatible.
Other features
- Add push to Hub functionalities #2897
- Add LayoutLM and LayoutXLM support and the SROIE dataset #2980
- Convenience method for learning rate factor #2888 #2893
New Datasets
- Add fewnerd corpus #3103
- Add support for NERMuD 2023 Dataset #3087
- Adds ZELDA Entity Linking dataset #3088
- Added Ukrainian NER and UD datasets #3069
- Add support for MasakhaNER v2 dataset #3013
- Add support for MultiCoNerV2 #3006
- Add support for new ICDAR Europeana NER Dataset #2911
- datasets: add support for HIPE-2022 #2735 #2827 #2805
Major refactorings
- Unify loss reduction by making sure that all losses are summed over all points, instead of averaged #2933 #2910
- Move to Python 3.7 as the minimum supported version #2769
- Flatten DefaultClassifier interface #2978
- Restructure Tokenizer and Splitter modules #3002
- Refactor Token and Sentence Positional Properties #3001
- Serialization of embeddings #3011
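To illustrate the loss-reduction change listed above, the following toy example compares summing and averaging per-point losses. It is a sketch in plain Python with made-up loss values, not Flair code. One practical consequence it shows: summed losses give the same total no matter how the data points are split into batches, while per-batch averages do not.

```python
from typing import List


def summed_loss(batch: List[float]) -> float:
    """Reduce a batch by summing the per-point losses."""
    return sum(batch)


def averaged_loss(batch: List[float]) -> float:
    """Reduce a batch by averaging the per-point losses."""
    return sum(batch) / len(batch)


# Made-up per-point losses for six data points
point_losses = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]

# Two ways of splitting the same data points into two batches
split_even = [point_losses[:3], point_losses[3:]]
split_uneven = [point_losses[:5], point_losses[5:]]

sum_even = sum(summed_loss(b) for b in split_even)
sum_uneven = sum(summed_loss(b) for b in split_uneven)
avg_even = sum(averaged_loss(b) for b in split_even)
avg_uneven = sum(averaged_loss(b) for b in split_uneven)

print(sum_even, sum_uneven)  # 24.0 24.0
print(avg_even, avg_uneven)  # 8.0 12.0
```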
Various Improvements
Enhancements
- add functionality for using proxies #3082
- add option not to shuffle the first epoch #3076
- improved Tars Context #3063
- release optimizer memory and fix legacy tokenization #3043
- add time elapsed to training printout #2983
- separate between token-lengths and sub-token lengths #2990
- small speed optimizations #2975
- change output of .text to original string #2974
- remove BAD_EPOCHS printout for most schedulers #2970
- warn if resuming with too low max_epochs & 'additional_epochs' parameter #2895
- embeddings: add support for T5 encoder models #2896
- add py.typed file for PEP-561 compatibility #2858
- TARS classifier always predicts something on single label #2838
- make add_unk optional and don't use it for ner #2839
- add deprecation warning for SentenceDataset rename #2819
- more precise type hint for eval_on_train_fraction #2811
- better handling for consecutive whitespaces in Sentence #2721 (already in flair 0.11.3)
- remove unnecessary more-itertools pin #2730 (already in flair 0.11.3)
- add exclude_labels parameter to trainer.train #2724 (already in flair 0.11.3)
- add option to force token-level predictions in SequenceTagger #2750 (already in flair 0.11.3)
Build
- unify test classes to ensure that the basic functionality of all models & embeddings is tested #2981
- add missing dependency pre-commit to requirements-dev.txt #3093
- fix pre-commit bug by upgrading to isort 5.11.5 #3106 #3107
- update pytest and flake8 versions #2741
- pytest flake precommit update #2820
- pin flake8 to v4 #2892
- specify test paths #2932
- pin versions for unit tests #2994
- unit tests: Set a seed so test_train_load_use_classifier doesn't randomly fail #2834
- replace issue templates with issue forms #3051
- github actions cache #2753 (already in flair 0.11.3)
Documentation
- Add Missing Import to Tutorial 5 #2902
- Documentation pointers #2927
- readme: fix BibTeX for FLERT paper #2806 #2821
- docs: mention HIPE-2022 in corpus tutorial #2807
Code improvements
- add return types to Model and Classifier #3121
- removed undefined names #3054 #3056
- add docstrings missing for ModelTrainer.train() parameters #2961
- remove "tag_to_bioes" (Sequence) Corpus parameter, as it is not used #2812
- update hf-hub version #2837
- use transformers sentencepiece requirement #2835
- replace deprecated logging.warn with logging.warning #2829
- various mypy issues #2822 #2845 #2905
- removed some model classes that were very beta: the DependencyParser, the DistancePredictor and the SimilarityLearner. #2910
- remove legacy TransformerXLEmbeddings class #2768 (already in flair 0.11.3)
Bug fixes
- fix train error missing dev split #3115
- fix Avg Pooling in the Entity Linker #3123
- call super().__setstate__() in Embeddings #3057
- remove konoha from requirements.txt #3060
- fix label alignment if the sentence contains invalid tokens #3052
- change indexing in TARSTagger predict #3058
- fix training sample count in UD English #3044
- fix comment parsing for conllu datasets #3020
- HunFlair: Fix loading of datasets #3030 #3029
- persist needs_manual_ocr #3012
- save initial hidden states in sequence tagger #3010
- do not save Path objects to model cards #2998
- make JsonlCorpus create span labels #2863
- JsonlDataset: Fix code that claims to set "O" labels to actually set them #2817
- relationClassifier fix #2986
- fix problem in loading TARSClassifier #2987
- add missing tab for tensorboard #2922
- fast tokenizer reload fix pt.2: Bloom model #2904
- fix transformer embeddings for sentence with trailing whitespace #2891
- added label_name parameter to render_ner_html #2850
- allow BIO evaluation on sequence tagger #2787
- refactorings for initialization from state dict #2846
- save and load "tag_format" for sequence tagger model #2840
- do not remove other labels of sentence for set_label on Token and Span #2831
- fix left-over cases of token.get_tag(), which was renamed #2815
- remove wrong boolean check for loading datasets RE_ENGLISH_CONLL04 #2779
- added missing property decorator in PooledFlairEmbeddings #2744 (already in flair 0.11.3)
- fix wrong initialisations of label (where data_type was missing) #2731 (already in flair 0.11.3)
- update gdown requirement, fix download for dataset NER_MULTI_WIKIANN #2757 (already in flair 0.11.3)
- make Span detection more robust #2752 (already in flair 0.11.3)