Release 0.6.1
Release 0.6.1 is a bugfix release that fixes issues caused by moving the server that originally hosted the Flair models. Additionally, this release adds a ton of new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.
New Model: Legal NER (#1872)
Adds a legal NER model for German. The model was trained on the German legal NER dataset, which can be loaded in Flair with the LER_GERMAN corpus object.
It uses German Flair and FastText embeddings and achieves an F1 score of 96.35.
Use it like this:
from flair.data import Sentence
from flair.models import SequenceTagger

# load German LER tagger
tagger = SequenceTagger.load('de-ler')

# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."
sentence = Sentence(text)

# predict and print entities
tagger.predict(sentence)

for entity in sentence.get_spans('ner'):
    print(entity)
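If you also want the confidence of each prediction, you can print the label and score of each span. A small sketch, assuming Span objects expose .text, .tag and .score attributes as elsewhere in Flair 0.6:

# print each predicted entity together with its label and confidence
# (assumes Span objects expose .text, .tag and .score)
for entity in sentence.get_spans('ner'):
    print(entity.text, entity.tag, entity.score)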
New Datasets
Add XTREME and WikiANN corpora for multilingual NER (#1862)
These huge corpora provide training data for NER in 176 languages. You can either load the language-specific parts by supplying a language code:
from flair.datasets import XTREME

# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)

# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)
Or you can load the default 40 languages at once into one huge MultiCorpus by not providing a language ID:
# load Xtreme MultiCorpus for all
multi_corpus = XTREME()
print(multi_corpus)
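If you only need a handful of languages, the constructor should also accept a list of language codes instead of a single code. A minimal sketch, assuming the languages argument of XTREME takes a list:

from flair.datasets import XTREME

# load only the German, French and Chinese parts into one MultiCorpus
# (assumes the languages argument also accepts a list of language codes)
subset_corpus = XTREME(languages=['de', 'fr', 'zh'])
print(subset_corpus)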
Add Twitter NER Dataset (#1850)
Dataset of tweets annotated with NER tags. Load with:
from flair.datasets import TWITTER_NER

# load twitter dataset
corpus = TWITTER_NER()

# print example tweet
print(corpus.test[0])
Add German Europarl NER Dataset (#1849)
Dataset of German-language speeches in the European Parliament annotated with standard NER tags like person and location. Load with:
from flair.datasets import EUROPARL_NER_GERMAN

# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print an example test sentence
print(corpus.test[1])
Add MIT Restaurant NER Dataset (#1177)
Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:
from flair.datasets import MIT_RESTAURANTS

# load restaurant dataset
corpus = MIT_RESTAURANTS()

# print example sentence
print(corpus.test[0])
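Like the existing NER corpora, the new datasets plug directly into the usual training setup. A minimal sketch for training a tagger on the restaurant data, assuming standard GloVe word embeddings and illustrative hyperparameters:

from flair.datasets import MIT_RESTAURANTS
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load the corpus and build the tag dictionary for NER
corpus = MIT_RESTAURANTS()
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# a simple tagger over GloVe word embeddings (illustrative hyperparameters)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('glove'),
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

# train and save the model to the given folder
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/mit-restaurants', max_epochs=10)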
Add Universal Propositions Banks for French and German (#1866)
Our kickoff into supporting the Universal Proposition Banks adds the first two UP datasets to Flair. Load with:
from flair.datasets import UP_GERMAN

# load German UP
corpus = UP_GERMAN()
print(corpus)

# print example sentence
print(corpus.dev[1])
Add Universal Dependencies Dataset for Chinese (#1880)
Adds the Kyoto dataset for Chinese. Load with:
from flair.datasets import UD_CHINESE_KYOTO

# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()

# print example sentence
print(corpus.test[0])
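Since UD corpora in Flair ship with the full treebank annotation layers, you can also inspect the annotations directly. A small sketch, assuming the universal POS tags are stored under the 'upos' tag type as in other Flair UD corpora:

# print the first test sentence together with its universal POS tags
# (assumes the UD annotation layer is stored under the 'upos' tag type)
print(corpus.test[0].to_tagged_string('upos'))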
Bug fixes
- Move models to HU server (#1834 #1839 #1842)
- Fix deserialization issues in transformer tokenizers (#1865)
- Documentation fixes (#1819 #1821 #1836 #1852)
- Add link to a repo with examples of Flair on GCP (#1825)
- Correct variable names (#1875)
- Fix problem with custom delimiters in ColumnDataset (#1876); see the sketch after this list
- Fix offensive language detection model (#1877)
- Correct Dutch NER model (#1881)
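With the ColumnDataset delimiter fix, column-format corpora with a non-default separator should load cleanly again. A minimal sketch, assuming a tab-separated file, a hypothetical local data folder, and that ColumnCorpus forwards a column_delimiter parameter to ColumnDataset:

from flair.datasets import ColumnCorpus

# columns of the hypothetical files: token <TAB> NER tag
columns = {0: 'text', 1: 'ner'}

# load a tab-separated corpus from a local folder
# (folder and file names are hypothetical; assumes a column_delimiter parameter)
corpus = ColumnCorpus('resources/my_ner_data', columns,
                      train_file='train.txt',
                      dev_file='dev.txt',
                      test_file='test.txt',
                      column_delimiter='\t')
print(corpus)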