Release 0.4.3

@alanakbik alanakbik released this 26 Aug 18:26
ff9846d

Release 0.4.3 includes a host of new features, including transformer-based embeddings (RoBERTa, XLNet, XLM, etc.), fine-tuneable FlairEmbeddings, crosslingual MUSE embeddings, new data loading/sampling methods, speed/memory optimizations, bug fixes and enhancements. It also begins a refactoring of interfaces that prepares Flair for a wider range of downstream tasks.

Embeddings

Transformer embeddings (#941 #972 #993)

Updates the old pytorch-pretrained-BERT library to the latest version of pytorch-transformers to support various new Transformer-based architectures for embeddings.

A total of 7 (new/updated) transformer-based embeddings can be used in Flair now:

from flair.embeddings import (
    BertEmbeddings,
    OpenAIGPTEmbeddings,
    OpenAIGPT2Embeddings,
    TransformerXLEmbeddings,
    XLNetEmbeddings,
    XLMEmbeddings,
    RoBERTaEmbeddings,
)

bert_embeddings = BertEmbeddings()
gpt1_embeddings = OpenAIGPTEmbeddings()
gpt2_embeddings = OpenAIGPT2Embeddings()
txl_embeddings = TransformerXLEmbeddings()
xlnet_embeddings = XLNetEmbeddings()
xlm_embeddings = XLMEmbeddings()
roberta_embeddings = RoBERTaEmbeddings()

Detailed benchmarks on the downsampled CoNLL-2003 NER dataset for English can be found in #873.

Crosslingual MUSE Embeddings (#853)

Use the new MuseCrosslingualEmbeddings class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes, the class first detects the language of the sentence to be embedded and then embeds it with the embeddings for that language. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may vary widely.

Here's how to embed:

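import torch
from flair.data import Sentence
from flair.embeddings import MuseCrosslingualEmbeddings

# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()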

# two sentences in different languages
sentence_1 = Sentence("This red shoe is new .")
sentence_2 = Sentence("Dieser rote Schuh ist neu .")

# language code is auto-detected
print(sentence_1.get_language_code())
print(sentence_2.get_language_code())

# embed sentences
embeddings.embed([sentence_1, sentence_2])

# print similarities
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token_1, token_2 in zip(sentence_1, sentence_2):
    print(f"'{token_1.text}' and '{token_2.text}' similarity: {cos(token_1.embedding, token_2.embedding)}")
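
To make the detect-then-dispatch idea concrete, here is a minimal, library-free sketch. It is not Flair's implementation: the lookup tables, the vectors and the `toy_detect_language` helper are all hypothetical stand-ins for the real language detector and the aligned MUSE vectors. It only illustrates why tokens from different languages become comparable once every language maps into one shared space.

```python
import math

# Hypothetical toy vectors: each language has its own lookup table, but all
# vectors live in the same (here 2-dimensional) space.
TOY_TABLES = {
    "en": {"red": [1.0, 0.1], "shoe": [0.1, 1.0]},
    "de": {"rot": [0.9, 0.2], "schuh": [0.2, 0.9]},
}


def toy_detect_language(tokens):
    """Stand-in for real language detection: pick the table that knows most tokens."""
    return max(
        TOY_TABLES,
        key=lambda lang: sum(t.lower() in TOY_TABLES[lang] for t in tokens),
    )


def embed(tokens):
    """Detect the language once per sentence, then look up each token in that table."""
    table = TOY_TABLES[toy_detect_language(tokens)]
    # Unknown tokens get a zero vector in this sketch.
    return [table.get(t.lower(), [0.0, 0.0]) for t in tokens]


def cosine(u, v, eps=1e-6):
    """Cosine similarity of two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


english = embed(["red", "shoe"])
german = embed(["rot", "Schuh"])
for en_vec, de_vec in zip(english, german):
    print(round(cosine(en_vec, de_vec), 3))
```

In this toy setup, translation pairs score higher than unrelated pairs, which is the property a classifier trained on one language relies on when applied to another.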

FastTextEmbeddings (#879)

Adds FastTextEmbeddings, which can handle out-of-vocabulary (OOV) words. Be warned, though, that these embeddings are huge. BytePairEmbeddings are much smaller and reportedly of similar quality, so it is probably advisable to use those instead.

Fine-tuneable FlairEmbeddings (#922)

You can now fine-tune FlairEmbeddings on downstream tasks. You can fine-tune an existing LM by simply passing the fine_tune parameter in the FlairEmbeddings constructor, like this:

embeddings = FlairEmbeddings('news-forward', fine_tune=True)

You can also use this option to task-train a wholly new language model by passing an empty LanguageModel to the FlairEmbeddings constructor and the fine_tune parameter, like this:

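from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.models import LanguageModel

# make an empty language model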
language_model = LanguageModel(
    Dictionary.load('chars'),
    is_forward_lm=True,
    hidden_size=256,
    nlayers=1)

# init FlairEmbeddings to task-train this model
embeddings = FlairEmbeddings(language_model, fine_tune=True)

Optimizations

Automatic mixed precision support (#934)

Mixed precision training can significantly speed up training. It can now be enabled by setting use_amp=True in the trainer classes. For instance, to train a language model you can do:

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=256,
              mini_batch_size=256,
              max_epochs=10,
              use_amp=True)

In our experiments, we saw a 3x speedup when training large language models, though results vary depending on your model size and experimental setup.

Control memory / speed tradeoff during training (#891 #809)

This release introduces the embeddings_storage_mode parameter to the ModelTrainer class and predict() methods. This parameter can be one of 'none', 'cpu' and 'gpu' and allows you to control the tradeoff between memory usage and speed during training:

  • If set to 'none', all embeddings are deleted after use. This has the lowest memory requirements, but embeddings need to be recomputed at each epoch of training, potentially causing a slowdown.
  • If set to 'cpu', all embeddings are moved to CPU memory after use. During training, they then only need to be moved back to the GPU for the forward pass rather than recomputed, so in many cases this is faster, but it requires CPU memory.
  • If set to 'gpu', all embeddings stay in GPU memory after computation. This eliminates memory shuffling during training, causing a speedup. However, this option requires enough GPU memory to hold all embeddings of the dataset.

To use this option during training, simply set the parameter:

from flair.trainers import ModelTrainer

# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/your/model",
    embeddings_storage_mode='gpu',
)
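
The tradeoff between the three modes can be illustrated with a small simulation. This is only a sketch under simplifying assumptions (static embeddings that do not change between epochs, one "embedding" per sentence), not how Flair manages memory internally. The function name `count_embedding_computations` and its counters are hypothetical.

```python
def count_embedding_computations(storage_mode, num_sentences=3, num_epochs=2):
    """Count recomputations and CPU-to-GPU transfers for a given storage mode."""
    if storage_mode not in ("none", "cpu", "gpu"):
        raise ValueError(f"unknown storage mode: {storage_mode}")

    stored = {}       # sentence id -> (device, embedding)
    computations = 0  # how often an embedding had to be computed
    transfers = 0     # how often a stored embedding had to be moved to the GPU

    for _ in range(num_epochs):
        for sentence_id in range(num_sentences):
            if sentence_id in stored:
                device, embedding = stored[sentence_id]
                if device == "cpu":
                    transfers += 1  # move back to the GPU for the forward pass
            else:
                embedding = [float(sentence_id)]  # stand-in for the real computation
                computations += 1

            # ... the forward/backward pass would use `embedding` here ...

            if storage_mode == "none":
                stored.pop(sentence_id, None)  # delete after use
            else:
                stored[sentence_id] = (storage_mode, embedding)

    return computations, transfers


for mode in ("none", "cpu", "gpu"):
    print(mode, count_embedding_computations(mode))
```

With two epochs over three sentences, 'none' recomputes every embedding in every epoch, 'cpu' computes each once but pays a transfer in the second epoch, and 'gpu' pays neither cost at the price of keeping everything in GPU memory.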

This release also removes the FlairEmbeddings-specific disk-caching mechanism. In the future, a more general caching mechanism applicable to all embedding types may be added as a fourth memory management option.

Speed-ups on in-memory datasets (#792)

A new DataLoader abstract base class used in Flair will speed up data loading for in-memory datasets.

Refactoring of interfaces (#891 #843)

This release also slims down the interfaces of flair.nn.Model and adds a new DataPoint interface that is currently implemented by the Token and Sentence classes. The idea is to widen the applicability of Flair to other data types and other tasks. In the future, the DataPoint interface will, for example, also be implemented by an Image object, and new downstream tasks will be added to Flair.

The release also slims down the evaluate() method in the flair.nn.Model interface to take a DataLoader instead of a group of parameters, and refactors the logging header logic. Both refactorings prepare for adding new downstream tasks to Flair in the near future.

Other features

Training Classifiers with CSV files (#826 #952 #967)

Adds the CSVClassificationCorpus so you can train classifiers directly from CSVs instead of first having to convert them to FastText format. To load a CSV, you need to pass a column_name_map (like in ColumnCorpus), which indicates which column(s) in the CSV hold the text and which hold the label(s):

corpus = CSVClassificationCorpus(
    # path to the data folder containing train / test / dev files
    data_folder='path/to/data',
    # indicates which columns are text and labels
    column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"},
    # if CSV has a header, you can skip it
    skip_header=True)
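
To show what such a column map expresses, here is a standalone sketch using only Python's csv module. It is not the CSVClassificationCorpus implementation: the `read_labeled_rows` helper and the sample data are hypothetical, and the sketch assumes that roles are recognized by their "text" and "label" prefixes.

```python
import csv
import io


def read_labeled_rows(csv_file, column_name_map, skip_header=False):
    """Yield (text, labels) pairs from a CSV using a column-index -> role map."""
    reader = csv.reader(csv_file)
    if skip_header:
        next(reader, None)  # discard the header row, if there is one
    for row in reader:
        text_parts, labels = [], []
        for index in sorted(column_name_map):
            role = column_name_map[index]
            if role.startswith("text"):
                text_parts.append(row[index])
            elif role.startswith("label"):
                labels.append(row[index])
        yield " ".join(text_parts), labels


# Made-up sample data with the text in column 4 and two label columns.
sample = io.StringIO(
    "id,topic,subtopic,date,body\n"
    "1,sports,cycling,2019-07-26,Pinot abandons the Tour\n"
)
rows = list(
    read_labeled_rows(
        sample,
        {4: "text", 1: "label_topic", 2: "label_subtopic"},
        skip_header=True,
    )
)
print(rows)
```

Columns not named in the map (here the id and the date) are simply ignored, which is why the map uses column indices rather than requiring a fixed layout.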

Data sampling (#908)

We added the first (of many) data samplers that can be passed to the ModelTrainer to influence training. The ImbalancedClassificationDatasetSampler, for instance, will upsample rare classes and downsample common classes in a classification dataset, which may help with imbalanced datasets. Call it like this:

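from flair.samplers import ImbalancedClassificationDatasetSampler
from flair.trainers import ModelTrainer

# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/folder',
    learning_rate=0.1,
    mini_batch_size=32,
    sampler=ImbalancedClassificationDatasetSampler,
)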
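
The general idea of upsampling rare classes and downsampling common ones can be sketched with inverse-frequency weights. This is an illustration of the technique, not the sampler's actual code: the helper names, the 90/10 toy dataset and the fixed seed are assumptions made for this example.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each example a weight of 1 / (number of examples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def resample(labels, num_samples, seed=0):
    """Draw example indices with replacement, proportionally to their weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Toy dataset: 90 examples of one class, 10 of another.
labels = ["common"] * 90 + ["rare"] * 10
indices = resample(labels, num_samples=1000)
print(Counter(labels[i] for i in indices))
```

Because every class ends up with the same total weight, both classes are drawn about equally often even though the underlying data is split 90/10.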

Two experimental chunk samplers (ChunkSampler and ExpandingChunkSampler) split a dataset into chunks and shuffle them. This preserves some of the original ordering while still randomizing the data.
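
A minimal sketch of chunk-wise shuffling, assuming fixed-size contiguous chunks, could look like the following. The `chunk_shuffle` function is hypothetical and only illustrates the idea. It does not reproduce the exact behavior of either sampler, and in particular not how ExpandingChunkSampler chooses its chunk sizes.

```python
import random


def chunk_shuffle(items, chunk_size, seed=None):
    """Split items into contiguous chunks, shuffle the chunks, keep order inside each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    random.Random(seed).shuffle(chunks)
    # Flatten the shuffled chunks back into a single list.
    return [item for chunk in chunks for item in chunk]


order = chunk_shuffle(list(range(10)), chunk_size=3, seed=42)
print(order)
```

Neighbouring items such as 0, 1 and 2 stay together in the result, while the position of their chunk within the epoch is randomized.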

Visualization

  • Adds HTML visualization of sequence labeling (#933). Call like this:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.visual.ner_html import render_ner_html

tagger = SequenceTagger.load('ner')

sentence = Sentence(
    "Thibaut Pinot's challenge ended on Friday due to injury, and then Julian Alaphilippe saw "
    "his lead fall away. The BBC's Hugh Schofield in Paris reflects on 34 years of hurt."
)

tagger.predict(sentence)
html = render_ner_html(sentence)

with open("sentence.html", "w") as writer:
    writer.write(html)
  • Plotter now returns images for use in iPython notebooks (#943)
  • Initial TensorBoard support (#924)
  • Add pointer to Flair Visualizer (#1014)

Additional parameterization options

  • CharacterEmbeddings now lets you specify the hidden size and the embedding size (#834)
embedding = CharacterEmbeddings(char_embedding_dim=64, hidden_size_char=64)
  • Adds configuration option for minimal learning rate stopping criterion (#871)
  • num_workers is now a parameter of LanguageModelTrainer (#962)

Bug fixes

  • Updates old pretrained models to remove old bugs / performance issues (#1017)
  • Fix error in RNN initialization in DocumentRNNEmbeddings (#793)
  • ELMoEmbeddings now use flair.device param (#825)
  • Fix download of TREC_6 dataset (#896)
  • Fix download of UD_GERMAN-HDT (#980)
  • Fix download of WikiNER_German (#1006)
  • Fix error in ColumnCorpus in which words that begin with hashtags were skipped as comments (#956)
  • Fix max_tokens_per_doc param in ClassificationCorpus (#991)
  • Simplify split rule in ColumnCorpus (#990)
  • Fix import error message for ELMoEmbeddings (#1019)
  • References to Persian language unified across embeddings (#773)
  • Updates most pre-trained models fixing quality issues / bugs (#800)
  • Clarifications in documentation (#803 #860 #868)
  • Fixes infinite loop for tokens without startpos (#1030)

Enhancements

  • Adds a learnable initial hidden state to SequenceTagger (#899)
  • Now keeps order of sentences in mini-batch when embedding (#866)
  • SequenceTagger now optionally returns a distribution of tag probabilities over all classes (#782 #949 #1016)
  • The model trainer now outputs a 'test.tsv' file that contains the predictions of the final model when training is done (#771)
  • Releases logging handler when finishing training a model (#799)
  • Fixes bad_epochs in training logs and no longer evaluates on test data at each epoch by default (#818)
  • Convenience method to remove all empty sentences from a corpus (#795)