Skip to content

Commit

Permalink
Merge branch 'master' into flairNLPgh-3488/save-column-corpus-to-files
Browse files Browse the repository at this point in the history
  • Loading branch information
chelseagzr authored Jul 30, 2024
2 parents 205e46d + e17ab12 commit 757f0ca
Show file tree
Hide file tree
Showing 16 changed files with 37 additions and 80 deletions.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ document embeddings, including our proposed [Flair embeddings](https://www.aclwe
* **A PyTorch NLP framework.** Our framework builds directly on [PyTorch](https://pytorch.org/), making it easy to
train your own models and experiment with new approaches using Flair embeddings and classes.

Now at [version 0.13.1](https://github.com/flairNLP/flair/releases)!
Now at [version 0.14.0](https://github.com/flairNLP/flair/releases)!


## State-of-the-Art Models
Expand Down Expand Up @@ -127,6 +127,7 @@ In particular:
- [Tutorial 1: Basic tagging](https://flairnlp.github.io/docs/category/tutorial-1-basic-tagging) → how to tag your text
- [Tutorial 2: Training models](https://flairnlp.github.io/docs/category/tutorial-2-training-models) → how to train your own state-of-the-art NLP models
- [Tutorial 3: Embeddings](https://flairnlp.github.io/docs/category/tutorial-3-embeddings) → how to produce embeddings for words and documents
- [Tutorial 4: Biomedical text](https://flairnlp.github.io/docs/category/tutorial-4-biomedical-text) → how to analyse biomedical text data

There is also a dedicated landing page for our [biomedical NER and datasets](/resources/docs/HUNFLAIR.md) with
installation instructions and tutorials.
Expand Down
4 changes: 2 additions & 2 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,8 @@
# -- Project information -----------------------------------------------------
from sphinx_github_style import get_linkcode_resolve

version = "0.13.1"
release = "0.13.1"
version = "0.14.0"
release = "0.14.0"
project = "flair"
author = importlib_metadata.metadata(project)["Author"]
copyright = f"2023 {author}"
Expand Down
2 changes: 1 addition & 1 deletion docs/tutorial/tutorial-basics/entity-mention-linking.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ sentence = Sentence(
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

nen_tagger = EntityMentionLinker.load("disease-linker-no-ab3p")
nen_tagger = EntityMentionLinker.load("disease-linker")
nen_tagger.predict(sentence)

for tag in sentence.get_labels():
Expand Down
1 change: 0 additions & 1 deletion docs/tutorial/tutorial-basics/other-models.md
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,6 @@ We end this section with a list of all other models we currently ship with Flair
| '[frame](https://huggingface.co/flair/frame-english)' | Frame Detection | English | Propbank 3.0 | **97.54** (F1) |
| '[frame-fast](https://huggingface.co/flair/frame-english-fast)' | Frame Detection | English | Propbank 3.0 | **97.31** (F1) | (fast model)
| 'negation-speculation' | Negation / speculation |English | Bioscope | **80.2** (F1) |
| 'communicative-functions' | detecting function of sentence in research paper (BETA) | English| scholarly papers | |
| 'de-historic-indirect' | historical indirect speech | German | @redewiedergabe project | **87.94** (F1) | [redewiedergabe](https://github.com/redewiedergabe/tagger) | |
| 'de-historic-direct' | historical direct speech | German | @redewiedergabe project | **87.94** (F1) | [redewiedergabe](https://github.com/redewiedergabe/tagger) | |
| 'de-historic-reported' | historical reported speech | German | @redewiedergabe project | **87.94** (F1) | [redewiedergabe](https://github.com/redewiedergabe/tagger) | |
Expand Down
2 changes: 1 addition & 1 deletion docs/tutorial/tutorial-basics/part-of-speech-tagging.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ tagger.predict(sentence)
print(sentence)
```

## Tagging universal parts-of-speech (uPoS)​
## Tagging parts-of-speech in any language

Universal parts-of-speech are a set of minimal syntactic units that exist across languages. For instance, most languages
will have VERBs or NOUNs.
Expand Down
4 changes: 2 additions & 2 deletions docs/tutorial/tutorial-basics/tagging-entities.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

This tutorials shows you how to do named entity recognition, showcases various NER models, and provides a full list of all NER models in Flair.

## Tagging entities with our standard model
## Tagging entities with our standard model

Our standard model uses Flair embeddings and was trained over the English CoNLL-03 task and can recognize 4 different entity types. It offers a good tradeoff between accuracy and speed.

Expand Down Expand Up @@ -32,7 +32,7 @@ Sentence: "George Washington went to Washington ." → ["George Washington"/PER,

The printout tells us that two entities are labeled in this sentence: "George Washington" as PER (person) and "Washington" as LOC (location).

## Tagging entities with our best model
## Tagging entities with our best model

Our best 4-class model is trained using a very large transformer. Use it if accuracy is the most important to you, and speed/memory not so much.

Expand Down
2 changes: 1 addition & 1 deletion docs/tutorial/tutorial-basics/tagging-sentiment.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

This tutorials shows you how to do sentiment analysis in Flair.

## Tagging sentiment with our standard model
## Tagging sentiment with our standard model

Our standard sentiment analysis model uses distilBERT embeddings and was trained over a mix of corpora, notably
the Amazon review corpus, and can thus handle a variety of domains and language.
Expand Down
2 changes: 1 addition & 1 deletion flair/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@
device = torch.device("cpu")

# global variable: version
__version__ = "0.13.1"
__version__ = "0.14.0"
"""The current version of the flair library installed."""

# global variable: arrow symbol
Expand Down
15 changes: 8 additions & 7 deletions flair/datasets/treebanks.py
Original file line number Diff line number Diff line change
Expand Up @@ -122,11 +122,12 @@ def __getitem__(self, index: int = 0) -> Sentence:
else:
with open(str(self.path_to_conll_file), encoding="utf-8") as file:
file.seek(self.indices[index])
sentence = self._read_next_sentence(file)
sentence_or_none = self._read_next_sentence(file)
sentence = sentence_or_none if isinstance(sentence_or_none, Sentence) else Sentence("")

return sentence

def _read_next_sentence(self, file) -> Sentence:
def _read_next_sentence(self, file) -> Optional[Sentence]:
line = file.readline()
tokens: List[Token] = []

Expand All @@ -139,13 +140,15 @@ def _read_next_sentence(self, file) -> Sentence:
current_multiword_first_token = 0
current_multiword_last_token = 0

newline_reached = False
while line:
line = line.strip()
fields: List[str] = re.split("\t+", line)

# end of sentence
if line == "":
if len(tokens) > 0:
newline_reached = True
break

# comments or ellipsis
Expand Down Expand Up @@ -205,20 +208,18 @@ def _read_next_sentence(self, file) -> Sentence:
if token_idx <= current_multiword_last_token:
current_multiword_sequence += token.text

# print(token)
# print(current_multiword_last_token)
# print(current_multiword_first_token)
# if multi-word equals component tokens, there should be no whitespace
if token_idx == current_multiword_last_token and current_multiword_sequence == current_multiword_text:
# go through all tokens in subword and set whitespace_after information
for i in range(current_multiword_last_token - current_multiword_first_token):
# print(i)
tokens[-(i + 1)].whitespace_after = 0
tokens.append(token)

line = file.readline()

return Sentence(tokens)
if newline_reached or len(tokens) > 0:
return Sentence(tokens)
return None


class UD_ENGLISH(UniversalDependenciesCorpus):
Expand Down
4 changes: 2 additions & 2 deletions flair/models/language_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -189,7 +189,7 @@ def initialize(matrix):

@classmethod
def load_language_model(cls, model_file: Union[Path, str], has_decoder=True):
state = torch.load(str(model_file), map_location=flair.device)
state = torch.load(str(model_file), map_location=flair.device, weights_only=False)

document_delimiter = state.get("document_delimiter", "\n")
has_decoder = state.get("has_decoder", True) and has_decoder
Expand All @@ -213,7 +213,7 @@ def load_language_model(cls, model_file: Union[Path, str], has_decoder=True):

@classmethod
def load_checkpoint(cls, model_file: Union[Path, str]):
state = torch.load(str(model_file), map_location=flair.device)
state = torch.load(str(model_file), map_location=flair.device, weights_only=False)

epoch = state.get("epoch")
split = state.get("split")
Expand Down
1 change: 0 additions & 1 deletion flair/models/multitask_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -250,7 +250,6 @@ def _fetch_model(model_name) -> str:
hu_path: str = "https://nlp.informatik.hu-berlin.de/resources/models"

# biomedical models
model_map["bioner"] = "/".join([hu_path, "bioner", "hunflair.pt"])
model_map["hunflair"] = "/".join([hu_path, "bioner", "hunflair.pt"])
model_map["hunflair-paper"] = "/".join([hu_path, "bioner", "hunflair-paper.pt"])

Expand Down
2 changes: 1 addition & 1 deletion flair/models/prefixed_tagger.py
Original file line number Diff line number Diff line change
Expand Up @@ -321,7 +321,7 @@ def augment_sentences(

@staticmethod
def _fetch_model(model_name) -> str:
huggingface_model_map = {"hunflair2": "hunflair/hunflair2-ner"}
huggingface_model_map = {"hunflair2": "hunflair/hunflair2-ner", "bioner": "hunflair/hunflair2-ner"}

# check if model name is a valid local file
if Path(model_name).exists():
Expand Down
64 changes: 8 additions & 56 deletions flair/models/sequence_tagger_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
from flair.data import Dictionary, Label, Sentence, Span, get_spans_from_bio
from flair.datasets import DataLoader, FlairDatapointDataset
from flair.embeddings import TokenEmbeddings
from flair.file_utils import cached_path, hf_download, unzip_file
from flair.file_utils import cached_path, hf_download
from flair.models.sequence_tagger_utils.crf import CRF
from flair.models.sequence_tagger_utils.viterbi import ViterbiDecoder, ViterbiLoss
from flair.training_utils import store_embeddings
Expand Down Expand Up @@ -573,9 +573,9 @@ def _standard_inference(self, features: torch.Tensor, batch: List[Sentence], pro

return predictions, all_tags

def _all_scores_for_token(self, sentences: List[Sentence], scores: torch.Tensor, lengths: List[int]):
def _all_scores_for_token(self, sentences: List[Sentence], score_tensor: torch.Tensor, lengths: List[int]):
"""Returns all scores for each tag in tag dictionary."""
scores = scores.numpy()
scores = score_tensor.numpy()
tokens = [token for sentence in sentences for token in sentence]
prob_all_tags = [
[
Expand Down Expand Up @@ -686,6 +686,11 @@ def _fetch_model(model_name) -> str:
"ner-ukrainian": "dchaplinsky/flair-uk-ner",
# Language-specific POS models
"pos-ukrainian": "dchaplinsky/flair-uk-pos",
# Historic German
"de-historic-direct": "aehrm/redewiedergabe-direct",
"de-historic-indirect": "aehrm/redewiedergabe-indirect",
"de-historic-reported": "aehrm/redewiedergabe-reported",
"de-historic-free-indirect": "aehrm/redewiedergabe-freeindirect",
}

hu_path: str = "https://nlp.informatik.hu-berlin.de/resources/models"
Expand Down Expand Up @@ -757,59 +762,6 @@ def _fetch_model(model_name) -> str:
"information."
)

# special handling for the taggers by the @redewiegergabe project (TODO: move to model hub)
elif model_name == "de-historic-indirect":
model_file = flair.cache_root / cache_dir / "indirect" / "final-model.pt"
if not model_file.exists():
cached_path(
"http://www.redewiedergabe.de/models/indirect.zip",
cache_dir=cache_dir,
)
unzip_file(
flair.cache_root / cache_dir / "indirect.zip",
flair.cache_root / cache_dir,
)
model_path = str(flair.cache_root / cache_dir / "indirect" / "final-model.pt")

elif model_name == "de-historic-direct":
model_file = flair.cache_root / cache_dir / "direct" / "final-model.pt"
if not model_file.exists():
cached_path(
"http://www.redewiedergabe.de/models/direct.zip",
cache_dir=cache_dir,
)
unzip_file(
flair.cache_root / cache_dir / "direct.zip",
flair.cache_root / cache_dir,
)
model_path = str(flair.cache_root / cache_dir / "direct" / "final-model.pt")

elif model_name == "de-historic-reported":
model_file = flair.cache_root / cache_dir / "reported" / "final-model.pt"
if not model_file.exists():
cached_path(
"http://www.redewiedergabe.de/models/reported.zip",
cache_dir=cache_dir,
)
unzip_file(
flair.cache_root / cache_dir / "reported.zip",
flair.cache_root / cache_dir,
)
model_path = str(flair.cache_root / cache_dir / "reported" / "final-model.pt")

elif model_name == "de-historic-free-indirect":
model_file = flair.cache_root / cache_dir / "freeIndirect" / "final-model.pt"
if not model_file.exists():
cached_path(
"http://www.redewiedergabe.de/models/freeIndirect.zip",
cache_dir=cache_dir,
)
unzip_file(
flair.cache_root / cache_dir / "freeIndirect.zip",
flair.cache_root / cache_dir,
)
model_path = str(flair.cache_root / cache_dir / "freeIndirect" / "final-model.pt")

# for all other cases (not local file or special download location), use HF model hub
else:
model_path = hf_download(model_name)
Expand Down
8 changes: 6 additions & 2 deletions flair/models/sequence_tagger_utils/viterbi.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,10 +226,14 @@ def decode(
return tags, all_tags

def _all_scores_for_token(
self, scores: torch.Tensor, tag_sequences: torch.Tensor, lengths: torch.IntTensor, sentences: List[Sentence]
self,
score_tensor: torch.Tensor,
tag_sequences: torch.Tensor,
lengths: torch.IntTensor,
sentences: List[Sentence],
):
"""Returns all scores for each tag in tag dictionary."""
scores = scores.numpy()
scores = score_tensor.numpy()
for i_batch, (batch, tag_seq) in enumerate(zip(scores, tag_sequences)):
for i, (tag_id, tag_scores) in enumerate(zip(tag_seq, batch)):
tag_id_int = tag_id if isinstance(tag_id, int) else int(tag_id.item())
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ filterwarnings = [
'ignore:Deprecated call to `pkg_resources', # huggingface has deprecated calls.
'ignore:distutils Version classes are deprecated.', # faiss uses deprecated distutils.
'ignore:`resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.', # transformers calls deprecated hf_hub
"ignore:`torch.cuda.amp.GradScaler", # GradScaler changes in torch 2.3.0 but we want to be backwards compatible.
]
markers = [
"integration",
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

setup(
name="flair",
version="0.13.1",
version="0.14.0",
description="A very simple framework for state-of-the-art NLP",
long_description=Path("README.md").read_text(encoding="utf-8"),
long_description_content_type="text/markdown",
Expand Down

0 comments on commit 757f0ca

Please sign in to comment.