Heavily improve automatic model card generation + Patch XLM-R (#28)
* Uncomment pushing to the Hub

* Initial version to improve automatic model card generation

* Simplify label normalization

* Automatically select some eval sentences for the widget

* Improve language card

* Add automatic evaluation results

* Use dash instead of underscore in model name

* Add extra TODOs

* model.predict text as the first example

* Automatically set model name based on encoder & dataset

* Remove accidental Dataset import

* Rename examples to widget examples

* Add table with label examples

Also use fields instead of __dict__

* Ensure complete metadata

* Add tokenizer warning if punct must be split from words

* Remove dead code

* Rename poor variable names

* Fix incorrect warning

* Add " in the model labels

* Set model_id based on args if possible

* Add training set metrics

* Randomly select 100 samples for the widget examples

Instead of taking the first 100

* Prevent duplicate widget examples

* Remove completed TODO

* Use title case throughout model card

* Add useful comments if values not provided

Also prevent crash if dataset_id is not provided

* Add environmental impact with codecarbon

* Ensure that the model card template is included in the install

* Add training hardware section

* Add Python version

* Make everything title case

* Add missing docstring

* Add docstring for SpanMarkerModelCardData

* Update CHANGELOG

* Add SpanMarkerModelCardData to dunder init

* Add SpanMarkerModelCardData to snippets

* Resolve breaking error if hub_model_id is set

* gpu_model -> hardware_used

To better match what HF expects

* Add "base_model" to metadata

* Increment datasets min version to 2.14.0

Required for sorting on multiple columns at once

* Update trainer evaluate tests

* Skip old model card test for now

* Fix edge case: less than 5 examples

* pytest.skip -> pytest.mark.skip

* Try to infer the language from the dataset

* Add citations and hidden sections

* Refactor inferring language

* Remove unused import

* Add comment explaining version

* Override default Trainer create_model_card

* Update model card template slightly

* Add newline to model card template

* Remove incorrect space

* Add model card tests

* Improve Trainer tests regarding model card

* Remove commented out breakpoint

* Add codecarbon to CI

* Rename integration extra to codecarbon

* Make hardware_used optional (if no GPU present)

* Apply suggestions to model_card_template

Co-authored-by: Daniel van Strien <[email protected]>

* Update model card test pattern alongside template changes

* Don't include hardware_used when no GPU present

* Set "No GPU used" for GPU Model if hardware_used is None

* Don't store None in yaml

* Ensure that emissions is a regular float

* kgs to g

* support e-05 notation

* Add small test case for model cards

* Update model tables in docs

* Link to the spaCy integration in the tokenizer warning

* Update README snippet

* Update outdated docs: entity_max_length default is 8

* Remove /models from URL, caused 404s

* Fix outdated type hint

* 🎉 Apply XLM-R patch

* Remove /models from test

* Remove tokenizer warning after patch

* Update training docs with model card data etc.

* Pad token embeddings to multiple of 8

Removes a warning since transformers 4.32.0

* Always attach list directly to header

* Tackle edge case where dataset card has no metadata

* Allow installing nltk for detokenizing model card examples

* Add model card docs

* Mention codecarbon install in docstring

* overwrite the default codecarbon log level to "error"

* Update CHANGELOG

* Fix issue with inference example containing full quotes

* Update CHANGELOG

* Never print a model when printing SpanMarkerModelCardData

* Try to infer the dataset_id from the training set

Thanks @cakiki

* Update the main docs landing page

---------

Co-authored-by: Daniel van Strien <[email protected]>
tomaarsen and davanstrien authored Sep 29, 2023
1 parent 506c25b commit 509d5f4
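
Several bullets in the commit message describe how widget examples are chosen: a random sample of 100 rather than the first 100, no duplicates, and no failure when fewer than 5 are available. The sketch below shows one way to get that behavior. It is not the code from this commit: `select_widget_examples`, its parameters and the fixed seed are hypothetical names chosen for illustration, and it assumes the goal is to keep up to 5 distinct sentences from a random pool of up to 100.

```python
import random
from typing import List, Sequence


def select_widget_examples(
    sentences: Sequence[str],
    pool_size: int = 100,
    num_examples: int = 5,
    seed: int = 0,
) -> List[str]:
    """Pick up to `num_examples` distinct sentences from a random pool."""
    rng = random.Random(seed)
    # Drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(sentences))
    # Sample a random pool instead of taking the first `pool_size` items;
    # min() keeps this safe when fewer sentences are available.
    pool = rng.sample(unique, k=min(pool_size, len(unique)))
    # Slicing never raises, so short pools simply yield fewer examples.
    return pool[:num_examples]
```

With an empty or very small input the function returns fewer than 5 examples instead of raising, which is the edge case the commit message mentions.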
Showing 29 changed files with 1,777 additions and 626 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -38,7 +38,7 @@ jobs:
       - name: Install external dependencies on cache miss
         run: |
           python -m pip install --no-cache-dir --upgrade pip
-          python -m pip install --no-cache-dir ".[dev]"
+          python -m pip install --no-cache-dir ".[dev, codecarbon]"
           python -m spacy download en_core_web_sm
         if: steps.restore-cache.outputs.cache-hit != 'true'
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -19,8 +19,25 @@ Types of changes

### Added

- Added `SpanMarkerModel.generate_model_card()` method to get a model card string.
- Added `SpanMarkerModelCardData` that should be passed to `SpanMarkerModel.from_pretrained` with additional information like
  - `language`, `license`, `model_name`, `model_id`, `encoder_name`, `encoder_id`, `dataset_name`, `dataset_id`, `dataset_revision`.
- Added `transformers` `pipeline` support, e.g. `pipeline(task="span-marker", model="tomaarsen/span-marker-mbert-base-multinerd")`.

### Changed

- Heavily improved automatic model card generated.
- Evaluating outside of training now returns per-label outputs instead of only "overall" F1, precision and recall.
- Warn if the used tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.
  - If so, then inference of that model will require the punctuation to be split from the words.
- Improve label normalization speed.
- Allow you to call SpanMarkerModel.from_pretrained with a pre-initialized SpanMarkerConfig.

### Fixed

- Fixed tokenization mismatch between training and inference for XLM-RoBERTa models: allows for normal inference of those models.
- Resolve niche bug when TrainingArguments are not provided.

## [1.3.0]

### Added
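
The CHANGELOG entry above lists the fields that `SpanMarkerModelCardData` accepts, and the commit message notes two related details: "use fields instead of __dict__" and "Don't store None in yaml". The sketch below illustrates that pattern on a small stand-in dataclass. `CardDataSketch` and `to_metadata` are hypothetical names, only a subset of the listed fields is included, and this is not the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class CardDataSketch:
    """Hypothetical stand-in, not the real SpanMarkerModelCardData."""

    language: Optional[str] = None
    license: Optional[str] = None
    model_id: Optional[str] = None
    encoder_id: Optional[str] = None
    dataset_id: Optional[str] = None


def to_metadata(card: CardDataSketch) -> Dict[str, Any]:
    # Iterate over the declared dataclass fields rather than __dict__, and
    # skip unset values so that no `None` entries reach the YAML header.
    return {
        f.name: getattr(card, f.name)
        for f in fields(card)
        if getattr(card, f.name) is not None
    }
```

Iterating over `fields()` keeps attributes that were attached to the instance later out of the metadata, which `__dict__` would not.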
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
+include span_marker/model_card_template.md
30 changes: 22 additions & 8 deletions README.md
@@ -44,32 +44,47 @@ Please have a look at our [Getting Started](notebooks/getting_started.ipynb) not
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) | [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) |

```python
+from pathlib import Path
 from datasets import load_dataset
 from transformers import TrainingArguments
-from span_marker import SpanMarkerModel, Trainer
+from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData


 def main() -> None:
     # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
-    dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
+    dataset_id = "DFKI-SLT/few-nerd"
+    dataset_name = "FewNERD"
+    dataset = load_dataset(dataset_id, "supervised")
     dataset = dataset.remove_columns("ner_tags")
     dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
     labels = dataset["train"].features["ner_tags"].feature.names
     # ['O', 'art-broadcastprogram', 'art-film', 'art-music', 'art-other', ...

     # Initialize a SpanMarker model using a pretrained BERT-style encoder
-    model_name = "bert-base-cased"
+    encoder_id = "bert-base-cased"
+    model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"
     model = SpanMarkerModel.from_pretrained(
-        model_name,
+        encoder_id,
         labels=labels,
         # SpanMarker hyperparameters:
         model_max_length=256,
         marker_max_length=128,
         entity_max_length=8,
+        # Model card arguments
+        model_card_data=SpanMarkerModelCardData(
+            model_id=model_id,
+            encoder_id=encoder_id,
+            dataset_name=dataset_name,
+            dataset_id=dataset_id,
+            license="cc-by-sa-4.0",
+            language="en",
+        ),
     )

     # Prepare the 🤗 transformers training arguments
+    output_dir = Path("models") / model_id
     args = TrainingArguments(
-        output_dir="models/span_marker_bert_base_cased_fewnerd_fine_super",
+        output_dir=output_dir,
         # Training Hyperparameters:
         learning_rate=5e-5,
         per_device_train_batch_size=32,
@@ -96,12 +111,13 @@ def main() -> None:
         eval_dataset=dataset["validation"],
     )
     trainer.train()
-    trainer.save_model("models/span_marker_bert_base_cased_fewnerd_fine_super/checkpoint-final")

     # Compute & save the metrics on the test set
     metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
     trainer.save_metrics("test", metrics)

+    # Save the final checkpoint
+    trainer.save_model(output_dir / "checkpoint-final")

 if __name__ == "__main__":
     main()
@@ -121,8 +137,6 @@ entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B
{'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
```

-<!-- Because this work is based on [PL-Marker](https://arxiv.org/pdf/2109.06067v5.pdf), you may expect similar results to its [Papers with Code Leaderboard](https://paperswithcode.com/paper/pack-together-entity-and-relation-extraction) results. -->
-
## Pretrained Models

All models in this list contain `train.py` files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the [training_scripts](training_scripts) directory.
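
The README change in this diff derives `output_dir` from `model_id` with `pathlib`, so the final checkpoint path no longer repeats a hard-coded string. A minimal sketch of just that path arithmetic, using the values shown in the diff (nothing here touches the file system or the `span_marker` API):

```python
from pathlib import Path

encoder_id = "bert-base-cased"
model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"

# Because model_id contains a "/", joining it yields a nested directory
# (models/tomaarsen/...) rather than a single folder name.
output_dir = Path("models") / model_id
final_checkpoint = output_dir / "checkpoint-final"

print(final_checkpoint.as_posix())
# models/tomaarsen/span-marker-bert-base-cased-fewnerd-fine-super/checkpoint-final
```

One consequence worth noting when reviewing: the output directory now mirrors the Hub-style `user/model` identifier, whereas the removed lines used a flat, underscore-separated folder name.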
17 changes: 17 additions & 0 deletions docs/api/span_marker.model_card.rst
@@ -0,0 +1,17 @@
+
+:autogenerated:
+
+..
+   This file is autogenerated by `sphinx-api`.
+
+span_marker.model_card module
+=============================
+
+.. currentmodule:: span_marker.model_card
+
+.. automodule:: span_marker.model_card
+   :members:
+   :exclude-members: hyperparameters, eval_results_dict, eval_lines_list, metric_lines, widget, predict_example, label_example_list, tokenizer_warning, train_set_metrics_list, code_carbon_callback, pipeline_tag, library_name, version, metrics, model, set_widget_examples, set_train_set_metrics, set_label_examples, register_model, is_on_huggingface, generate_model_card
+   :undoc-members:
+   :show-inheritance:
+   :member-order: bysource
1 change: 1 addition & 0 deletions docs/api/span_marker.rst
@@ -19,6 +19,7 @@ span_marker package
    span_marker.modeling
    span_marker.trainer
    span_marker.configuration
+   span_marker.model_card
    span_marker.pipeline_component
    span_marker.data_collator
    span_marker.tokenizer