
Add support for NoiseBench #3512

Draft · wants to merge 5 commits into master
Conversation

elenamer (Collaborator)

This PR adds support for NoiseBench, a subset of CoNLL-03 for English with 7 different label sets, including 6 different types of label noise.

Example usage (initialize the dataset with the crowdsourced version of the NER labels):

corpus = NER_NOISEBENCH(noise="crowd")

Potential issues:

To create the dataset, we need the CleanCoNLL labels as gold labels. In this implementation, this requires running an external script: https://github.com/flairNLP/CleanCoNLL/blob/main/create_cleanconll_from_conll03.sh

**corpusargs: The arguments propagated to :meth:`flair.datasets.ColumnCorpus.__init__`.
"""
if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]:
    raise Exception("Please choose a valid version")


It is recommended to raise ValueError for wrong or unexpected values passed to the argument.

Also, an error message is more helpful if it lists the valid values. Something like:

if noise not in VALID_NOISE_VALUES:
    raise ValueError(f"Unsupported value for noise argument. Got {noise}, expected one of {VALID_NOISE_VALUES}!")


self._set_path(base_path)


I would remove this function and just set the base_path here in __init__:

if base_path:
    self.base_path = Path(base_path)
else:
    self.base_path = flair.cache_root / "datasets" / "noisebench"

self.base_path = flair.cache_root / "datasets" / "noisebench" if not base_path else Path(base_path)

@staticmethod
def read_column_file(filename):


You are adding a new public method to NER_NOISEBENCH. I would avoid this. Also, you are relying on this method only in non-public methods of the class (except in generate_data_files, which I would also make non-public).

My suggestion is to rename the method to _read_column_file and add type annotations for the filename argument.
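
A minimal sketch of the renamed signature (the return type is an assumption based on how the result is used):

from pathlib import Path
from typing import List, Union

@staticmethod
def _read_column_file(filename: Union[str, Path]) -> List[List[List[str]]]:
    # body unchanged: returns sentences, each a list of per-token column lists
    ...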

return all_x

@staticmethod
def save_to_column_file(filename, list):

fkdosilovic commented on Oct 31, 2024


  1. Rename to _save_to_column_file
  2. Add type annotations to arguments
  3. Avoid using the name of built-in modules/types/etc. for argument and variable names. Here, you are using list as the name for the argument. Seems like a more appropriate name is sentences (which is of type list).
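
Taken together, a minimal sketch of the suggested signature (type annotations assumed):

@staticmethod
def _save_to_column_file(filename: Union[str, Path], sentences: List[List[List[str]]]) -> None:
    ...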


def generate_data_files(self, filename):

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
    token_indices = index_file.readlines()


You can avoid this os.path.join since base_path is of type Path; just use / to concatenate paths. Also, we should avoid relying on implicit default values; it is better to state explicitly that we want to read the file as utf-8:

with open(self.base_path / "annotations_only" / "index.txt", "r", encoding="utf-8") as fp:
    # ...


all_clean_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.train"))


Same here, use self.cleanconll_base_path / "cleanconll.train" instead of os.path.join.
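i.e., a sketch (also assuming the _read_column_file rename suggested above):

all_clean_sentences = self._read_column_file(self.cleanconll_base_path / "cleanconll.train")

The same pattern applies to the other os.path.join calls flagged below.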

)

# copy test set
all_clean_test_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.test"))


Same here, use self.cleanconll_base_path / "cleanconll.test" instead of os.path.join.

    new_s.append([token[0], token[4]])
test_sentences.append(new_s)

self.save_to_column_file(os.path.join(self.base_path, "clean.test"), test_sentences)


Same here, use self.base_path / "clean.test" instead of os.path.join.

new_s = []
for token in s:
    new_s.append([token[0], token[4]])
test_sentences.append(new_s)


This can be refactored to a more idiomatic approach with list comprehensions:

for sentence in all_clean_test_sentences:
    new_sentence = [[tokens[0], tokens[4]] for tokens in sentence]
    test_sentences.append(new_sentence)

for token, label in zip(clean_sentence, sentence):
    label[0] = token[0]  # token[0] -> text, token[1] -> BIO label
if self.SAVE_TRAINDEV_FILE:
    self.save_to_column_file(os.path.join(self.base_path, f"{corpus}.traindev"), noisy_labels)


Same here, use self.base_path / f"{corpus}.traindev" instead of os.path.join.


noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
# print(noisy_labels)
# print(token_indices)


Remove these debugging prints.

def _merge_tokens_labels(self, corpus, all_clean_sentences, token_indices):
    # generate NoiseBench dataset variants, given CleanCoNLL, noisy label files and index file

    noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))


Same here, use self.base_path / "annotations_only" / f"{corpus}.traindev" instead of os.path.join.

point = []
for line in lines:
    if "\t" in line.strip():
        stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ")


Looks like the ternary stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ") will always take the first branch, due to the enclosing condition if "\t" in line.strip() in line 5076. Refactor to:

if "\t" in line.strip():
    stripped_line = line.strip().split("\t")
else:
    stripped_line = line.strip().split(" ")

def save_to_column_file(filename, list):
    with open(filename, "w") as f:
        for sentence in list:
            for token in sentence:


Rename the iteration variable to tokens to convey that this is a container of elements and not a single token.
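
A sketch with the rename applied (the loop body is an assumption, since the hunk cuts off here):

for sentence in sentences:
    for tokens in sentence:
        # `tokens` holds all columns for one token (text, label, ...)
        f.write(" ".join(tokens) + "\n")
    f.write("\n")  # blank line between sentences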

alanakbik (Collaborator)

@fkdosilovic thanks a lot for your review, this is very helpful! Please note that this is still a draft PR and one major change is still planned.
