
Add support for NoiseBench #3512

Draft · wants to merge 5 commits into master
Conversation

elenamer (Collaborator)

This PR adds support for NoiseBench, a subset of CoNLL-03 for English with 7 different label sets, including 6 different types of label noise.

Example usage (initialize the dataset with the crowdsourced version of the NER labels):

corpus = NER_NOISEBENCH(noise="crowd")

Potential issues:

To create the dataset, we need the CleanCoNLL labels as gold labels. In this implementation, this requires running an external script: https://github.com/flairNLP/CleanCoNLL/blob/main/create_cleanconll_from_conll03.sh

**corpusargs: The arguments propagated to :meth:`flair.datasets.ColumnCorpus.__init__`.
"""
if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]:
    raise Exception("Please choose a valid version")


It is recommended to raise ValueError for wrong or unexpected values passed to the argument.

Also, an error message is more helpful if it lists the valid values. Something like:

if noise not in VALID_NOISE_VALUES:
    raise ValueError(f"Unsupported value for noise argument. Got {noise}, expected one of {VALID_NOISE_VALUES}!")


self._set_path(base_path)


I would remove this function and just set the base_path here in __init__:

if base_path:
    self.base_path = Path(base_path)
else:
    self.base_path = flair.cache_root / "datasets" / "noisebench"

self.base_path = flair.cache_root / "datasets" / "noisebench" if not base_path else Path(base_path)

@staticmethod
def read_column_file(filename):


You are adding a new public method to NER_NOISEBENCH. I would avoid this. Also, you are relying on this method only in non-public methods of the class (except in generate_data_files, which I would also make non-public).

My suggestion is to rename the method to _read_column_file and add type annotations for the filename argument.
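
A minimal sketch of the renamed signature (the return type is an assumption based on how the result is used):

from pathlib import Path
from typing import List, Union

@staticmethod
def _read_column_file(filename: Union[str, Path]) -> List[List[List[str]]]:
    # body unchanged: returns sentences, each a list of per-token column lists
    ...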

return all_x

@staticmethod
def save_to_column_file(filename, list):

fkdosilovic commented on Oct 31, 2024


  1. Rename to _save_to_column_file
  2. Add type annotations to arguments
  3. Avoid using the name of built-in modules/types/etc. for argument and variable names. Here, you are using list as the name for the argument. Seems like a more appropriate name is sentences (which is of type list).
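
Taken together, a minimal sketch of the suggested signature (type annotations assumed):

@staticmethod
def _save_to_column_file(filename: Union[str, Path], sentences: List[List[List[str]]]) -> None:
    ...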


def generate_data_files(self, filename):

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
    token_indices = index_file.readlines()


You can avoid this os.path.join since base_path is of type Path; just use / to concatenate paths. Also, we should avoid relying on implicit default values; it is better to state explicitly that we want to read the file as utf-8:

with open(self.base_path / "annotations_only" / "index.txt", "r", encoding="utf-8") as fp:
    # ...


all_clean_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.train"))


Same here, use self.cleanconll_base_path / "cleanconll.train" instead of os.path.join.
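i.e., a sketch (also assuming the _read_column_file rename suggested above):

all_clean_sentences = self._read_column_file(self.cleanconll_base_path / "cleanconll.train")

The same pattern applies to the other os.path.join calls flagged below.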

)

# copy test set
all_clean_test_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.test"))


Same here, use self.cleanconll_base_path / "cleanconll.test" instead of os.path.join.

    new_s.append([token[0], token[4]])
test_sentences.append(new_s)

self.save_to_column_file(os.path.join(self.base_path, "clean.test"), test_sentences)


Same here, use self.base_path / "clean.test" instead of os.path.join.

new_s = []
for token in s:
    new_s.append([token[0], token[4]])
test_sentences.append(new_s)


This can be refactored to a more idiomatic approach with list comprehensions:

for sentence in all_clean_test_sentences:
    new_sentence = [[tokens[0], tokens[4]] for tokens in sentence]
    test_sentences.append(new_sentence)

for token, label in zip(clean_sentence, sentence):
    label[0] = token[0]  # token[0] -> text, token[1] -> BIO label
if self.SAVE_TRAINDEV_FILE:
    self.save_to_column_file(os.path.join(self.base_path, f"{corpus}.traindev"), noisy_labels)


Same here, use self.base_path / f"{corpus}.traindev" instead of os.path.join.


noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
# print(noisy_labels)
# print(token_indices)


Remove these debugging prints.

def _merge_tokens_labels(self, corpus, all_clean_sentences, token_indices):
    # generate NoiseBench dataset variants, given CleanCoNLL, noisy label files and index file

    noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))


Same here, use self.base_path / "annotations_only" / f"{corpus}.traindev" instead of os.path.join.

point = []
for line in lines:
    if "\t" in line.strip():
        stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ")


Looks like the ternary stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ") will always take the first branch, due to the enclosing condition if "\t" in line.strip() in line 5076. Refactor to:

if "\t" in line.strip():
    stripped_line = line.strip().split("\t")
else:
    stripped_line = line.strip().split(" ")

def save_to_column_file(filename, list):
    with open(filename, "w") as f:
        for sentence in list:
            for token in sentence:


Rename the iteration variable to tokens to convey that this is a container of elements and not a single token.
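
A sketch with the rename applied (the loop body is an assumption, since the hunk cuts off here):

for sentence in sentences:
    for tokens in sentence:
        # `tokens` holds all columns for one token (text, label, ...)
        f.write(" ".join(tokens) + "\n")
    f.write("\n")  # blank line between sentences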

alanakbik (Collaborator)

@fkdosilovic thanks a lot for your review, this is very helpful! Please note that this is still a draft PR and one major change is still planned.
