Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] Refactor evaluators and Abstasks #1707

Open
wants to merge 17 commits into
base: v2.0.0
Choose a base branch
from
Open

Conversation

Samoed
Copy link
Collaborator

@Samoed Samoed commented Jan 4, 2025

I've made some refactoring for the task evaluators, except for Retrieval, which still requires a significant overhaul.

Additionally, I have some suggestions for classification:

  • Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?
  • Should we hardcode split names to avoid reproduction issues, as they are also passed in the constructor?
  • Should n_experiments and k become part of the task definitions without requiring changes, to avoid reproduction issues since these are currently passed?

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Copy link
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great additions!

Should we hardcode split names to avoid reproduction issues, as they are also passed in the constructor?

Yes

Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?

Hmm not sure what you refer to here

@@ -276,6 +261,7 @@ def clustering_downsample(
dataset: DatasetDict, seed: int, max_samples_in_cluster: int = 2048
) -> DatasetDict:
"""In cases where it is not possible to convert the dataset to a fast version, we can downsample the dataset to speed up the evaluation.
Only used in ArXivHierarchicalClusteringP2P
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could probably just reupload it and remove this part then

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this function to ArXivHierarchicalClusteringP2P.v2, because ArXivHierarchicalClusteringP2P uses same dataset

mteb/abstasks/AbsTaskMultilabelClassification.py Outdated Show resolved Hide resolved
mteb/abstasks/AbsTaskReranking.py Show resolved Hide resolved
mteb/abstasks/MultilingualTask.py Show resolved Hide resolved
@@ -31,6 +32,7 @@ def __init__(
self.pairs = pair_columns
self.n = len(sentences)
self.sentences = sentences
# TODO used only by BUCC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are probably already thinking this, but let us just re-upload it


self.k = k

def __call__(self, model, test_cache=None):
def __call__(
self, model: Encoder, *, encode_kwargs: dict[str, Any] = {}, test_cache=None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is test_cache a Path | None?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is embeddings of test split

if test_cache is None:
X_test = model.encode(
self.sentences_test,
task_name=self.task_name,
**self.encode_kwargs,
)
test_cache = X_test
else:
X_test = test_cache

For each experiment there we sample N examples per label for train and use full split for testing

@Samoed
Copy link
Collaborator Author

Samoed commented Jan 8, 2025

Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?

For now in classification can be used KNN, PytorchKNN and LogReg (default) and I think we can leave only LogReg

@KennethEnevoldsen
Copy link
Contributor

I think we can leave only LogReg

Agree - Set it as a class attribute, so tasks can redefine it if needed (similar to evaluator for summeval)

Then people can just redefine the class if needed.

@Samoed
Copy link
Collaborator Author

Samoed commented Jan 9, 2025

I’ve made AbsTaskClassification the parent class for AbsTaskMultilabelClassification because they share the evaluate and _calculate_metrics_from_split functions. I considered making AbsTaskClustering the parent class for AbsTaskClusteringFast, but since they only share _calculate_metrics_from_split, I don’t think it’s as important.

@KennethEnevoldsen
Copy link
Contributor

I’ve made AbsTaskClassification the parent class for AbsTaskMultilabelClassification because they share the evaluate and _calculate_metrics_from_split functions.

Yea I have actually considered just making them into one class (AbsTaskClassification is just a simplified AbsTaskMultilabelClassification)

Copy link
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks - feel free to merge this

@Samoed
Copy link
Collaborator Author

Samoed commented Jan 11, 2025

Tested some tasks on intfloat/multilingual-e5-small

Task name main v2.0.0 PR
AmazonCounterfactualClassification 0.715075 0.729701 0.729701
EmotionClassification 0.4064 0.3868 0.3868
ToxicConversationsClassification 0.621289 0.643701 0.643701
ArxivClusteringS2S 0.33326 0.33326 0.33326
RedditClustering 0.423392 0.423392 0.423392
SprintDuplicateQuestions 0.935791 0.935791 0.935791
TwitterSemEval2015 0.728198 0.728198 0.728198
STS22 0.642763 0.642763 0.642763
STSBenchmark 0.835894 0.642763 0.835894
STS16 0.834906 0.834906 0.834906
SummEval 0.293653 0.293653 0.29219

Summarization results are slightly different; I'll look into that. Classification results seem flaky, but I think it's due to RNG.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants