-
Notifications
You must be signed in to change notification settings - Fork 293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v2] Refactor evaluators and Abstasks #1707
base: v2.0.0
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great additions!
Should we hardcode split names to avoid reproduction issues, as they are also passed in the constructor?
Yes
Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?
Hmm not sure what you refer to here
@@ -276,6 +261,7 @@ def clustering_downsample( | |||
dataset: DatasetDict, seed: int, max_samples_in_cluster: int = 2048 | |||
) -> DatasetDict: | |||
"""In cases where it is not possible to convert the dataset to a fast version, we can downsample the dataset to speed up the evaluation. | |||
Only used in ArXivHierarchicalClusteringP2P |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could probably just reupload it and remove this part then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this function to ArXivHierarchicalClusteringP2P.v2
, because ArXivHierarchicalClusteringP2P
uses same dataset
@@ -31,6 +32,7 @@ def __init__( | |||
self.pairs = pair_columns | |||
self.n = len(sentences) | |||
self.sentences = sentences | |||
# TODO used only by BUCC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are probably already thinking this, but let us just re-upload it
|
||
self.k = k | ||
|
||
def __call__(self, model, test_cache=None): | ||
def __call__( | ||
self, model: Encoder, *, encode_kwargs: dict[str, Any] = {}, test_cache=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is test_cache a Path | None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is embeddings of test split
mteb/mteb/evaluation/evaluators/ClassificationEvaluator.py
Lines 145 to 153 in 8d033f3
if test_cache is None: | |
X_test = model.encode( | |
self.sentences_test, | |
task_name=self.task_name, | |
**self.encode_kwargs, | |
) | |
test_cache = X_test | |
else: | |
X_test = test_cache |
For each experiment there we sample N examples per label for train and use full split for testing
For now in classification can be used KNN, PytorchKNN and LogReg (default) and I think we can leave only LogReg |
Agree - Set it as a class attribute, so tasks can redefine it if needed (similar to evaluator for summeval) Then people can just redefine the class if needed. |
# Conflicts: # mteb/abstasks/AbsTaskClusteringFast.py # mteb/abstasks/AbsTaskSummarization.py
I’ve made |
Yea I have actually considered just making them into one class ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks - feel free to merge this
Tested some tasks on
Summarization results are slightly different; I'll look into that. Classification results seem flaky, but I think it's due to RNG. |
I've made some refactoring for the task evaluators, except for Retrieval, which still requires a significant overhaul.
Additionally, I have some suggestions for classification:
n_experiments
andk
become part of the task definitions without requiring changes, to avoid reproduction issues since these are currently passed?Checklist
make test
.make lint
.