Fix leaderboard metrics and COIR tasks #26
# Conflicts: # EXTERNAL_MODEL_RESULTS.json # all_data_tasks/0/default.jsonl # all_data_tasks/33/default.jsonl # all_data_tasks/34/default.jsonl # all_data_tasks/36/default.jsonl # all_data_tasks/37/default.jsonl # all_data_tasks/38/default.jsonl # all_data_tasks/39/default.jsonl # all_data_tasks/40/default.jsonl # all_data_tasks/41/default.jsonl # all_data_tasks/42/default.jsonl # all_data_tasks/43/default.jsonl # all_data_tasks/44/default.jsonl # boards_data/bright/data_tasks/Retrieval/default.jsonl # boards_data/en/data_tasks/Classification/default.jsonl # boards_data/ru/data_overall/default.jsonl # boards_data/ru/data_tasks/Classification/default.jsonl # boards_data/ru/data_tasks/Clustering/default.jsonl # boards_data/ru/data_tasks/Reranking/default.jsonl # boards_data/ru/data_tasks/Retrieval/default.jsonl # boards_data/ru/data_tasks/STS/default.jsonl # refresh.py
# Conflicts: # all_data_tasks/0/default.jsonl # all_data_tasks/1/default.jsonl # all_data_tasks/10/default.jsonl # all_data_tasks/11/default.jsonl # all_data_tasks/12/default.jsonl # all_data_tasks/13/default.jsonl # all_data_tasks/15/default.jsonl # all_data_tasks/16/default.jsonl # all_data_tasks/17/default.jsonl # all_data_tasks/18/default.jsonl # all_data_tasks/19/default.jsonl # all_data_tasks/2/default.jsonl # all_data_tasks/20/default.jsonl # all_data_tasks/21/default.jsonl # all_data_tasks/22/default.jsonl # all_data_tasks/23/default.jsonl # all_data_tasks/26/default.jsonl # all_data_tasks/27/default.jsonl # all_data_tasks/28/default.jsonl # all_data_tasks/29/default.jsonl # all_data_tasks/3/default.jsonl # all_data_tasks/30/default.jsonl # all_data_tasks/37/default.jsonl # all_data_tasks/38/default.jsonl # all_data_tasks/39/default.jsonl # all_data_tasks/4/default.jsonl # all_data_tasks/5/default.jsonl # all_data_tasks/6/default.jsonl # all_data_tasks/8/default.jsonl # all_data_tasks/9/default.jsonl # boards_data/da/data_tasks/Classification/default.jsonl # boards_data/en/data_overall/default.jsonl # boards_data/en/data_tasks/Classification/default.jsonl # boards_data/en/data_tasks/Clustering/default.jsonl # boards_data/en/data_tasks/PairClassification/default.jsonl # boards_data/en/data_tasks/Reranking/default.jsonl # boards_data/en/data_tasks/Retrieval/default.jsonl # boards_data/en/data_tasks/STS/default.jsonl # boards_data/en/data_tasks/Summarization/default.jsonl # boards_data/fr/data_overall/default.jsonl # boards_data/fr/data_tasks/Classification/default.jsonl # boards_data/fr/data_tasks/Clustering/default.jsonl # boards_data/fr/data_tasks/PairClassification/default.jsonl # boards_data/fr/data_tasks/Reranking/default.jsonl # boards_data/fr/data_tasks/Retrieval/default.jsonl # boards_data/fr/data_tasks/STS/default.jsonl # boards_data/fr/data_tasks/Summarization/default.jsonl # boards_data/no/data_tasks/Classification/default.jsonl # 
boards_data/other-sts/data_tasks/STS/default.jsonl # boards_data/pl/data_overall/default.jsonl # boards_data/pl/data_tasks/Classification/default.jsonl # boards_data/pl/data_tasks/Clustering/default.jsonl # boards_data/pl/data_tasks/PairClassification/default.jsonl # boards_data/pl/data_tasks/Retrieval/default.jsonl # boards_data/pl/data_tasks/STS/default.jsonl # boards_data/se/data_tasks/Classification/default.jsonl # boards_data/zh/data_overall/default.jsonl # boards_data/zh/data_tasks/Classification/default.jsonl # boards_data/zh/data_tasks/Clustering/default.jsonl # boards_data/zh/data_tasks/PairClassification/default.jsonl # boards_data/zh/data_tasks/Reranking/default.jsonl # boards_data/zh/data_tasks/Retrieval/default.jsonl # boards_data/zh/data_tasks/STS/default.jsonl
After c21efc7, the metrics for datasets are taken from |
Would love a check from @Muennighoff and @orionw, but otherwise I can't see any issues here.
@@ -20,7 +20,7 @@ tasks:
     task_description: "Clustering is the task of grouping similar documents together."
   PairClassification:
     icon: "🎭"
-    metric: ap
+    metric: max_ap
won't this cause issues with external results? (@Muennighoff I believe we have discussed this before)
I don't think so, but I would add these metrics to refresh.py for compatibility
I've changed refresh.py, but I'll leave the comment open until @Muennighoff reviews it.
I changed refresh.py, but I'll leave the comment open until @Muennighoff reviews it. I would rather keep `max_ap` in the config, because after embeddings-benchmark/mteb#1037 there is no `ap` in the model results.
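A minimal sketch of the kind of compatibility fallback discussed above, assuming a hypothetical helper `pick_score` and a hypothetical alias table; this is not the actual refresh.py code. It looks up the configured metric first and only then tries an older name, so results written before and after the rename can both be read. The scores are illustrative values, not real results.

```python
from typing import Dict, List, Optional

# Hypothetical alias table: configured metric name -> older names to try, in order.
# Which legacy names really occur in result files is an assumption here.
LEGACY_ALIASES: Dict[str, List[str]] = {
    "max_ap": ["ap"],
}


def pick_score(scores: Dict[str, float], metric: str) -> Optional[float]:
    """Return the score stored under `metric`, falling back to legacy names."""
    for name in [metric, *LEGACY_ALIASES.get(metric, [])]:
        if name in scores:
            return scores[name]
    return None  # neither the configured name nor any alias is present


# Illustrative inputs only.
new_style = {"max_ap": 0.91}
old_style = {"ap": 0.88}

print(pick_score(new_style, "max_ap"))  # 0.91
print(pick_score(old_style, "max_ap"))  # 0.88
print(pick_score({}, "max_ap"))         # None
```

Trying the configured name first means a file that contains both names is read under the new one.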
CI fix in embeddings-benchmark/results#30
LGTM with the CI fixed!
LGTM (once CI is fixed)
     metric_description: "Spearman correlation based on the model's similarity metric (usually cosine)"
     task_description: "Semantic Textual Similarity is the task of determining how similar two texts are."
   Summarization:
     icon: "📜"
-    metric: spearman
+    metric: cosine_spearman
I think we had changed this so that future models can have their own distance metrics and it does not have to be cosine - only the use of spearman would be the same across models; but since there are no such models yet I think, reverting this works for me! cc @KennethEnevoldsen
Is it better to leave it as `spearman`?
If the current code allows submitting results of models with other distance metrics, then maybe yes; @KennethEnevoldsen probably knows best?
For summarization:
"pearson"
"spearman"
"cosine_spearman"
"cosine_pearson"
"dot_spearman"
"dot_pearson"
I checked `main_score` for the summarization tasks, and they have `cosine_spearman` as `main_score`.
`spearman` will often just be cosine Spearman, but I think it is nicer to leave it up to the model developer to choose their comparison metric. I.e., I would leave it as `spearman`.
But the issue is that the metrics are filtered based on the metric specified in the config, and the results contain no metric named `spearman`. I can extend the metrics in the results file to avoid this, but I don't know if that is a good solution.
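To illustrate the mismatch described here, the following is a small sketch under stated assumptions: the row layout, the `filter_by_metric` helper, and the scores are all hypothetical and only stand in for whatever the leaderboard code actually does. The point is that an exact-name filter returns nothing when the config asks for a name the result rows never use.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def filter_by_metric(rows: List[Row], metric: str) -> List[Row]:
    """Keep only the rows whose metric name exactly equals the configured one."""
    return [row for row in rows if row["metric"] == metric]


# Hypothetical result rows for one summarization task; scores are made up.
rows: List[Row] = [
    {"task": "SummEval", "metric": "cosine_spearman", "score": 0.31},
    {"task": "SummEval", "metric": "cosine_pearson", "score": 0.30},
    {"task": "SummEval", "metric": "dot_spearman", "score": 0.29},
]

# Config says `spearman`: no row matches, so the task would show no score.
print(len(filter_by_metric(rows, "spearman")))         # 0

# Config says `cosine_spearman`: the matching row is found.
print(len(filter_by_metric(rows, "cosine_spearman")))  # 1
```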
If this is the case, I am fine with keeping it as `cosine_spearman` (we can make a change to custom similarity metrics at a later point).
# Conflicts: # all_data_tasks/0/default.jsonl # all_data_tasks/1/default.jsonl # all_data_tasks/10/default.jsonl # all_data_tasks/11/default.jsonl # all_data_tasks/12/default.jsonl # all_data_tasks/13/default.jsonl # all_data_tasks/16/default.jsonl # all_data_tasks/17/default.jsonl # all_data_tasks/18/default.jsonl # all_data_tasks/19/default.jsonl # all_data_tasks/2/default.jsonl # all_data_tasks/20/default.jsonl # all_data_tasks/21/default.jsonl # all_data_tasks/22/default.jsonl # all_data_tasks/3/default.jsonl # all_data_tasks/38/default.jsonl # all_data_tasks/39/default.jsonl # all_data_tasks/4/default.jsonl # all_data_tasks/5/default.jsonl # all_data_tasks/6/default.jsonl # all_data_tasks/8/default.jsonl # all_data_tasks/9/default.jsonl # boards_data/en/data_overall/default.jsonl # boards_data/en/data_tasks/Classification/default.jsonl # boards_data/en/data_tasks/Clustering/default.jsonl # boards_data/en/data_tasks/PairClassification/default.jsonl # boards_data/en/data_tasks/Reranking/default.jsonl # boards_data/en/data_tasks/Retrieval/default.jsonl # boards_data/en/data_tasks/STS/default.jsonl # boards_data/en/data_tasks/Summarization/default.jsonl # boards_data/fr/data_overall/default.jsonl # boards_data/fr/data_tasks/Classification/default.jsonl # boards_data/fr/data_tasks/Clustering/default.jsonl # boards_data/fr/data_tasks/PairClassification/default.jsonl # boards_data/fr/data_tasks/Reranking/default.jsonl # boards_data/fr/data_tasks/Retrieval/default.jsonl # boards_data/fr/data_tasks/STS/default.jsonl # boards_data/fr/data_tasks/Summarization/default.jsonl # boards_data/other-sts/data_tasks/STS/default.jsonl # boards_data/zh/data_overall/default.jsonl # boards_data/zh/data_tasks/Classification/default.jsonl # boards_data/zh/data_tasks/Clustering/default.jsonl # boards_data/zh/data_tasks/PairClassification/default.jsonl # boards_data/zh/data_tasks/Reranking/default.jsonl # boards_data/zh/data_tasks/Retrieval/default.jsonl # 
boards_data/zh/data_tasks/STS/default.jsonl
@KennethEnevoldsen CI is now passing
Add models from embeddings-benchmark/results#19