Change the leaderboard to better measure OOD performance #41

Open
orionw opened this issue Sep 25, 2024 · 2 comments

Comments

orionw (Collaborator) commented Sep 25, 2024

@x-tabdeveloping is working on the new leaderboard here with awesome progress towards making it customizable (e.g. "select your own benchmark").

Along with this, a common theme I heard at SIGIR, on Twitter, and in conversations with others is the complaint that BEIR (and MTEB in general) was supposed to be zero-shot, but now most SOTA models train on all of the training sets and use the BEIR dev/test sets as validation data. This of course makes it trivial to overfit (as also shown by the MTEB Arena).

One way we could better measure OOD performance is to tag certain models as only having trained on "approved" in-domain data while the test data is purely out of domain. This could be something like "MS MARCO" training is allowed and the evaluation is done on BEIR (minus MS MARCO). The exact specifics of allowed data would need to be worked out (what about synthetic generation, NQ, etc.).

I do not work for any company that offers embeddings as an API, and for those groups I can see and understand the reasoning behind training on all available good training sets. However, I think that for good science and evaluation we should encourage a distinct split where datasets are not used for validation/training, in order to measure true OOD performance. Otherwise it's getting hard to tell which gains are actual improvements and which models are simply better at not filtering the test data out (or at overfitting to the test data by using mini-versions of the test set for validation).

I believe the MTEB leaderboard could be a driving force behind this change, if we want it to be. One option would be to make this OOD leaderboard the default, showing only models trained on approved data. We would of course still have a tab where all data is fair game.

However, as someone who doesn't work at these companies I likely have a biased perspective and would love to hear from others.
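
For illustration, here is a minimal sketch of the "approved training data" idea proposed above. The task list, the approved set, and the model metadata format are all assumptions for the sketch, not an existing MTEB API:

```python
# Hypothetical sketch: models declare which datasets they trained on, and the
# OOD tab only scores them on evaluation tasks disjoint from that declaration.
# Task names and the approved set are placeholders, not an existing MTEB API.

APPROVED_TRAINING_SETS = {"MSMARCO"}  # e.g. only MS MARCO training is allowed

BEIR_TASKS = [
    "MSMARCO", "NQ", "HotpotQA", "DBPedia", "FEVER",
    "Quora", "FiQA2018", "SciFact", "ArguAna", "TRECCOVID",
]

def ood_eval_tasks(declared_training_sets: set[str]) -> list[str]:
    """Return the tasks a model may be scored on in the OOD tab."""
    if not declared_training_sets <= APPROVED_TRAINING_SETS:
        # The model trained on non-approved data: exclude it from the OOD tab.
        return []
    return [t for t in BEIR_TASKS if t not in declared_training_sets]

# A model that only trained on MS MARCO is evaluated on BEIR minus MS MARCO:
print(ood_eval_tasks({"MSMARCO"}))
```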

bwanglzu commented Sep 25, 2024

Could not agree more. Take CMTEB as an example: I strongly suspect that not only the training sets but also the test sets are being used.

I think some basic descriptive statistics could already help: compute the average/median score per task, mark models with suspiciously high scores in the UI, and keep the flag until the authors disclose their training data and resolve the concern.
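
As a rough illustration of that idea (the scores below are made up, and the 3.5 cutoff is just a common rule of thumb), one could flag models whose score sits far above the median using a robust z-score:

```python
import statistics

# Hypothetical per-model scores on one task; only model-e is suspiciously high.
scores = {
    "model-a": 62.1, "model-b": 63.4, "model-c": 61.8,
    "model-d": 64.0, "model-e": 78.9,
}

def flag_suspicious(scores: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Flag models whose robust z-score (median/MAD based) exceeds the threshold."""
    values = list(scores.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [m for m, v in scores.items() if 0.6745 * (v - med) / mad > threshold]

print(flag_suspicious(scores))  # -> ['model-e'], flagged until the authors clarify
```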

Liuhong99 (Contributor) commented Sep 25, 2024

Agreed! For retrieval, it seems that a lot of models use the MS MARCO, NQ, HotpotQA, DBPedia, FEVER, Quora, FiQA, and SciFact training sets, as indicated in their papers or reports. For classification, especially EmotionClassification, I eyeballed the dataset and also asked GPT to label some samples. My rough estimate is that at least 20% of the test-set labels are noisy, so it's generally mysterious how models get above 80% accuracy on this task.

Screenshot of the SFR embedding blog post: [image]

Screenshot of the NV-Embed paper (I think ArguAna only has a test set): [image]
