
Leaderboard: stella results don't match between new and old leaderboard #1753

Open · Muennighoff opened this issue Jan 10, 2025 · 4 comments

@Muennighoff (Contributor)
See #1571 (comment); maybe a bug in our implementation caused the difference when rerunning?

@Muennighoff (Contributor, Author)

It's especially weird that in our runs the 400M is better than the 1.5B.

@Samoed (Collaborator) commented Jan 10, 2025

Maybe we should retest it with the same wrapper as jasper, because right now it passes the instruction to passages too, which shouldn't happen.
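The intended behavior described above (the retrieval instruction prepended to queries only, never to passages) can be sketched as follows. This is a minimal illustration, not MTEB's actual wrapper code, and the prompt text is an assumed example:

```python
# Sketch: the retrieval instruction belongs on the query side only.
# Passing it to passages too (the suspected bug) would change document
# embeddings and hence the scores. The prompt string is illustrative.
INSTRUCTION = "Instruct: Given a web search query, retrieve relevant passages.\nQuery: "

def build_inputs(queries, passages, instruction=INSTRUCTION):
    """Prefix the instruction to queries; leave passages untouched."""
    query_inputs = [instruction + q for q in queries]
    passage_inputs = list(passages)  # no instruction here
    return query_inputs, passage_inputs

q_in, p_in = build_inputs(["what is mteb?"], ["MTEB is a benchmark..."])
```

The resulting strings would then be fed to the model's `encode` step; only `q_in` carries the instruction.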

@Samoed (Collaborator) commented Jan 11, 2025

Model: dunzhang/stella_en_1.5B_v5

| Task Name | Old Leaderboard | New Leaderboard |
|---|---|---|
| AmazonCounterfactualClassification | 92.87 | 92.87 |
| AmazonPolarityClassification | 97.16 | 97.16 |
| AmazonReviewsClassification | 59.36 | 59.36 |
| ArguAna | 65.27 | 65.27 |
| ArxivClusteringP2P | 55.44 | 55.44 |
| ArxivClusteringS2S | 50.66 | 50.66 |
| AskUbuntuDupQuestions | 67.33 | 67.33 |
| BIOSSES | 83.11 | 83.11 |
| Banking77Classification | 89.79 | 89.79 |
| BiorxivClusteringP2P | 50.68 | 50.68 |
| BiorxivClusteringS2S | 46.87 | 46.87 |
| CQADupstackAndroidRetrieval | N/A | 40.98 |
| CQADupstackEnglishRetrieval | N/A | 41.89 |
| CQADupstackGamingRetrieval | N/A | 53.59 |
| CQADupstackGisRetrieval | N/A | 25.99 |
| CQADupstackMathematicaRetrieval | N/A | 22.08 |
| CQADupstackPhysicsRetrieval | N/A | 47.59 |
| CQADupstackProgrammersRetrieval | N/A | 30.54 |
| CQADupstackStatsRetrieval | N/A | 32.78 |
| CQADupstackTexRetrieval | N/A | 16.91 |
| CQADupstackUnixRetrieval | N/A | 29.03 |
| CQADupstackWebmastersRetrieval | N/A | 30.72 |
| CQADupstackWordpressRetrieval | N/A | 18.42 |
| ClimateFEVER | 46.11 | 46.11 |
| DBPedia | 52.28 | 52.28 |
| EmotionClassification | 84.29 | 84.29 |
| FEVER | 94.83 | 94.83 |
| FiQA2018 | 60.48 | 60.48 |
| HotpotQA | 76.67 | 76.67 |
| ImdbClassification | 96.66 | 96.66 |
| MTOPDomainClassification | 99.01 | 99.01 |
| MTOPIntentClassification | 92.78 | 92.78 |
| MassiveIntentClassification | 85.83 | 85.83 |
| MassiveScenarioClassification | 90.2 | 90.2 |
| MedrxivClusteringP2P | 46.87 | 46.87 |
| MedrxivClusteringS2S | 44.65 | 44.65 |
| MindSmallReranking | 33.05 | 33.05 |
| NFCorpus | 42.0 | 42.0 |
| NQ | 71.8 | 71.8 |
| QuoraRetrieval | 90.03 | 90.03 |
| RedditClustering | 75.27 | 72.86 |
| RedditClusteringP2P | 75.27 | 75.27 |
| SCIDOCS | 26.64 | 26.64 |
| SICK-R | 82.89 | 82.89 |
| STS12 | 80.09 | 80.09 |
| STS13 | 89.68 | 89.68 |
| STS14 | 85.07 | 85.07 |
| STS15 | 89.39 | 89.39 |
| STS16 | 87.15 | 87.15 |
| STS17 | 91.35 | 91.35 |
| STS22 | 68.1 | 68.1 |
| STSBenchmark | 88.23 | 88.23 |
| SciDocsRR | 89.2 | 89.2 |
| SciFact | 80.09 | 80.09 |
| SprintDuplicateQuestions | 96.04 | 96.33 |
| StackExchangeClustering | 49.57 | 80.29 |
| StackExchangeClusteringP2P | 49.57 | 49.57 |
| StackOverflowDupQuestions | 55.25 | 55.25 |
| SummEval | 31.49 | 31.49 |
| TRECCOVID | 85.98 | 85.98 |
| Touche2020 | 29.94 | 29.94 |
| ToxicConversationsClassification | 88.76 | 88.76 |
| TweetSentimentExtractionClassification | 74.84 | 74.84 |
| TwentyNewsgroupsClustering | 61.43 | 61.43 |
| TwitterSemEval2015 | 80.58 | 80.58 |
| TwitterURLCorpus | 87.58 | 87.64 |
| MSMARCO | 45.22 | 45.22 |

Model: dunzhang/stella_en_400M_v5

| Task Name | Old Leaderboard | New Leaderboard |
|---|---|---|
| AmazonCounterfactualClassification | 92.36 | 92.36 |
| AmazonPolarityClassification | 97.19 | 97.19 |
| AmazonReviewsClassification | 59.53 | 59.53 |
| ArguAna | 64.24 | 64.24 |
| ArxivClusteringP2P | 55.16 | 55.16 |
| ArxivClusteringS2S | 49.82 | 49.82 |
| AskUbuntuDupQuestions | 66.15 | 66.15 |
| BIOSSES | 83.3 | 83.3 |
| Banking77Classification | 89.3 | 89.3 |
| BiorxivClusteringP2P | 50.68 | 50.68 |
| BiorxivClusteringS2S | 45.81 | 45.81 |
| CQADupstackAndroidRetrieval | N/A | 51.81 |
| CQADupstackEnglishRetrieval | N/A | 45.22 |
| CQADupstackGamingRetrieval | N/A | 57.02 |
| CQADupstackGisRetrieval | N/A | 38.4 |
| CQADupstackMathematicaRetrieval | N/A | 30.29 |
| CQADupstackPhysicsRetrieval | N/A | 47.58 |
| CQADupstackProgrammersRetrieval | N/A | 44.06 |
| CQADupstackStatsRetrieval | N/A | 36.38 |
| CQADupstackTexRetrieval | N/A | 29.86 |
| CQADupstackUnixRetrieval | N/A | 41.81 |
| CQADupstackWebmastersRetrieval | N/A | 42.46 |
| CQADupstackWordpressRetrieval | N/A | 32.86 |
| ClimateFEVER | 43.53 | 43.53 |
| DBPedia | 49.88 | 49.88 |
| EmotionClassification | 78.77 | 78.77 |
| FEVER | 90.99 | 90.99 |
| FiQA2018 | 56.06 | 56.06 |
| HotpotQA | 71.74 | 71.74 |
| ImdbClassification | 96.49 | 96.49 |
| MTOPDomainClassification | 98.83 | 98.83 |
| MTOPIntentClassification | 92.3 | 92.3 |
| MassiveIntentClassification | 85.17 | 85.17 |
| MassiveScenarioClassification | 89.62 | 89.62 |
| MedrxivClusteringP2P | 46.32 | 46.32 |
| MedrxivClusteringS2S | 44.29 | 44.29 |
| MindSmallReranking | 33.05 | 33.05 |
| NFCorpus | 41.49 | 41.49 |
| NQ | 69.07 | 69.07 |
| QuoraRetrieval | 89.58 | 89.58 |
| RedditClustering | 74.42 | 71.19 |
| RedditClusteringP2P | 74.42 | 74.42 |
| SCIDOCS | 25.04 | 25.04 |
| SICK-R | 82.21 | 82.21 |
| STS12 | 79.52 | 79.52 |
| STS13 | 89.19 | 89.19 |
| STS14 | 85.15 | 85.15 |
| STS15 | 89.1 | 89.1 |
| STS16 | 87.14 | 87.14 |
| STS17 | 90.97 | 90.97 |
| STS22 | 67.83 | 67.83 |
| STSBenchmark | 87.74 | 87.74 |
| SciDocsRR | 88.44 | 88.44 |
| SciFact | 78.23 | 78.23 |
| SprintDuplicateQuestions | 95.59 | 95.75 |
| StackExchangeClustering | 48.9 | 78.49 |
| StackExchangeClusteringP2P | 48.9 | 48.9 |
| StackOverflowDupQuestions | 52.99 | 52.99 |
| SummEval | 31.66 | 31.66 |
| TRECCOVID | 85.21 | 85.21 |
| Touche2020 | 31.45 | 31.45 |
| ToxicConversationsClassification | 86.94 | 86.94 |
| TweetSentimentExtractionClassification | 73.58 | 73.58 |
| TwentyNewsgroupsClustering | 58.57 | 58.57 |
| TwitterSemEval2015 | 80.18 | 80.19 |
| TwitterURLCorpus | 87.46 | 87.47 |
| MSMARCO | 43.69 | 43.69 |

@KennethEnevoldsen (Contributor)

Seems to me like the scores match, but the aggregation is different (the old benchmark aggregates "CQADupstack*Retrieval" into a single score).

@x-tabdeveloping we could manually aggregate these for MTEB (that would be a hotfix). A proper solution is #1231.
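The manual aggregation hotfix could look roughly like this. Averaging the subtask scores is an assumption about how the old leaderboard combined them; the values are the stella_en_400M_v5 new-leaderboard scores from the table above:

```python
# Collapse the per-subtask "CQADupstack*Retrieval" scores into one
# aggregate, the way the old leaderboard reported them. A simple mean
# over subtasks is assumed here.
scores = {
    "CQADupstackAndroidRetrieval": 51.81,
    "CQADupstackEnglishRetrieval": 45.22,
    "CQADupstackGamingRetrieval": 57.02,
    "CQADupstackGisRetrieval": 38.40,
    "CQADupstackMathematicaRetrieval": 30.29,
    "CQADupstackPhysicsRetrieval": 47.58,
    "CQADupstackProgrammersRetrieval": 44.06,
    "CQADupstackStatsRetrieval": 36.38,
    "CQADupstackTexRetrieval": 29.86,
    "CQADupstackUnixRetrieval": 41.81,
    "CQADupstackWebmastersRetrieval": 42.46,
    "CQADupstackWordpressRetrieval": 32.86,
}

def aggregate_cqadupstack(task_scores: dict) -> float:
    """Average all CQADupstack*Retrieval subtask scores into one number."""
    sub = [v for k, v in task_scores.items()
           if k.startswith("CQADupstack") and k.endswith("Retrieval")]
    return round(sum(sub) / len(sub), 2)

print(aggregate_cqadupstack(scores))  # 41.48
```

The new leaderboard's overall average would then use this single number in place of the 12 subtask rows, matching the old leaderboard's task count.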

Originally posted by @KennethEnevoldsen in #1754 (comment)
