
Leaderboard: stella results don't match between new and old leaderboard #1753

Open · Muennighoff opened this issue Jan 10, 2025 · 4 comments

@Muennighoff (Contributor)
See #1571 (comment); maybe a bug in our implementation caused the difference when rerunning?

@Muennighoff (Contributor, Author)

It's especially weird that in our runs the 400M is better than the 1.5B.

@Samoed (Collaborator) commented Jan 10, 2025

Maybe we should retest it with the same wrapper as jasper, because right now it passes the instruction to passages too, which shouldn't happen.
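The intended behavior described above (the retrieval instruction prepended to queries only, never to passages) can be sketched as follows. This is a minimal illustration, not MTEB's actual wrapper code, and the prompt text is an assumed example:

```python
# Sketch: the retrieval instruction belongs on the query side only.
# Passing it to passages too (the suspected bug) would change document
# embeddings and hence the scores. The prompt string is illustrative.
INSTRUCTION = "Instruct: Given a web search query, retrieve relevant passages.\nQuery: "

def build_inputs(queries, passages, instruction=INSTRUCTION):
    """Prefix the instruction to queries; leave passages untouched."""
    query_inputs = [instruction + q for q in queries]
    passage_inputs = list(passages)  # no instruction here
    return query_inputs, passage_inputs

q_in, p_in = build_inputs(["what is mteb?"], ["MTEB is a benchmark..."])
```

The resulting strings would then be fed to the model's `encode` step; only `q_in` carries the instruction.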

@Samoed (Collaborator) commented Jan 11, 2025

Model: dunzhang/stella_en_1.5B_v5

| Task Name | Old Leaderboard | New Leaderboard |
|---|---|---|
| AmazonCounterfactualClassification | 92.87 | 92.87 |
| AmazonPolarityClassification | 97.16 | 97.16 |
| AmazonReviewsClassification | 59.36 | 59.36 |
| ArguAna | 65.27 | 65.27 |
| ArxivClusteringP2P | 55.44 | 55.44 |
| ArxivClusteringS2S | 50.66 | 50.66 |
| AskUbuntuDupQuestions | 67.33 | 67.33 |
| BIOSSES | 83.11 | 83.11 |
| Banking77Classification | 89.79 | 89.79 |
| BiorxivClusteringP2P | 50.68 | 50.68 |
| BiorxivClusteringS2S | 46.87 | 46.87 |
| CQADupstackAndroidRetrieval | N/A | 40.98 |
| CQADupstackEnglishRetrieval | N/A | 41.89 |
| CQADupstackGamingRetrieval | N/A | 53.59 |
| CQADupstackGisRetrieval | N/A | 25.99 |
| CQADupstackMathematicaRetrieval | N/A | 22.08 |
| CQADupstackPhysicsRetrieval | N/A | 47.59 |
| CQADupstackProgrammersRetrieval | N/A | 30.54 |
| CQADupstackStatsRetrieval | N/A | 32.78 |
| CQADupstackTexRetrieval | N/A | 16.91 |
| CQADupstackUnixRetrieval | N/A | 29.03 |
| CQADupstackWebmastersRetrieval | N/A | 30.72 |
| CQADupstackWordpressRetrieval | N/A | 18.42 |
| ClimateFEVER | 46.11 | 46.11 |
| DBPedia | 52.28 | 52.28 |
| EmotionClassification | 84.29 | 84.29 |
| FEVER | 94.83 | 94.83 |
| FiQA2018 | 60.48 | 60.48 |
| HotpotQA | 76.67 | 76.67 |
| ImdbClassification | 96.66 | 96.66 |
| MTOPDomainClassification | 99.01 | 99.01 |
| MTOPIntentClassification | 92.78 | 92.78 |
| MassiveIntentClassification | 85.83 | 85.83 |
| MassiveScenarioClassification | 90.2 | 90.2 |
| MedrxivClusteringP2P | 46.87 | 46.87 |
| MedrxivClusteringS2S | 44.65 | 44.65 |
| MindSmallReranking | 33.05 | 33.05 |
| NFCorpus | 42.0 | 42.0 |
| NQ | 71.8 | 71.8 |
| QuoraRetrieval | 90.03 | 90.03 |
| RedditClustering | 75.27 | 72.86 |
| RedditClusteringP2P | 75.27 | 75.27 |
| SCIDOCS | 26.64 | 26.64 |
| SICK-R | 82.89 | 82.89 |
| STS12 | 80.09 | 80.09 |
| STS13 | 89.68 | 89.68 |
| STS14 | 85.07 | 85.07 |
| STS15 | 89.39 | 89.39 |
| STS16 | 87.15 | 87.15 |
| STS17 | 91.35 | 91.35 |
| STS22 | 68.1 | 68.1 |
| STSBenchmark | 88.23 | 88.23 |
| SciDocsRR | 89.2 | 89.2 |
| SciFact | 80.09 | 80.09 |
| SprintDuplicateQuestions | 96.04 | 96.33 |
| StackExchangeClustering | 49.57 | 80.29 |
| StackExchangeClusteringP2P | 49.57 | 49.57 |
| StackOverflowDupQuestions | 55.25 | 55.25 |
| SummEval | 31.49 | 31.49 |
| TRECCOVID | 85.98 | 85.98 |
| Touche2020 | 29.94 | 29.94 |
| ToxicConversationsClassification | 88.76 | 88.76 |
| TweetSentimentExtractionClassification | 74.84 | 74.84 |
| TwentyNewsgroupsClustering | 61.43 | 61.43 |
| TwitterSemEval2015 | 80.58 | 80.58 |
| TwitterURLCorpus | 87.58 | 87.64 |
| MSMARCO | 45.22 | 45.22 |

Model: dunzhang/stella_en_400M_v5

| Task Name | Old Leaderboard | New Leaderboard |
|---|---|---|
| AmazonCounterfactualClassification | 92.36 | 92.36 |
| AmazonPolarityClassification | 97.19 | 97.19 |
| AmazonReviewsClassification | 59.53 | 59.53 |
| ArguAna | 64.24 | 64.24 |
| ArxivClusteringP2P | 55.16 | 55.16 |
| ArxivClusteringS2S | 49.82 | 49.82 |
| AskUbuntuDupQuestions | 66.15 | 66.15 |
| BIOSSES | 83.3 | 83.3 |
| Banking77Classification | 89.3 | 89.3 |
| BiorxivClusteringP2P | 50.68 | 50.68 |
| BiorxivClusteringS2S | 45.81 | 45.81 |
| CQADupstackAndroidRetrieval | N/A | 51.81 |
| CQADupstackEnglishRetrieval | N/A | 45.22 |
| CQADupstackGamingRetrieval | N/A | 57.02 |
| CQADupstackGisRetrieval | N/A | 38.4 |
| CQADupstackMathematicaRetrieval | N/A | 30.29 |
| CQADupstackPhysicsRetrieval | N/A | 47.58 |
| CQADupstackProgrammersRetrieval | N/A | 44.06 |
| CQADupstackStatsRetrieval | N/A | 36.38 |
| CQADupstackTexRetrieval | N/A | 29.86 |
| CQADupstackUnixRetrieval | N/A | 41.81 |
| CQADupstackWebmastersRetrieval | N/A | 42.46 |
| CQADupstackWordpressRetrieval | N/A | 32.86 |
| ClimateFEVER | 43.53 | 43.53 |
| DBPedia | 49.88 | 49.88 |
| EmotionClassification | 78.77 | 78.77 |
| FEVER | 90.99 | 90.99 |
| FiQA2018 | 56.06 | 56.06 |
| HotpotQA | 71.74 | 71.74 |
| ImdbClassification | 96.49 | 96.49 |
| MTOPDomainClassification | 98.83 | 98.83 |
| MTOPIntentClassification | 92.3 | 92.3 |
| MassiveIntentClassification | 85.17 | 85.17 |
| MassiveScenarioClassification | 89.62 | 89.62 |
| MedrxivClusteringP2P | 46.32 | 46.32 |
| MedrxivClusteringS2S | 44.29 | 44.29 |
| MindSmallReranking | 33.05 | 33.05 |
| NFCorpus | 41.49 | 41.49 |
| NQ | 69.07 | 69.07 |
| QuoraRetrieval | 89.58 | 89.58 |
| RedditClustering | 74.42 | 71.19 |
| RedditClusteringP2P | 74.42 | 74.42 |
| SCIDOCS | 25.04 | 25.04 |
| SICK-R | 82.21 | 82.21 |
| STS12 | 79.52 | 79.52 |
| STS13 | 89.19 | 89.19 |
| STS14 | 85.15 | 85.15 |
| STS15 | 89.1 | 89.1 |
| STS16 | 87.14 | 87.14 |
| STS17 | 90.97 | 90.97 |
| STS22 | 67.83 | 67.83 |
| STSBenchmark | 87.74 | 87.74 |
| SciDocsRR | 88.44 | 88.44 |
| SciFact | 78.23 | 78.23 |
| SprintDuplicateQuestions | 95.59 | 95.75 |
| StackExchangeClustering | 48.9 | 78.49 |
| StackExchangeClusteringP2P | 48.9 | 48.9 |
| StackOverflowDupQuestions | 52.99 | 52.99 |
| SummEval | 31.66 | 31.66 |
| TRECCOVID | 85.21 | 85.21 |
| Touche2020 | 31.45 | 31.45 |
| ToxicConversationsClassification | 86.94 | 86.94 |
| TweetSentimentExtractionClassification | 73.58 | 73.58 |
| TwentyNewsgroupsClustering | 58.57 | 58.57 |
| TwitterSemEval2015 | 80.18 | 80.19 |
| TwitterURLCorpus | 87.46 | 87.47 |
| MSMARCO | 43.69 | 43.69 |

@KennethEnevoldsen (Contributor)

Seems to me like the scores match, but the aggregation is different (the old benchmark aggregates "CQADupstack*Retrieval" into a single score).

@x-tabdeveloping we could manually aggregate these for MTEB (that would be a hotfix). A proper solution is #1231.
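The manual aggregation hotfix could look roughly like this. Averaging the subtask scores is an assumption about how the old leaderboard combined them; the values are the stella_en_400M_v5 new-leaderboard scores from the table above:

```python
# Collapse the per-subtask "CQADupstack*Retrieval" scores into one
# aggregate, the way the old leaderboard reported them. A simple mean
# over subtasks is assumed here.
scores = {
    "CQADupstackAndroidRetrieval": 51.81,
    "CQADupstackEnglishRetrieval": 45.22,
    "CQADupstackGamingRetrieval": 57.02,
    "CQADupstackGisRetrieval": 38.40,
    "CQADupstackMathematicaRetrieval": 30.29,
    "CQADupstackPhysicsRetrieval": 47.58,
    "CQADupstackProgrammersRetrieval": 44.06,
    "CQADupstackStatsRetrieval": 36.38,
    "CQADupstackTexRetrieval": 29.86,
    "CQADupstackUnixRetrieval": 41.81,
    "CQADupstackWebmastersRetrieval": 42.46,
    "CQADupstackWordpressRetrieval": 32.86,
}

def aggregate_cqadupstack(task_scores: dict) -> float:
    """Average all CQADupstack*Retrieval subtask scores into one number."""
    sub = [v for k, v in task_scores.items()
           if k.startswith("CQADupstack") and k.endswith("Retrieval")]
    return round(sum(sub) / len(sub), 2)

print(aggregate_cqadupstack(scores))  # 41.48
```

The new leaderboard's overall average would then use this single number in place of the 12 subtask rows, matching the old leaderboard's task count.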

Originally posted by @KennethEnevoldsen in #1754 (comment)
