Unable to reproduce Table 7 from the paper #176

Open
pawasthy opened this issue Jan 9, 2025 · 1 comment

pawasthy commented Jan 9, 2025

Hi, I have been trying to replicate the BEIR scores in Table 7 of the paper. I used the train_st.py script as is, trained on 2 GPUs for 1 epoch with accelerate launch --num_processes num_gpu train_st.py, and then evaluated on BEIR using the MTEB library.

I used the hyperparameters suggested in Table 9 (lr 8e-5 for ModernBERT, 5e-5 for bert-base) and kept the script's default values for everything else. I am not able to replicate the numbers; any idea what the difference could be? Could you please list the hyperparameters you used: how many GPUs, what batch size, and any special arguments passed to MTEB?
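For reference, this is roughly how I am mapping the Table 9 hyperparameters onto the Sentence Transformers training arguments (an illustrative sketch, not my exact script; the output path is a placeholder and everything not shown stays at the train_st.py defaults):

from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative sketch: only the values I am overriding; the output path is a placeholder.
args = SentenceTransformerTrainingArguments(
    output_dir="output/modernbert-base",
    num_train_epochs=1,
    learning_rate=8e-5,  # 5e-5 when fine-tuning bert-base
)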

Thanks!

NohTow (Collaborator) commented Jan 13, 2025

Hello,

Sorry for the delayed answer.
I went to check our training scripts, and it appears the batch sizes written in the paper are not correct.
The actual setup is batch_size = 64 with accumulation_steps = 8 for the base sizes, and batch_size = 16 with accumulation_steps = 32 for the large sizes. I am sorry about that; we will correct the values in the paper.

Also, we use 8 GPUs, so if you are training on 2 GPUs you should multiply the gradient accumulation by 4 (so 32 for base). This should be pretty much equivalent, as I believe Sentence Transformers does not gather in-batch samples from the other GPUs.
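To make the arithmetic explicit, here is a minimal sketch (variable names are mine, not from train_st.py) showing that the effective batch size is preserved when trading GPUs for accumulation steps:

# Effective batch size = per-device batch size x number of GPUs x accumulation steps.
per_device_batch_size = 64

paper_setup = per_device_batch_size * 8 * 8     # 8 GPUs, accumulation 8  -> 4096
two_gpu_setup = per_device_batch_size * 2 * 32  # 2 GPUs, accumulation 32 -> 4096

assert paper_setup == two_gpu_setup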

Finally, please also note that the released script uses CachedMultipleNegativesRankingLoss, which creates a bigger effective batch size by leveraging GradCache. Setting both per_device_train_batch_size and mini_batch_size to the target batch size should do the trick, but you might want to use loss = MultipleNegativesRankingLoss(model) if you want the exact same setup; see the sketch below.
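A minimal sketch of the two options, using the standard Sentence Transformers API (the model name is illustrative, not an excerpt from train_st.py):

from sentence_transformers import SentenceTransformer, losses

# Illustrative checkpoint; use the model you are actually fine-tuning.
model = SentenceTransformer("answerdotai/ModernBERT-base")

# Option 1: GradCache-based loss, as in the released script.
# mini_batch_size is only the chunk size used to fit the forward pass in memory;
# the in-batch negatives come from the full per_device_train_batch_size, so setting
# both to the target batch size reproduces a plain batch of that size.
cached_loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

# Option 2: plain in-batch negatives, matching the paper's setup exactly.
plain_loss = losses.MultipleNegativesRankingLoss(model)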

As for the evaluation, we did not use the MTEB library directly but a custom script, attached below.
Sorry if the scripts in this repository are a bit different; they were meant more as accessible boilerplate for people wanting to run some tests than as replication scripts. We should have created dedicated ones.

from pylate import evaluation
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholders: point these at your trained checkpoint and the BEIR dataset to evaluate.
model_path = "path/to/trained/model"
dataset = "scifact"  # any BEIR dataset name

model = SentenceTransformer(model_name_or_path=model_path)

documents, queries, qrels = evaluation.load_beir(
    dataset_name=dataset,
    split="test",
)

batch_size = 1000
documents_embeddings = model.encode(
    sentences=[document["text"] for document in documents],
    batch_size=batch_size,
    show_progress_bar=True,
)

queries_embeddings = model.encode(
    sentences=queries,
    show_progress_bar=True,
    batch_size=16,
)

# Normalize document and query embeddings so the dot product is a cosine similarity
documents_embeddings = documents_embeddings / np.linalg.norm(documents_embeddings, axis=1, keepdims=True)
queries_embeddings = queries_embeddings / np.linalg.norm(queries_embeddings, axis=1, keepdims=True)

similarity_matrix = np.dot(queries_embeddings, documents_embeddings.T)

k = 10
documents_ids = [document["id"] for document in documents]
results = []

# For each query, keep the top-k documents by similarity
for query_idx in range(similarity_matrix.shape[0]):
    scores = similarity_matrix[query_idx]
    # Indices of the k highest scores
    top_k_indices = np.argpartition(scores, -k)[-k:]
    # Sort them in descending order of score
    top_k_indices = top_k_indices[np.argsort(scores[top_k_indices])[::-1]]

    # Format results as a list of dictionaries
    query_results = [
        {"id": documents_ids[idx], "score": float(scores[idx])}
        for idx in top_k_indices
    ]
    results.append(query_results)

evaluation_scores = evaluation.evaluate(
    scores=results,
    qrels=qrels,
    queries=queries,
    metrics=["ndcg@10"],
)
print(evaluation_scores)

This should work for every dataset except CQADupstack, which we loaded with evaluation.load_custom_dataset and local files.
But running the MTEB evaluation script should hopefully give fairly close results as well.
