Unable to reproduce Table 7 from the paper #176

Open
pawasthy opened this issue Jan 9, 2025 · 1 comment

pawasthy commented Jan 9, 2025

Hi, I have been trying to replicate the BEIR scores in Table 7 of the paper. I used the train_st.py script as is, trained on 2 GPUs for 1 epoch with accelerate launch --num_processes num_gpu train_st.py, and then evaluated on BEIR using the MTEB library.

I used the hyperparameters suggested in Table 9 (lr 8e-5 for ModernBERT, 5e-5 for bert-base) and kept the script's default values for everything else. I am not able to replicate the numbers; any idea what the difference could be? Could you please list the hyperparameters you used: how many GPUs, what batch size, and any special arguments passed to MTEB?
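For reference, this is roughly how I am mapping the Table 9 hyperparameters onto the Sentence Transformers training arguments (an illustrative sketch, not my exact script; the output path is a placeholder and everything not shown stays at the train_st.py defaults):

from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative sketch: only the values I am overriding; the output path is a placeholder.
args = SentenceTransformerTrainingArguments(
    output_dir="output/modernbert-base",
    num_train_epochs=1,
    learning_rate=8e-5,  # 5e-5 when fine-tuning bert-base
)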

Thanks!

NohTow (Collaborator) commented Jan 13, 2025

Hello,

Sorry for the delayed answer.
I went to check our training scripts, and it appears the batch sizes written in the paper are not correct.
The actual setup is batch_size = 64 with accumulation_steps = 8 for the base sizes, and batch_size = 16 with accumulation_steps = 32 for the large sizes. I am sorry about that; we will correct the values in the paper.

Also, we use 8 GPUs, so if you are training on 2 GPUs you should multiply the gradient accumulation by 4 (so 32 for base). This should be pretty much equivalent, as I believe Sentence Transformers does not gather in-batch samples from the other GPUs.
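To make the arithmetic explicit, here is a minimal sketch (variable names are mine, not from train_st.py) showing that the effective batch size is preserved when trading GPUs for accumulation steps:

# Effective batch size = per-device batch size x number of GPUs x accumulation steps.
per_device_batch_size = 64

paper_setup = per_device_batch_size * 8 * 8     # 8 GPUs, accumulation 8  -> 4096
two_gpu_setup = per_device_batch_size * 2 * 32  # 2 GPUs, accumulation 32 -> 4096

assert paper_setup == two_gpu_setup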

Finally, please also note that the released script uses CachedMultipleNegativesRankingLoss, which creates a bigger effective batch size by leveraging GradCache. Setting both per_device_train_batch_size and mini_batch_size to the target batch size should do the trick, but you might want to use loss = MultipleNegativesRankingLoss(model) if you want the exact same setup; see the sketch below.
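A minimal sketch of the two options, using the standard Sentence Transformers API (the model name is illustrative, not an excerpt from train_st.py):

from sentence_transformers import SentenceTransformer, losses

# Illustrative checkpoint; use the model you are actually fine-tuning.
model = SentenceTransformer("answerdotai/ModernBERT-base")

# Option 1: GradCache-based loss, as in the released script.
# mini_batch_size is only the chunk size used to fit the forward pass in memory;
# the in-batch negatives come from the full per_device_train_batch_size, so setting
# both to the target batch size reproduces a plain batch of that size.
cached_loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

# Option 2: plain in-batch negatives, matching the paper's setup exactly.
plain_loss = losses.MultipleNegativesRankingLoss(model)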

As for the evaluation, we did not use the MTEB library directly but a custom script, attached below.
Sorry if the scripts in this repository are a bit different; they were meant more as accessible boilerplate for people wanting to run some tests than as replication scripts. We should have created dedicated ones.

from pylate import evaluation
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholders: point these at your trained checkpoint and the BEIR dataset to evaluate.
model_path = "path/to/trained/model"
dataset = "scifact"  # any BEIR dataset name

model = SentenceTransformer(model_name_or_path=model_path)

documents, queries, qrels = evaluation.load_beir(
    dataset_name=dataset,
    split="test",
)

batch_size = 1000
documents_embeddings = model.encode(
    sentences=[document["text"] for document in documents],
    batch_size=batch_size,
    show_progress_bar=True,
)

queries_embeddings = model.encode(
    sentences=queries,
    show_progress_bar=True,
    batch_size=16,
)

# Normalize document and query embeddings so the dot product is a cosine similarity
documents_embeddings = documents_embeddings / np.linalg.norm(documents_embeddings, axis=1, keepdims=True)
queries_embeddings = queries_embeddings / np.linalg.norm(queries_embeddings, axis=1, keepdims=True)

similarity_matrix = np.dot(queries_embeddings, documents_embeddings.T)

k = 10
documents_ids = [document["id"] for document in documents]
results = []

# For each query, keep the top-k documents by similarity
for query_idx in range(similarity_matrix.shape[0]):
    scores = similarity_matrix[query_idx]
    # Indices of the k highest scores
    top_k_indices = np.argpartition(scores, -k)[-k:]
    # Sort them in descending order of score
    top_k_indices = top_k_indices[np.argsort(scores[top_k_indices])[::-1]]

    # Format results as a list of dictionaries
    query_results = [
        {"id": documents_ids[idx], "score": float(scores[idx])}
        for idx in top_k_indices
    ]
    results.append(query_results)

evaluation_scores = evaluation.evaluate(
    scores=results,
    qrels=qrels,
    queries=queries,
    metrics=["ndcg@10"],
)
print(evaluation_scores)

This should work for every dataset except CQADupstack, which we loaded with evaluation.load_custom_dataset and local files.
But running the MTEB evaluation script should hopefully give fairly close results as well.
