Integrating ChemTEB #1708

Open · wants to merge 78 commits into main
Conversation

@HSILA (Contributor) commented Jan 5, 2025

As discussed in issue #1585, I have integrated ChemTEB into the MTEB codebase and submitted this PR for merging. I have added 28 tasks and two model families: amazon_models.py and cohere_bedrock_models.py. These models were included to meet our internal requirements, as they are not currently available in MTEB; I thought it might be good to have them, but I can remove them if necessary. I have tested the mentioned models with all the tasks; you can see the results here: chemteb_results_pr.csv

Closes #1585

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • Update the tables in docs/benchmarks.md and docs/tasks.md.
  • Add ChemTEB as a benchmark in benchmark.py.

Adding datasets checklist

Reason for dataset addition: To address the lack of benchmarks tailored to chemical sciences, where text involves complex entity names, expressions, and specialized representations like SMILES codes, not typically found in general domain datasets.

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

HSILA and others added 30 commits August 10, 2024 17:47
HSILA and others added 17 commits December 15, 2024 23:19
- Integrate models added in ChemTEB (such as amazon, cohere bedrock and nomic bert) with latest modeling format in mteb.
- Update the metadata for the mentioned models
- `open_weights` argument is repeated twice
- Rename some tasks for better readability.
- Merge some BitextMining and PairClassification tasks into a single task with subsets (`PubChemSMILESBitextMining` and `PubChemSMILESPC`)
- Add a new multilingual task (`PubChemWikiPairClassification`) consisting of 12 languages.
- Update dataset paths, revisions and metadata for most tasks.
- Add a `Chemistry` domain to `TaskMetadata`
- Move `PubChemSMILESBitextMining` to `eng` folder
- Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
- Update Clustering tasks `category`
- Change `main_score` for `PubChemAISentenceParaphrasePC`
- `eval_langs` fixed
- Dataset path was fixed for two datasets
- Metadata was completed for all tasks, mainly following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`
- ruff lint
- rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it.
@Samoed (Collaborator) left a comment

Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?

Review comments (outdated, resolved) on:
- mteb/models/cohere_bedrock_models.py
- mteb/tasks/BitextMining/eng/PubChemSMILESBitextMining.py
- mteb/tasks/PairClassification/eng/PubChemSynonymPC.py
@HSILA (Contributor, Author) commented Jan 5, 2025

> Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?

Hi, thank you for the review. I haven't worked with the mentioned repository before. Could you clarify how I should proceed? What do you mean by "paper"?

@Samoed (Collaborator) commented Jan 5, 2025

You should provide the results of your run to this repo (the directory with JSON files). By paper I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1

@HSILA (Contributor, Author) commented Jan 5, 2025

> You should provide the results of your run to this repo (the directory with JSON files). By paper I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1

The detailed scores for each task-model pair are not reported in our paper (I have them locally, though); instead, an average per category is provided. Additionally, the current PR does not include the exact same tasks: three tasks have been removed, one new multilingual task has been added, and eight tasks have been merged into two. Therefore, the average scores may not align perfectly. However, I can run the benchmark and push the scores to the results repository.

- Text should be truncated for amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (len > 99th percentile)
@HSILA (Contributor, Author) commented Jan 8, 2025

I updated bedrock_models.py to truncate sentences if they exceed Amazon's context length. For Cohere models in Bedrock, the limit is exactly 2048 characters, while for Amazon models, it is 8192 tokens. However, there is no straightforward way to calculate the number of tokens for a piece of text (like tiktoken does for OpenAI models) unless you send a long text to the API and it returns an error with the exact token count :)

Error Example:

ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: 
400 Bad Request: Too many input tokens. Max input tokens: 8192, request input token count: 11530
To handle long text, there are two approaches for truncation:

Static Approach
Use a smaller, safe character-to-token ratio. According to Amazon's documentation:

> The characters-to-token ratio in English is 4.7 characters per token, on average.

A conservative ratio, such as 4.5 characters per token, can be used (see the sketch below).
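
For illustration, a minimal sketch of this static truncation, assuming the 4.5 characters-per-token budget and the limits mentioned above (the helper name and constants are hypothetical, not the exact code in bedrock_models.py):

AMAZON_MAX_TOKENS = 8192  # Amazon Titan embedding limit (tokens)
COHERE_MAX_CHARS = 2048   # Cohere-on-Bedrock limit (characters)
CHARS_PER_TOKEN = 4.5     # conservative estimate, below Amazon's reported 4.7 average


def truncate_statically(sentence: str, provider: str) -> str:
    """Truncate a sentence to a size the Bedrock API should accept."""
    if provider == "cohere":
        return sentence[:COHERE_MAX_CHARS]
    # For Amazon models, approximate the token limit with a character budget.
    return sentence[: int(AMAZON_MAX_TOKENS * CHARS_PER_TOKEN)]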

Dynamic Approach
Wrap the call in a try/except block, let the API raise an error for long text, and then retry based on the exact token count from the error message (maybe I'm overthinking this, though):

# Assumes `import re` at module level and that `self._client` is the boto3
# bedrock-runtime client, whose InvokeModel call raises a ValidationException.
try:
    all_embeddings.append(self._embed_amazon(sentence))
except self._client.exceptions.ValidationException as e:
    # The error message ends with "request input token count: <N>"; recover N.
    match = re.search(r"request input token count:\s*(\d+)", str(e))
    if match:
        num_tokens = int(match.group(1))
        # Shrink the sentence proportionally, with a 10% safety margin, and retry once.
        ratio = 0.9 * (self._max_tokens / num_tokens)
        max_sequence_length = int(len(sentence) * ratio)
        all_embeddings.append(self._embed_amazon(sentence[:max_sequence_length]))
    else:
        raise

I used the first method, but if you think the second one is better, let me know.

@HSILA (Contributor, Author) commented Jan 8, 2025

While running the benchmark again for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something weird. OpenAI text embedding models generally have a context length of 8192 tokens (if you go over that, the API throws an error). But it looks like the text-embedding-ada-002 model actually has a context length of 8191. If you push it to 8192 tokens, the API doesn't throw an error (probably to align with the text-embedding-3-* models), but the model sometimes returns null values instead. I think we should update openai_models.py to fix this. Here's a script to reproduce the issue; change 8191 to 8192 and compare the results.

from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Build an input far longer than the model's context window.
sn = 'Hello World' * 5000

print(f'Num tokens: {len(encoding.encode(sn))}')

# Truncate to 8191 tokens; change to 8192 to reproduce the null-embedding issue.
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)

# Count NaN values in the returned embedding (should be 0 at 8191 tokens).
em = np.array(response.data[0].embedding)
print(f'Null values: {np.isnan(em.astype(np.float32)).sum()}')
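
For reference, a minimal sketch of the kind of guard I have in mind for the fix (the helper name and constant are hypothetical, not the existing openai_models.py code):

import tiktoken

ADA_002_MAX_TOKENS = 8191  # ada-002 misbehaves at 8192 tokens, so stay one token below


def truncate_for_ada_002(text: str) -> str:
    """Truncate text to at most 8191 cl100k_base tokens before embedding with ada-002."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:ADA_002_MAX_TOKENS])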

@Samoed (Collaborator) commented Jan 9, 2025

Yes, we have some inconsistency with max_tokens in OpenAI models. In ModelMeta, it states max_tokens=8191, but in the class, it is set to 8192. I think you could create a separate PR to fix this.

For the Bedrock token count, I suggest adding a static estimate of the number of tokens along with the dynamic approach, to ensure the models work correctly.
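
Roughly something like this sketch (the helper name is hypothetical, and the 4.5 characters-per-token figure is the estimate from the earlier comment, not a library constant):

import re

CHARS_PER_TOKEN = 4.5  # conservative estimate below Amazon's reported 4.7 average


def embed_with_fallback(embed_fn, sentence: str, max_tokens: int = 8192) -> list[float]:
    """Statically pre-truncate to a character budget, then retry once if the API
    still rejects the input, using the exact token count from the error message."""
    text = sentence[: int(max_tokens * CHARS_PER_TOKEN)]
    try:
        return embed_fn(text)
    except Exception as e:  # ideally catch the client's ValidationException here
        match = re.search(r"request input token count:\s*(\d+)", str(e))
        if not match:
            raise
        # Shrink proportionally with a 10% safety margin and retry once.
        ratio = 0.9 * (max_tokens / int(match.group(1)))
        return embed_fn(text[: int(len(text) * ratio)])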

Additionally, could you create a table here comparing the results from the paper with your results?
