ContextualBench is a powerful evaluation framework designed to assess the performance of Large Language Models (LLMs) on contextual datasets. It provides a flexible pipeline for evaluating various LLM families across different tasks, with a focus on handling large context inputs.
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data.
Dynamic Retrieval Support: Efficiently handles large context inputs, allowing for comprehensive evaluation of LLMs' contextual understanding capabilities.
Extensive Evaluation Dataset: Supports 7 contextual tasks, including Question Answering (QA), Multi-Hop Question Answering, and classification tasks.
Multi-LLM Family Support: Compatible with a wide range of LLM families, including Hugging Face models, Gemma, Mistral, OpenAI, and Cohere.
Install the dependencies listed in requirements.txt in a conda environment with at least Python 3.9, using the following commands:
conda create --name contextualbench python=3.9
conda activate contextualbench
pip install -r requirements.txt
Make sure to follow the steps in the vllm_flashinfer_steps.txt file to install the latest transformers, vLLM, and FlashInfer versions (depending on your system):
pip install transformers --upgrade
pip install vllm==v0.5.3.post1
Use the config/config.yaml file to tweak hyperparameters such as temperature, max_tokens, and top_p, and to initialise the different API keys that might be needed.
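For example, you can inspect and adjust those settings directly from the command line (a minimal sketch; ${EDITOR:-nano} simply stands in for whatever editor you prefer):
cat config/config.yaml                # view the current settings and any API-key fields
${EDITOR:-nano} config/config.yaml    # adjust temperature, max_tokens, top_p, API keys, etc.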
The retrieved context already uploaded to ContextualBench can be used, or a custom dataset can be generated using the retriever.py file present in the respective dataset folder, as sketched below.
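A rough sketch of the custom-retrieval step, where <dataset_folder> stands in for whichever dataset folder you are working with (the exact arguments retriever.py accepts are not shown here, so check the script itself):
cd <dataset_folder>        # the folder for the dataset you want to build context for
python retriever.py        # generates the retrieved context for that dataset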
Once the data is ready, make sure you have all the required libraries installed and the OPENAI_API_KEY and COHERE_API_KEY environment variables set.
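On a Unix-like shell, the two variables can be exported like this (the values below are placeholders, not real keys):
export OPENAI_API_KEY="sk-..."    # used for OpenAI models and any OpenAI-backed steps in the pipeline
export COHERE_API_KEY="..."       # used for Cohere models in the pipeline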
Simply execute the command specified in the respective dataset folder to evaluate.
You can also use the run.py file to run any dataset. Run the command
python run.py [dataset_name]
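For instance, if hotpotqa is the name of the dataset folder you prepared (illustrative only; substitute whichever dataset you want to evaluate):
python run.py hotpotqa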
We manage a leaderboard that ranks large language models (LLMs) based on their performance on ContextualBench tasks; it can be found at https://huggingface.co/spaces/Salesforce/ContextualBench-Leaderboard. To have your model evaluated and included on the leaderboard, please send your model's predictions (outputs) for all datasets to [email protected].
If you use this work, please cite the following:
@article{nguyen2024sfrrag,
  title={SFR-RAG: Towards Contextually Faithful LLMs},
  author={Nguyen, Xuan-Phi and Pandit, Shrey and Purushwalkam, Senthil and Xu, Austin and Chen, Hailin and Ming, Yifei and Ke, Zixuan and Savarese, Silvio and Xiong, Caiming and Joty, Shafiq},
  year={2024}
}