MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

[📜 Paper][🤗 HF Dataset][🐱 GitHub]

📢 News and Updates

[2024.12.01] 🔥We have initialized the repository.
[2024.12.16] 🔥We have added the evaluation results of Phi-4-14B and Llama-3.3-70B-Instruct.
[2024.12.20] 🔥We have released the validation dataset of MMLU-CF.
[2025.1.9] 🔥OpenCompass now supports MMLU-CF. Feel free to give it a try!

1. The Motivation of MMLU-CF

  • Existing benchmarks are open-source, and LLM training data is drawn from broad sources. Together these have inevitably led to benchmark contamination, which makes evaluation results unreliable. To alleviate this issue, we propose MMLU-CF.
  • (a) An instance of leakage in MMLU. When questions from MMLU are used as prompts, certain LLMs, due to their memorization capabilities, directly produce choices identical to the original ones. (b) When questions from MMLU-CF are used as prompts, LLMs only produce guessed choices. This indicates that the MMLU test set suffers from data contamination and memorization by some LLMs, while the proposed MMLU-CF avoids such leakage.

Fig1_a Fig1_b

2. How to Evaluate Your Models on the MMLU-CF Validation/Test Set

(1) We perform automated testing only on Hugging Face models. After you follow the steps below and obtain the validation set results from OpenCompass, you can request the test set results via GitHub Issues.

Step 1. Validation set evaluation: Obtain the validation results for your model using the LLM evaluation tool OpenCompass. The validation data will be automatically loaded from Hugging Face.

  • For a 5-shot evaluation with InternLM2.5:
opencompass --models hf_internlm2_5_1_8b_chat --datasets mmlu_cf_few_shot --summarizer mmlu_cf
  • For a 0-shot evaluation with InternLM2.5:
opencompass --models hf_internlm2_5_1_8b_chat --datasets mmlu_cf_zero_shot --summarizer mmlu_cf
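The two commands above differ only in the dataset config name. A minimal Python sketch, assuming you want to script both runs, builds the argument lists from the flags shown above. The helper name `build_opencompass_command` and its `shots` parameter are illustrative and are not part of OpenCompass.

```python
import shlex

# Dataset config names taken from the two commands above.
DATASET_BY_SHOTS = {5: "mmlu_cf_few_shot", 0: "mmlu_cf_zero_shot"}


def build_opencompass_command(model: str, shots: int) -> list[str]:
    """Return the argv list for one MMLU-CF validation run (hypothetical helper)."""
    if shots not in DATASET_BY_SHOTS:
        raise ValueError(f"only the 0-shot and 5-shot configs are shown here, got {shots}")
    return [
        "opencompass",
        "--models", model,
        "--datasets", DATASET_BY_SHOTS[shots],
        "--summarizer", "mmlu_cf",
    ]


if __name__ == "__main__":
    for shots in (5, 0):
        cmd = build_opencompass_command("hf_internlm2_5_1_8b_chat", shots)
        # Only prints the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Swap in the OpenCompass model config name for your own model; the model name used here is the one from the examples above.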

Step 2. Test set evaluation: With the validation results, submit a GitHub issue on the MMLU-CF GitHub homepage to request the test set results. Please follow the format below:

Example 1:

Title: 
Test set evaluation Request - add HF model [microsoft/phi-4]  
Content: 
The result on validation set: 68.5%

Example 2:

Fig6

Notably:

  • Ensure you use the format with square brackets [ ] as shown. The model name microsoft/phi-4 corresponds to the name on Hugging Face.
  • We will automatically submit your model. The time to receive the results depends on the number of models being evaluated, but it typically takes 1-2 weeks.
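Since the request is matched on the bracketed model name, it can help to generate the issue text rather than type it by hand. The following is a small sketch, assuming the title and content format from Example 1; the function `format_issue` and the validation regex are illustrative, not something the maintainers provide.

```python
import re

# Bracketed model name: non-empty, no whitespace and no nested brackets (an assumption).
MODEL_NAME_PATTERN = re.compile(r"^[^\[\]\s]+$")


def format_issue(hf_model: str, validation_accuracy: float) -> tuple[str, str]:
    """Build the issue title and content in the format of Example 1 (hypothetical helper)."""
    if not MODEL_NAME_PATTERN.match(hf_model):
        raise ValueError(f"unexpected Hugging Face model name: {hf_model!r}")
    title = f"Test set evaluation Request - add HF model [{hf_model}]"
    content = f"The result on validation set: {validation_accuracy:.1f}%"
    return title, content


if __name__ == "__main__":
    title, content = format_issue("microsoft/phi-4", 68.5)
    print("Title:", title)
    print("Content:", content)
```

Paste the two strings into the issue title and body when you open the request.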

(2) For API models, once OpenCompass has updated the corresponding model interface, you can obtain the test set results by emailing us a temporary API key after you have obtained the validation set results.

3. What is the Difference between MMLU-CF and MMLU

MMLU focuses on breadth and reasoning without considering contamination prevention. For MMLU-CF, we collect data from a broader domain and apply three decontamination rules to mitigate unintentional data leakage. In addition, MMLU-CF keeps its test set closed-source to prevent malicious data leakage.

Fig4 Fig5

4. Leaderboard

| Model | MMLU 5-shot | MMLU-CF 5-shot Test | MMLU-CF 5-shot Validation | MMLU-CF 5-shot Δ | MMLU-CF 0-shot Test | MMLU-CF 0-shot Validation | MMLU-CF 0-shot Δ |
|---|---|---|---|---|---|---|---|
| **API** | | | | | | | |
| GPT-4o | 88.0 | 73.4 | 73.4 | +0.0 | 71.9 | 72.4 | -0.5 |
| GPT-4-Turbo | 86.5 | 70.4 | 70.1 | +0.3 | 68.9 | 68.7 | +0.1 |
| GPT-4o-mini | 81.8 | 65.5 | 65.1 | +0.4 | 66.0 | 65.3 | +0.7 |
| Gemini-1.5-Flash | 78.7 | 64.8 | 64.9 | -0.1 | 56.7 | 56.9 | -0.2 |
| GPT-3.5-Turbo | 71.4 | 58.2 | 59.0 | -0.8 | 57.2 | 58.1 | -0.9 |
| **Large** | | | | | | | |
| Qwen2.5-72B-instruct | 85.3 | 71.6 | 71.3 | +0.3 | 70.6 | 70.4 | +0.2 |
| Llama-3-70B-instruct | 82.0 | 68.9 | 68.8 | +0.1 | 68.1 | 67.4 | +0.7 |
| Llama-3.3-70B-instruct | 86.3 | 68.8 | 67.8 | +1.0 | 67.6 | 67.5 | +0.1 |
| Llama-3.1-70B-instruct | 86.0 | 68.7 | 68.1 | +0.6 | 70.4 | 69.7 | +0.7 |
| Phi-3.5-MoE-instruct | 78.9 | 64.6 | 64.5 | +0.1 | 63.1 | 62.1 | +1.0 |
| Qwen2-72B-instruct | 82.3 | 63.7 | 64.3 | -0.6 | 62.4 | 62.5 | -0.1 |
| Mixtral-8x22B-instruct | 76.2 | 62.8 | 62.5 | +0.3 | 65.3 | 64.8 | +0.5 |
| Qwen1.5-72B-chat | 75.6 | 59.8 | 60.2 | -0.4 | 59.1 | 59.6 | -0.5 |
| Llama-2-70B-chat | 68.9 | 52.2 | 51.8 | +0.4 | 51.2 | 50.9 | +0.3 |
| **Medium** | | | | | | | |
| Qwen2.5-32B-instruct | 83.9 | 69.7 | 68.8 | +0.9 | 68.9 | 68.8 | +0.1 |
| Phi-4-14B | 84.8 | 67.8 | 68.5 | -0.7 | 68.5 | 69.4 | -0.9 |
| Qwen2.5-14B-instruct | 79.9 | 66.4 | 66.1 | +0.3 | 67.0 | 66.0 | +1.0 |
| Phi-3-medium-instruct | 77.9 | 64.2 | 64.2 | +0.0 | 62.5 | 62.7 | -0.2 |
| Gemma2-27B | 75.2 | 63.9 | 63.5 | +0.4 | 64.2 | 64.0 | +0.2 |
| Yi-1.5-34B-chat | 76.8 | 61.3 | 60.5 | +0.8 | 60.6 | 59.5 | +1.1 |
| Mixtral-8x7B-instruct-v0.1 | 70.5 | 58.3 | 57.1 | -1.2 | 58.9 | 58.5 | +0.4 |
| Deepseek-v2-lite-chat | 55.7 | 49.3 | 48.7 | +0.6 | 48.2 | 47.7 | +0.5 |
| Baichuan-2-13B-chat | 57.3 | 48.3 | 48.6 | -0.3 | 47.1 | 48.1 | -1.0 |
| Llama-2-13B-chat | 54.8 | 42.8 | 42.1 | +0.7 | 44.8 | 44.6 | +0.2 |
| **Small** | | | | | | | |
| Qwen2.5-7B-instruct | 75.4 | 61.3 | 60.4 | +0.9 | 59.3 | 58.6 | +0.7 |
| Qwen2-7B-instruct | 70.5 | 58.1 | 57.9 | +0.2 | 58.3 | 57.4 | +0.9 |
| Glm-4-9B-chat | 72.4 | 57.8 | 57.9 | -0.1 | 58.6 | 58.7 | -0.1 |
| Internlm-2.5-7B-chat | 72.8 | 57.3 | 56.8 | +0.5 | 57.9 | 56.9 | +1.0 |
| Llama-3-8B-instruct | 68.4 | 57.3 | 56.5 | +0.8 | 56.4 | 55.4 | +1.0 |
| Llama-3.1-8B-instruct | 68.1 | 57.1 | 57.9 | -0.8 | 56.1 | 56.1 | +0.0 |
| Gemma-2-9B | 71.3 | 53.7 | 53.3 | +0.4 | 32.1 | 31.2 | +0.9 |
| Yi-1.5-6B-chat | 62.8 | 52.8 | 51.4 | +1.4 | 52.2 | 51.9 | +0.3 |
| Mistral-7B-instruct-v0.3 | 60.3 | 50.7 | 50.9 | -0.2 | 51.1 | 50.9 | +0.2 |
| Baichuan-2-7B-chat | 52.9 | 44.5 | 43.9 | +0.6 | 43.9 | 44.0 | -0.1 |
| Llama-2-7B-chat | 45.3 | 39.4 | 38.5 | +0.9 | 41.9 | 40.9 | +1.0 |
| **Mini** | | | | | | | |
| Phi-3-mini-instruct (3.8B) | 70.9 | 57.9 | 58.1 | -0.2 | 58.2 | 57.5 | +0.7 |
| Phi-3.5-mini-instruct (3.8B) | 69.1 | 57.9 | 57.4 | +0.5 | 58.3 | 57.7 | +0.6 |
| Qwen2.5-3B-instruct | 64.4 | 55.9 | 56.4 | -0.5 | 54.3 | 53.9 | +0.4 |
| Qwen2.5-1.5B-instruct | 50.7 | 51.2 | 51.0 | +0.2 | 50.7 | 50.4 | +0.3 |
| Qwen2-1.5B-instruct | 52.4 | 47.1 | 47.5 | -0.4 | 45.2 | 44.5 | +0.7 |
| Gemma-2-2B | 51.3 | 43.9 | 42.4 | +1.5 | 30.5 | 29.4 | +0.9 |
| Qwen2.5-0.5B-instruct | 24.1 | 41.9 | 41.1 | +0.8 | 36.0 | 34.9 | +1.1 |
| Internlm-2-chat-1.8b | 47.1 | 40.5 | 39.4 | +1.1 | 41.2 | 39.8 | +1.4 |
| Qwen2-0.5B-instruct | 37.9 | 38.3 | 38.3 | +0.0 | 33.5 | 33.5 | +0.0 |
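
The Δ columns in the leaderboard appear to be the test score minus the validation score; this is inferred from the numbers and not stated in the text. The sketch below recomputes the 5-shot Δ for a few rows copied from the leaderboard and flags any row whose reported value differs by more than a rounding tolerance. The function names and the 0.1 tolerance are illustrative choices.

```python
# (model, 5-shot test, 5-shot validation, reported 5-shot Δ), copied from the leaderboard.
ROWS = [
    ("GPT-4o", 73.4, 73.4, 0.0),
    ("Qwen2.5-72B-instruct", 71.6, 71.3, 0.3),
    ("Phi-4-14B", 67.8, 68.5, -0.7),
    ("Mixtral-8x7B-instruct-v0.1", 58.3, 57.1, -1.2),
]


def delta(test: float, validation: float) -> float:
    """Test minus validation, rounded to one decimal (assumed definition of Δ)."""
    return round(test - validation, 1)


def mismatches(rows, tolerance: float = 0.1):
    """Return (model, recomputed, reported) for rows that disagree beyond the tolerance."""
    flagged = []
    for model, test, validation, reported in rows:
        recomputed = delta(test, validation)
        # The small epsilon keeps floating-point noise from flagging exact-tolerance cases.
        if abs(recomputed - reported) > tolerance + 1e-9:
            flagged.append((model, recomputed, reported))
    return flagged


if __name__ == "__main__":
    for model, recomputed, reported in mismatches(ROWS):
        print(f"{model}: recomputed {recomputed:+.1f}, reported {reported:+.1f}")
```

Of these four rows, only Mixtral-8x7B-instruct-v0.1 is flagged: the scores give +1.2 while the table reports -1.2, which may be a sign slip in the table. The tolerance allows for published scores that were rounded before the difference was taken.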

5. Data Construction Pipeline

Fig3

The pipeline involves (1) MCQ Collection to gather a diverse set of questions; (2) MCQ Cleaning to ensure quality; (3) Difficulty Sampling to ensure an appropriate difficulty distribution for questions; (4) LLM Checking, in which LLMs including GPT-4o, Gemini, and Claude review the accuracy and safety of the data; and (5) Contamination-Free Processing to prevent data leakage and maintain dataset purity. This process yields MMLU-CF, which consists of 10,000 questions in the closed-source test set and 10,000 in the open-source validation set.

6. Contact

For any inquiries or concerns, feel free to reach out to us via Email: Qihao Zhao and Yangyu Huang.

7. Citation

@misc{zhao2024mmlucfcontaminationfreemultitasklanguage,
      title={MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark}, 
      author={Qihao Zhao and Yangyu Huang and Tengchao Lv and Lei Cui and Qinzheng Sun and Shaoguang Mao and Xin Zhang and Ying Xin and Qiufeng Yin and Scarlett Li and Furu Wei},
      year={2024},
      eprint={2412.15194},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.15194}, 
}

8. License

This repository is licensed under the MIT License. The validation dataset of MMLU-CF is subject to the CDLA-2.0 License.