Model | Leaderboard | Methodology | Evaluation | Robustness Analysis | Limitation | Citation | Outlook
Ethan Chern*, Haoyang Zou*, Xuefeng Li*, Jiewen Hu*, Kehua Feng, Junlong Li, Pengfei Liu+
- "*" Core contributors,
- "+" Corresponding Author, (GAIR) at Shanghai Jiao Tong University, Shanghai AI Lab
🔥 [2023/12/12] We released Abel-7B-002, a stronger (35% improvement on GSM8K, 126% improvement on MATH) and more generalizable model that achieves the best performance among all 7B models (80.44 on GSM8K, 29.46 on MATH).
- Please check the Model and Leaderboard sections for the latest results. This is the first time a 7B model has achieved over 80% accuracy on GSM8K.
- Refer to the Generalization section for our evaluation results on the model's generalization capabilities.
Model Name | HF Checkpoints | GSM8k | MATH | License |
---|---|---|---|---|
Abel-7B-002 | 🤗 7B | 80.44 | 29.46 | Apache License 2.0 |
Abel-7B-001 | 🤗 7B | 59.74 | 13.00 | Llama 2 |
Abel-13B-001 | 🤗 13B | 66.41 | 17.34 | Llama 2 |
Abel-70B-001 | 🤗 70B | 83.62 | 28.26 | Llama 2 |
Model | GSM8k | MATH | MathQA | SVAMP | SCQ5K-EN | ARC-E | ARC-C | HellaSwag | MMLU |
---|---|---|---|---|---|---|---|---|---|
Abel-7B-002 | 80.44 | 29.46 | 69.78 | 77.67 | 55.95 | 77.67 | 55.05 | 77.72 | 61.19 |
Abel-7B-001 | 59.74 | 13 | 1.21 | 57.67 | 9.3 | 53.32 | 38.97 | 63.51 | 40.59 |
MetaMath-Mistral-7B | 77.7 | 28.2 | 33.94 | 79.33 | 37.6 | 78.48 | 51.93 | 76.44 | 61.93 |
Qwen-7b | 47.84 | 9.34 | 27.44 | 53 | 40.05 | 74.97 | 53.05 | 86.85 | 57.98 |
Mistral-7b | 37.83 | 9.06 | 25.73 | 63 | 39.6 | 76.83 | 53.22 | 76.31 | 64.05 |
Yi-6b | 32.6 | 5.78 | 26.98 | 55.67 | 35.5 | 73.66 | 49.53 | 68.97 | 64.02 |
LLaMA2-7b | 12.96 | 2.78 | 11.52 | 44 | 28.24 | 71.12 | 46.61 | 71.32 | 46.7 |
We find that:
- Abel-7B-002 performs strongly across the mathematical datasets (GSM8K, MATH, MathQA, SVAMP, SCQ5K-EN).
- It is also competitive on out-of-domain reasoning datasets (ARC-E, ARC-C, HellaSwag), surpassing its base model, Mistral-7b.
- On MMLU, Abel-7B-002 shows only a marginal decrease of 3 points relative to Mistral-7b, whereas Abel-7B-001 drops 6 points relative to LLaMA2-7b.
Evaluation details:
- Each reported result is the maximum of the few-shot and zero-shot scores.
- GSM8K, MATH, MathQA, SVAMP, and SCQ5K-EN are evaluated with our scripts; MMLU, ARC-E, ARC-C, and HellaSwag are evaluated with OpenCompass.
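For reference, here is a minimal sketch of how final-answer accuracy on GSM8K-style outputs is commonly computed; the regex and normalization below are illustrative conventions, not necessarily those used in our scripts:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a completion.

    GSM8K references put the gold answer after '####'; model outputs are
    matched on their last numeric token. Both rules are common conventions,
    not necessarily the exact ones in our evaluation script.
    """
    # Prefer an explicit '#### <answer>' marker if present.
    marker = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    # Otherwise fall back to the last number in the text.
    numbers = re.findall(r"[-+]?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(
        extract_final_answer(p) == extract_final_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Following our protocol, the reported score is the max over settings:
# reported = max(accuracy(few_shot_preds, refs), accuracy(zero_shot_preds, refs))
```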
Abel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, areas in which our model is also comparatively strong. There is still a long way for us to go, though.
We show that:
- without tools
- without continued pretraining
- without reward model
- without RLHF
- ONLY using SFT
We have established a new state-of-the-art performance across open-source LLMs (that do not use external tools) on the GSM8k (83.62) and MATH (28.26) benchmarks. Specifically:
- the performance on GSM8K, at 83.62, surpasses top-tier models such as PaLM-1, Minerva (Google), Claude-Instant (Anthropic), and ChatGPT (OpenAI), lagging only one percentage point behind Google's latest model, PaLM-2-Flan.
- achieving an accuracy of 28.26% on highly challenging mathematical competition problems (compared to GPT-4's 42.5%), it maintains a significant lead over other open-source models, surpassing the previous best open-source model by 5.46 points.
- the 7B and 13B models achieve a historic milestone in open-source model performance on both GSM8K and MATH.
GAIRMath-Abel secures 3 positions in the Top 10 rankings and stands as the only university-led project on the list (the others come from star startups or big tech companies).
- Using our approach, we not only achieved excellent results on GSM8K and MATH, but when given a new dataset (TALSCQ-EN), we also quickly attained state-of-the-art (SOTA) performance without much effort, surpassing the commercial multi-billion-dollar model MathGPT and GPT-4.
We demonstrate that:
- the capabilities of SFT are significantly underestimated, and researchers should approach SFT with due reverence and caution
- exceptional mathematical problem-solving capability can be achieved solely through SFT, which opens up more imaginative possibilities for future exploration in this direction.
- 🔒 stands for a proprietary model, while 🌍 represents an open-source model.
- 🎓 indicates that model development is led by an academic university (instead of a company).
- We only consider models that do not use any tools (e.g., Python).
Ranking | Model | Param. | Leading Organization | GSM8K | MATH |
---|---|---|---|---|---|
🔒 1 | GPT-4 | unknown | OpenAI | 92.0 | 42.5 |
🔒 2 | Claude-2 | unknown | Anthropic | 88.0 | - |
🔒 3 | PaLM-2-Flan | unknown | Google | 84.7 | 33.2 |
🌍 4 | GAIRMath-Abel | 70B | 🎓 GAIR Lab at Shanghai Jiaotong University | 83.6 | 28.3 |
🌍 5 | WizardMath | 70B | Microsoft | 81.6 | 22.7 |
🔒 6 | Claude-Instant | unknown | Anthropic | 80.9 | - |
🔒 7 | ChatGPT | unknown | OpenAI | 80.8 | 34.1 |
🌍 4 | Abel-002 | 7B | 🎓 GAIR Lab at Shanghai Jiaotong University | 80.4 | 29.5 |
🔒 8 | ChatGPT-0301 | unknown | OpenAI | 74.9 | - |
🌍 9 | GAIRMath-Abel | 13B | 🎓 GAIR Lab at Shanghai Jiaotong University | 66.4 | 17.3 |
🌍 10 | GAIRMath-Abel | 7B | 🎓 GAIR Lab at Shanghai Jiaotong University | 59.7 | 13.0 |
🔒 11 | Minerva | 540B | Google | 58.8 | 33.6 |
🔒 12 | PaLM | 540B | Google | 56.9 | 8.8 |
🌍 13 | Llama-2 | 70B | Meta | 56.8 | 13.5 |
🌍 14 | RFT | 33B | OFA | 56.5 | 7.4 |
🌍 15 | Baichuan2-13B | 13B | Baichuan | 52.8 | 10.1 |
🔒 16 | Minerva | 62B | Google | 52.4 | 27.6 |
🔒 17 | PaLM | 64B | Google | 52.4 | 4.4 |
🌍 18 | RFT | 13B | OFA | 52.1 | 5.1 |
🌍 19 | LLaMA | 65B | Meta | 50.9 | 10.6 |
🌍 20 | Qwen | 7B | Alibaba | 44.9 | 8.5 |
🔒 21 | Chinchilla | 70B | DeepMind | 43.7 | - |
🌍 22 | Llama-2 | 34B | Meta | 42.2 | 6.24 |
🌍 23 | Galactica | 30B | Meta | 41.7 | 12.7 |
🌍 24 | ChatGLM2 | 12B | Zhipu | 40.9 | - |
🔒 25 | Text-davinci-002 | 175B | OpenAI | 40.7 | 19.1 |
🌍 26 | Llama | 33B | Meta | 35.6 | 7.1 |
🔒 27 | GPT-3 | 175B | OpenAI | 34.0 | 5.2 |
🌍 28 | InternLM | 7B | Shanghai AI Lab | 31.2 | - |
🌍 29 | Llama-2 | 13B | Meta | 28.7 | 3.9 |
🌍 30 | Vicuna v1.3 | 13B | LMSys | 27.6 | - |
🌍 31 | Falcon | 40B | Technology Innovation Institute | 19.6 | 2.5 |
🌍 32 | Llama | 13B | Meta | 17.8 | 3.9 |
🌍 33 | MPT | 30B | MosaicML | 15.2 | 3.1 |
🌍 34 | Galactica | 6.7B | Meta | 10.2 | 2.2 |
We propose **Parental Oversight**, a babysitting strategy for supervised fine-tuning.

Parental Oversight is not limited to any specific data processing method. Instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). We believe that in the era of GAI, data structure engineering has emerged as a new paradigm. Within this paradigm, the manner in which the fine-tuning data is processed significantly impacts the performance of the trained GAI. We expect a growing number of studies in the community to focus on this data processing philosophy.

The principle of Parental Oversight emphasizes treating supervised fine-tuning with care and prudence, analogous to the way parents are encouraged to educate their children. Different types of data, along with their presentation formats (e.g., step-by-step reasoning, iterative refinement), can be likened to varied educational methods. Just as parents carefully select the most effective approach to instruct their children, GAI practitioners should carefully select the most effective data processing approaches to better instruct their LLMs.

Furthermore, the "more data is better" philosophy does not always hold. The quality and relevance of annotated samples can often outweigh their quantity. Training samples used in SFT should not just present the right answer, but also instruct the model on how the correct answer is derived from the LLM's knowledge. Additionally, if the LLM's knowledge is not sufficient to answer a question, Parental Oversight should step in to address the knowledge gaps promptly.
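To make the "presentation format" point concrete, here is a small illustrative sketch of a step-by-step SFT sample; the JSONL layout and the "question"/"answer" field names are our assumptions for illustration, not Abel's actual training schema:

```python
import json

# A hypothetical SFT sample: the answer is not just the final number,
# but a derivation showing how it follows step by step. Field names
# ("question"/"answer") are illustrative, not Abel's actual schema.
sample = {
    "question": (
        "A bag holds 3 red and 5 blue marbles. "
        "How many marbles are in 4 such bags?"
    ),
    "answer": (
        "Step 1: One bag holds 3 + 5 = 8 marbles.\n"
        "Step 2: Four bags hold 4 * 8 = 32 marbles.\n"
        "The answer is 32."
    ),
}

# Append the sample to a JSONL training file.
with open("sft_samples.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```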
- Create a conda environment: `conda create -n abel python=3.10`
- Activate the environment: `conda activate abel`
- Install the dependencies: `pip install -r requirements.txt`
- Run the evaluation: `bash evaluation/eval.sh`. Part of the evaluation script is modified from Minerva.
- Note: we did observe some non-deterministic behavior when conducting evaluation, which might be related to this vllm issue; thus, the results you obtain may differ slightly from ours. You can also check our evaluation output in the `./outputs` directory.
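Beyond the evaluation script, the released checkpoints can be queried directly. Below is a minimal inference sketch using Hugging Face `transformers`; the repo id is inferred from the model table above and the prompt layout is an assumption, so please check the model card for the exact template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/Abel-7B-002"  # repo id inferred from the model table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?")
# Plain "Question/Answer" prompting is an assumption; the model card may
# specify a dedicated instruction template.
prompt = f"Question:\n{question}\nAnswer:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```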
Our robustness analysis consists of two parts: Adversarial Evaluation on the GSM8k_robust dataset and Supervised Transfer Learning on the TAL-SCQ5K-EN dataset. We perform a preliminary analysis to understand (1) whether Abel overfits the training dataset and is thus brittle to out-of-distribution testing samples and (2) whether our SFT approach can quickly transfer and generalize Abel to datasets from different distributions.
GSM8k_robust is a dataset we built from GSM8k: using GPT-4, we randomly modified the numbers within the GSM8k questions without altering any other information, and asked GPT-4 to generate "golden answers" for the modified questions. After manually reviewing a subset of these samples, we found all the generated answers for the altered questions to be accurate. We use GSM8k_robust to evaluate whether models overfit the training data, which would make them susceptible to out-of-distribution testing samples. Our analysis indicates that Abel is more robust to out-of-distribution testing samples than other models.
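For illustration, the perturbation query described above might look like the following sketch with an OpenAI-style client; the prompt wording and call details are assumptions, not our exact pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERTURB_PROMPT = (
    "Rewrite the following math word problem, changing only the numbers to "
    "different reasonable values and keeping all other wording identical. "
    "Then solve the rewritten problem step by step, ending with "
    "'#### <final answer>'.\n\n{question}"
)

def perturb(question: str) -> str:
    """One GPT-4 call yields both the altered question and its golden answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PERTURB_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content
```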
Model | GSM8k | GSM8k_robust | delta |
---|---|---|---|
Abel-7B | 59.74 | 58.23 | -1.51 |
Abel-13B | 66.41 | 66.57 | +0.16 |
Abel-70B | 83.62 | 81.80 | -1.82 |
WizardMath-70B | 81.60 | 74.91 | -6.69 |
WizardMath-13B | 63.90 | 59.51 | -4.39 |
RFT-7B | 41.7 | 37.98 | -3.72 |
We demonstrate that Abel-70B not only achieves SOTA on the GSM8k and MATH datasets but also generalizes well to TAL-SCQ5K-EN 2K, a dataset newly released by the Math LLM provider TAL (好未来). Our analysis indicates that our SFT approach can successfully generalize Abel to datasets from different distributions; we will conduct further analyses and experiments to explore and improve Abel's generalization capabilities. A minimal fine-tuning sketch of this kind of transfer follows the results table below.
Model | TAL-SCQ5K-EN 2K Testing Benchmark |
---|---|
Abel-70B | 59.7 |
MathGPT | 59.0 |
GPT-4 | 51.0 |
Llama-70B | 43.8 |
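As noted above, here is a minimal supervised transfer fine-tuning sketch with Hugging Face `transformers`; the data file, prompt format, and hyperparameters are illustrative assumptions rather than our actual recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "GAIR/Abel-7B-002"  # repo id assumed from the model table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical JSONL file with "question"/"answer" fields (see the
# sample-construction sketch in the Methodology section).
dataset = load_dataset("json", data_files="tal_scq5k_en_train.jsonl")["train"]

def to_features(example):
    # Prompt layout is an assumption; follow the model card if it differs.
    text = (f"Question:\n{example['question']}\n"
            f"Answer:\n{example['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="abel-tal-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,       # illustrative hyperparameters
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=dataset,
    # mlm=False yields standard causal-LM labels (shifted inside the model).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```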
- Overfitting: Although we conducted a robustness analysis, generative AI for mathematics is inherently fragile (often necessitating advanced decoding strategies such as majority voting; a sketch follows this list), and excessive reliance on constructing SFT samples to boost performance can inevitably push the model toward overfitting. (Overfitting is nevertheless not the primary concern of the current project: even when overfitting various augmented training data, it remains challenging to achieve favorable results on test sets for complex mathematical reasoning tasks, such as the MATH dataset.) We still need to perform a more extensive robustness analysis (#1) and actively explore training methods that can turn the model into a mathematical polymath, together with a more comprehensive cross-domain generalization analysis.
- Generalization: A good mathematical model should not be limited to solving problems on the GSM8K and MATH datasets; it should be capable of handling various types of problems, including those that assess different knowledge domains and require different types of responses (e.g., multiple-choice, true/false, proofs, arithmetic, etc.). The current model's capabilities are insufficient to generalize to these diverse scenarios (#2).
- Universality: Ultimately, we anticipate that the mathematical reasoning abilities enabled by large models can be integrated into chatbots for domains such as medicine, law, physics, and chemistry. The key to achieving AGI is incorporating the power of a strong mathematical model into other models, which the current model lacks (#3).
- Multilinguality: The current model's training data and base model limit its ability to respond in languages other than English (#4).
- Advanced techniques: The current model focuses primarily on SFT; advanced techniques such as reward models, RLHF (Reinforcement Learning from Human Feedback), and tool use have not yet been explored (#5, #6).
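The majority-voting (self-consistency) decoding mentioned in the overfitting item can be sketched as follows; it reuses the hypothetical `extract_final_answer` helper from the evaluation sketch earlier, and the sample count is an arbitrary choice:

```python
from collections import Counter
from typing import Callable, Optional

def majority_vote(generate: Callable[[str], str], question: str,
                  n_samples: int = 16) -> Optional[str]:
    """Self-consistency decoding: sample several reasoning paths with a
    stochastic `generate` callable (temperature > 0) and return the most
    frequent final answer. Reuses the hypothetical extract_final_answer
    helper from the evaluation sketch above.
    """
    answers = []
    for _ in range(n_samples):
        answer = extract_final_answer(generate(question))
        if answer is not None:
            answers.append(answer)
    # Fall back to None when no sample yields a parseable answer.
    return Counter(answers).most_common(1)[0][0] if answers else None
```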
We maintain a list of issues to track these limitations and potential solutions. Your opinions and comments are always welcome.

Please cite this repo if the model, code, or conclusions in it are helpful to you.
@misc{abel,
author = {Chern, Ethan and Zou, Haoyang and Li, Xuefeng and Hu, Jiewen and Feng, Kehua and Li, Junlong and Liu, Pengfei},
title = {Generative AI for Math: Abel},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/GAIR-NLP/abel}},
}
- We thank the Shanghai AI Lab for supporting a portion of the computing resources.
- We thank Jiasheng Gu for the helpful discussions in the early stage of the project.
We are continuously refining our models and will be releasing updates. Stay tuned!