Model | Leaderboard | Methodology | Evaluation | Robustness Analysis | Limitation | Citation | Outlook
Ethan Chern*, Haoyang Zou*, Xuefeng Li*, Jiewen Hu*, Kehua Feng, Junlong Li, Pengfei Liu+
- "*" Core contributors,
- "+" Corresponding Author, (GAIR) at Shanghai Jiao Tong University, Shanghai AI Lab
🔥 [2023/12/12] We released Abel-7B-002, a stronger (35% improvement on GSM8K, 126% improvement on MATH) and more generalizable model that achieves the best performance among all 7B models (80.44 on GSM8K, 29.46 on MATH).
- Please check the Model and Leaderboard sections for the latest results. This is the first time a 7B model has achieved over 80% accuracy on GSM8K.
- Refer to the Generalization section for our evaluation results on the model's generalization capabilities.
Model Name | HF Checkpoints | GSM8k | MATH | License |
---|---|---|---|---|
Abel-7B-002 | 🤗 7B | 80.44 | 29.46 | Apache License 2.0 |
Abel-7B-001 | 🤗 7B | 59.74 | 13.00 | Llama 2 |
Abel-13B-001 | 🤗 13B | 66.41 | 17.34 | Llama 2 |
Abel-70B-001 | 🤗 70B | 83.62 | 28.26 | Llama 2 |
Model | GSM8k | MATH | MathQA | SVAMP | SCQ5K-EN | ARC-E | ARC-C | HellaSwag | MMLU |
---|---|---|---|---|---|---|---|---|---|
Abel-7B-002 | 80.44 | 29.46 | 69.78 | 77.67 | 55.95 | 77.67 | 55.05 | 77.72 | 61.19 |
Abel-7B-001 | 59.74 | 13 | 1.21 | 57.67 | 9.3 | 53.32 | 38.97 | 63.51 | 40.59 |
MetaMath-Mistral-7B | 77.7 | 28.2 | 33.94 | 79.33 | 37.6 | 78.48 | 51.93 | 76.44 | 61.93 |
Qwen-7b | 47.84 | 9.34 | 27.44 | 53 | 40.05 | 74.97 | 53.05 | 86.85 | 57.98 |
Mistral-7b | 37.83 | 9.06 | 25.73 | 63 | 39.6 | 76.83 | 53.22 | 76.31 | 64.05 |
Yi-6b | 32.6 | 5.78 | 26.98 | 55.67 | 35.5 | 73.66 | 49.53 | 68.97 | 64.02 |
LLaMA2-7b | 12.96 | 2.78 | 11.52 | 44 | 28.24 | 71.12 | 46.61 | 71.32 | 46.7 |
We find that:
- Abel-7B-002 performs strongly across the mathematical datasets (GSM8K, MATH, MathQA, SVAMP, SCQ5K-EN).
- It is also competitive on out-of-domain reasoning datasets (ARC-E, ARC-C, HellaSwag), surpassing its base model, Mistral-7b.
- On MMLU, Abel-7B-002 shows only a marginal decrease of 3 points relative to Mistral-7b, whereas Abel-7B-001 drops 6 points relative to LLaMA2-7b.
Evaluation details:
- Each reported result is the maximum of the few-shot and zero-shot scores.
- GSM8K, MATH, MathQA, SVAMP, and SCQ5K-EN are evaluated with our scripts; MMLU, ARC-E, ARC-C, and HellaSwag are evaluated with OpenCompass.
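For reference, here is a minimal sketch of how final-answer accuracy on GSM8K-style outputs is commonly computed; the regex and normalization below are illustrative conventions, not necessarily those used in our scripts:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a completion.

    GSM8K references put the gold answer after '####'; model outputs are
    matched on their last numeric token. Both rules are common conventions,
    not necessarily the exact ones in our evaluation script.
    """
    # Prefer an explicit '#### <answer>' marker if present.
    marker = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    # Otherwise fall back to the last number in the text.
    numbers = re.findall(r"[-+]?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(
        extract_final_answer(p) == extract_final_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Following our protocol, the reported score is the max over settings:
# reported = max(accuracy(few_shot_preds, refs), accuracy(zero_shot_preds, refs))
```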
Abel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, areas in which our model is also comparatively strong. There is still a long way for us to go, though.
We show that:
- without tools
- without continued pretraining
- without reward model
- without RLHF
- ONLY using SFT
We have established a new state-of-the-art performance across open-source LLMs (that do not use external tools) on the GSM8k (83.62) and MATH (28.26) benchmarks. Specifically:
- the performance on GSM8K, at 83.62, surpasses top-tier models such as PaLM-1, Minerva (Google), Claude-Instant (Anthropic), and ChatGPT (OpenAI), lagging only one percentage point behind Google's latest model, PaLM-2-Flan.
- achieving an accuracy of 28.26% on highly challenging mathematical competition problems (compared to GPT-4's 42.5%), it maintains a significant lead over other open-source models, surpassing the previous best open-source model by 5.46 points.
- the 7B and 13B models achieve a historic milestone in open-source model performance on both GSM8K and MATH.
GAIRMath-Abel secures 3 positions in the Top 10 rankings and stands as the only university-led project on the list (the others come from star startups or big tech companies).
- Using our approach, we not only achieved excellent results on GSM8K and MATH, but when given a new dataset (TALSCQ-EN), we also quickly attained state-of-the-art (SOTA) performance without much effort, surpassing the commercial multi-billion-dollar model MathGPT and GPT-4.
We demonstrate that:
- the capabilities of SFT are significantly underestimated, and researchers should approach SFT with due reverence and caution
- exceptional mathematical problem-solving capability can be achieved solely through SFT, which opens up more imaginative possibilities for future exploration in this direction.
- 🔒 stands for a proprietary model, while 🌍 represents an open-source model.
- 🎓 indicates that model development is led by an academic university (instead of a company).
- We only consider models that do not use any tools (e.g., Python).
Ranking | Model | Param. | Leading Organization | GSM8K | MATH |
---|---|---|---|---|---|
🔒 1 | GPT-4 | unknown | OpenAI | 92.0 | 42.5 |
🔒 2 | Claude-2 | unknown | Anthropic | 88.0 | - |
🔒 3 | PaLM-2-Flan | unknown | Google | 84.7 | 33.2 |
🌍 4 | GAIRMath-Abel | 70B | 🎓 GAIR Lab at Shanghai Jiaotong University | 83.6 | 28.3 |
🌍 5 | WizardMath | 70B | Microsoft | 81.6 | 22.7 |
🔒 6 | Claude-Instant | unknown | Anthropic | 80.9 | - |
🔒 7 | ChatGPT | unknown | OpenAI | 80.8 | 34.1 |
🌍 4 | Abel-002 | 7B | 🎓 GAIR Lab at Shanghai Jiaotong University | 80.4 | 29.5 |
🔒 8 | ChatGPT-0301 | unknown | OpenAI | 74.9 | - |
🌍 9 | GAIRMath-Abel | 13B | 🎓 GAIR Lab at Shanghai Jiaotong University | 66.4 | 17.3 |
🌍 10 | GAIRMath-Abel | 7B | 🎓 GAIR Lab at Shanghai Jiaotong University | 59.7 | 13.0 |
🔒 11 | Minerva | 540B | Google | 58.8 | 33.6 |
🔒 12 | PaLM | 540B | Google | 56.9 | 8.8 |
🌍 13 | Llama-2 | 70B | Meta | 56.8 | 13.5 |
🌍 14 | RFT | 33B | OFA | 56.5 | 7.4 |
🌍 15 | Baichuan2-13B | 13B | Baichuan | 52.8 | 10.1 |
🔒 16 | Minerva | 62B | Google | 52.4 | 27.6 |
🔒 17 | PaLM | 64B | Google | 52.4 | 4.4 |
🌍 18 | RFT | 13B | OFA | 52.1 | 5.1 |
🌍 19 | LLaMA | 65B | Meta | 50.9 | 10.6 |
🌍 20 | Qwen | 7B | Alibaba | 44.9 | 8.5 |
🔒 21 | Chinchilla | 70B | DeepMind | 43.7 | - |
🌍 22 | Llama-2 | 34B | Meta | 42.2 | 6.24 |
🌍 23 | Galactica | 30B | Meta | 41.7 | 12.7 |
🌍 24 | ChatGLM2 | 12B | Zhipu | 40.9 | - |
🔒 25 | Text-davinci-002 | 175B | OpenAI | 40.7 | 19.1 |
🌍 26 | Llama | 33B | Meta | 35.6 | 7.1 |
🔒 27 | GPT-3 | 175B | OpenAI | 34.0 | 5.2 |
🌍 28 | InternLM | 7B | Shanghai AI Lab | 31.2 | - |
🌍 29 | Llama-2 | 13B | Meta | 28.7 | 3.9 |
🌍 30 | Vicuna v1.3 | 13B | LMSys | 27.6 | - |
🌍 31 | Falcon | 40B | Technology Innovation Institute | 19.6 | 2.5 |
🌍 32 | Llama | 13B | Meta | 17.8 | 3.9 |
🌍 33 | MPT | 30B | MosaicML | 15.2 | 3.1 |
🌍 34 | Galactica | 6.7B | Meta | 10.2 | 2.2 |
We propose **Parental Oversight**, a babysitting strategy for supervised fine-tuning.

Parental Oversight is not limited to any specific data processing method. Instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). We believe that in the era of GAI, data structure engineering has emerged as a new paradigm. Within this paradigm, the manner in which the fine-tuning data is processed significantly impacts the performance of the trained GAI. We expect a growing number of studies in the community to focus on this data processing philosophy.

The principle of Parental Oversight emphasizes treating supervised fine-tuning with care and prudence, analogous to the way parents are encouraged to educate their children. Different types of data, along with their presentation formats (e.g., step-by-step reasoning, iterative refinement), can be likened to varied educational methods. Just as parents carefully select the most effective approach to instruct their children, GAI practitioners should carefully select the most effective data processing approaches to better instruct their LLMs.

Furthermore, the "more data is better" philosophy does not always hold. The quality and relevance of annotated samples can often outweigh their quantity. Training samples used in SFT should not just present the right answer, but also instruct the model on how the correct answer is derived from the LLM's knowledge. Additionally, if the LLM's knowledge is not sufficient to answer a question, Parental Oversight should step in to address the knowledge gaps promptly.
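To make the "presentation format" point concrete, here is a small illustrative sketch of a step-by-step SFT sample; the JSONL layout and the "question"/"answer" field names are our assumptions for illustration, not Abel's actual training schema:

```python
import json

# A hypothetical SFT sample: the answer is not just the final number,
# but a derivation showing how it follows step by step. Field names
# ("question"/"answer") are illustrative, not Abel's actual schema.
sample = {
    "question": (
        "A bag holds 3 red and 5 blue marbles. "
        "How many marbles are in 4 such bags?"
    ),
    "answer": (
        "Step 1: One bag holds 3 + 5 = 8 marbles.\n"
        "Step 2: Four bags hold 4 * 8 = 32 marbles.\n"
        "The answer is 32."
    ),
}

# Append the sample to a JSONL training file.
with open("sft_samples.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```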
- Create a conda environment: `conda create -n abel python=3.10`
- Activate the environment: `conda activate abel`
- Install the dependencies: `pip install -r requirements.txt`
- Run the evaluation: `bash evaluation/eval.sh`. Part of the evaluation script is modified from Minerva.
- Note: we did observe some non-deterministic behavior when conducting evaluation, which might be related to this vllm issue; thus, the results you obtain may differ slightly from ours. You can also check our evaluation output in the `./outputs` directory.
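Beyond the evaluation script, the released checkpoints can be queried directly. Below is a minimal inference sketch using Hugging Face `transformers`; the repo id is inferred from the model table above and the prompt layout is an assumption, so please check the model card for the exact template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/Abel-7B-002"  # repo id inferred from the model table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?")
# Plain "Question/Answer" prompting is an assumption; the model card may
# specify a dedicated instruction template.
prompt = f"Question:\n{question}\nAnswer:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```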
Our robustness analysis consists of two parts: Adversarial Evaluation on the GSM8k_robust dataset and Supervised Transfer Learning on the TAL-SCQ5K-EN dataset. We perform a preliminary analysis to understand (1) whether Abel overfits the training dataset and is thus brittle to out-of-distribution testing samples and (2) whether our SFT approach can quickly transfer and generalize Abel to datasets from different distributions.
GSM8k_robust is a dataset we built from GSM8k: using GPT-4, we randomly modified the numbers within the GSM8k questions without altering any other information, and asked GPT-4 to generate "golden answers" for the modified questions. After manually reviewing a subset of these samples, we found all the generated answers for the altered questions to be accurate. We use GSM8k_robust to evaluate whether models overfit the training data, which would make them susceptible to out-of-distribution testing samples. Our analysis indicates that Abel is more robust to out-of-distribution testing samples than other models.
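For illustration, the perturbation query described above might look like the following sketch with an OpenAI-style client; the prompt wording and call details are assumptions, not our exact pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERTURB_PROMPT = (
    "Rewrite the following math word problem, changing only the numbers to "
    "different reasonable values and keeping all other wording identical. "
    "Then solve the rewritten problem step by step, ending with "
    "'#### <final answer>'.\n\n{question}"
)

def perturb(question: str) -> str:
    """One GPT-4 call yields both the altered question and its golden answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PERTURB_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content
```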
Model | GSM8k | GSM8k_robust | delta |
---|---|---|---|
Abel-7B | 59.74 | 58.23 | -1.51 |
Abel-13B | 66.41 | 66.57 | +0.16 |
Abel-70B | 83.62 | 81.80 | -1.82 |
WizardMath-70B | 81.60 | 74.91 | -6.69 |
WizardMath-13B | 63.90 | 59.51 | -4.39 |
RFT-7B | 41.7 | 37.98 | -3.72 |
We demonstrate that Abel-70B not only achieves SOTA on the GSM8k and MATH datasets but also generalizes well to TAL-SCQ5K-EN 2K, a dataset newly released by the Math LLM provider TAL (好未来). Our analysis indicates that our SFT approach can successfully generalize Abel to datasets from different distributions; we will conduct further analyses and experiments to explore and improve Abel's generalization capabilities. A minimal fine-tuning sketch of this kind of transfer follows the results table below.
Model | TAL-SCQ5K-EN 2K Testing Benchmark |
---|---|
Abel-70B | 59.7 |
MathGPT | 59.0 |
GPT-4 | 51.0 |
Llama-70B | 43.8 |
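As noted above, here is a minimal supervised transfer fine-tuning sketch with Hugging Face `transformers`; the data file, prompt format, and hyperparameters are illustrative assumptions rather than our actual recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "GAIR/Abel-7B-002"  # repo id assumed from the model table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical JSONL file with "question"/"answer" fields (see the
# sample-construction sketch in the Methodology section).
dataset = load_dataset("json", data_files="tal_scq5k_en_train.jsonl")["train"]

def to_features(example):
    # Prompt layout is an assumption; follow the model card if it differs.
    text = (f"Question:\n{example['question']}\n"
            f"Answer:\n{example['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="abel-tal-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,       # illustrative hyperparameters
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=dataset,
    # mlm=False yields standard causal-LM labels (shifted inside the model).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```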
- Overfitting: Although we conducted a robustness analysis, generative AI for mathematics is inherently fragile (often necessitating advanced decoding strategies such as majority voting; a sketch follows this list), and excessive reliance on constructing SFT samples to boost performance can inevitably push the model toward overfitting. (Overfitting is nevertheless not the primary concern of the current project: even when overfitting various augmented training data, it remains challenging to achieve favorable results on test sets for complex mathematical reasoning tasks, such as the MATH dataset.) We still need to perform a more extensive robustness analysis (#1) and actively explore training methods that can turn the model into a mathematical polymath, together with a more comprehensive cross-domain generalization analysis.
- Generalization: A good mathematical model should not be limited to solving problems on the GSM8K and MATH datasets; it should be capable of handling various types of problems, including those that assess different knowledge domains and require different types of responses (e.g., multiple-choice, true/false, proofs, arithmetic, etc.). The current model's capabilities are insufficient to generalize to these diverse scenarios (#2).
- Universality: Ultimately, we anticipate that the mathematical reasoning abilities enabled by large models can be integrated into chatbots for domains such as medicine, law, physics, and chemistry. The key to achieving AGI is incorporating the power of a strong mathematical model into other models, which the current model lacks (#3).
- Multilinguality: The current model's training data and base model limit its ability to respond in languages other than English (#4).
- Advanced techniques: The current model focuses primarily on SFT; advanced techniques such as reward models, RLHF (Reinforcement Learning from Human Feedback), and tool use have not yet been explored (#5, #6).
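The majority-voting (self-consistency) decoding mentioned in the overfitting item can be sketched as follows; it reuses the hypothetical `extract_final_answer` helper from the evaluation sketch earlier, and the sample count is an arbitrary choice:

```python
from collections import Counter
from typing import Callable, Optional

def majority_vote(generate: Callable[[str], str], question: str,
                  n_samples: int = 16) -> Optional[str]:
    """Self-consistency decoding: sample several reasoning paths with a
    stochastic `generate` callable (temperature > 0) and return the most
    frequent final answer. Reuses the hypothetical extract_final_answer
    helper from the evaluation sketch above.
    """
    answers = []
    for _ in range(n_samples):
        answer = extract_final_answer(generate(question))
        if answer is not None:
            answers.append(answer)
    # Fall back to None when no sample yields a parseable answer.
    return Counter(answers).most_common(1)[0][0] if answers else None
```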
We maintain a list of issues to track these limitations and potential solutions. Your opinions and comments are always welcome.

Please cite this repo if the model, code, or conclusions in it are helpful to you.
@misc{abel,
author = {Chern, Ethan and Zou, Haoyang and Li, Xuefeng and Hu, Jiewen and Feng, Kehua and Li, Junlong and Liu, Pengfei},
title = {Generative AI for Math: Abel},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/GAIR-NLP/abel}},
}
- We thank the Shanghai AI Lab for supporting a portion of the computing resources.
- We thank Jiasheng Gu for the helpful discussions in the early stage of the project.
We are continuously refining our models and will be releasing updates. Stay tuned!