
Generative AI for Math: Abel

Model | Leaderboard | Methodology | Evaluation | Robustness Analysis | Limitation | Citation | Outlook

Ethan Chern*, Haoyang Zou*, Xuefeng Li*, Jiewen Hu*, Kehua Feng, Junlong Li, Pengfei Liu+

  • "*" Core contributors,
  • "+" Corresponding Author, (GAIR) at Shanghai Jiao Tong University, Shanghai AI Lab

News

🔥 [2023/12/12] We released Abel-7B-002, a stronger (35% improvement on GSM8K, 126% improvement on MATH) and more generalized model that achieves the best performance among all 7B models (80.44 on GSM8K, 29.46 on MATH).

  • Please check the Model and Leaderboard sections for the latest results. We achieved an accuracy of over 80% on GSM8K for the first time with a 7B model.
  • Refer to the Generalization section for our evaluation results on the model's generalization capabilities.

Models and Performance

| Model Name | HF Checkpoint | GSM8K | MATH | License |
|---|---|---|---|---|
| Abel-7B-002 | 🤗 7B | 80.44 | 29.46 | Apache License 2.0 |
| Abel-7B-001 | 🤗 7B | 59.74 | 13.00 | Llama 2 |
| Abel-13B-001 | 🤗 13B | 66.41 | 17.34 | Llama 2 |
| Abel-70B-001 | 🤗 70B | 83.62 | 28.26 | Llama 2 |
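
The checkpoints can be loaded with Hugging Face transformers. A minimal sketch (the hub id below is an assumed placeholder for illustration; use the actual checkpoint linked above):

```python
# Minimal loading/inference sketch with Hugging Face transformers.
# The hub id is an assumed placeholder -- point it at the checkpoint linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/Abel-7B-002"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Question: What is 15% of 240?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```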

Generalization

| Model | GSM8K | MATH | MathQA | SVAMP | SCQ5K-EN | ARC-E | ARC-C | HellaSwag | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Abel-7B-002 | 80.44 | 29.46 | 69.78 | 77.67 | 55.95 | 77.67 | 55.05 | 77.72 | 61.19 |
| Abel-7B-001 | 59.74 | 13.00 | 1.21 | 57.67 | 9.30 | 53.32 | 38.97 | 63.51 | 40.59 |
| MetaMath-Mistral-7B | 77.70 | 28.20 | 33.94 | 79.33 | 37.60 | 78.48 | 51.93 | 76.44 | 61.93 |
| Qwen-7b | 47.84 | 9.34 | 27.44 | 53.00 | 40.05 | 74.97 | 53.05 | 86.85 | 57.98 |
| Mistral-7b | 37.83 | 9.06 | 25.73 | 63.00 | 39.60 | 76.83 | 53.22 | 76.31 | 64.05 |
| Yi-6b | 32.60 | 5.78 | 26.98 | 55.67 | 35.50 | 73.66 | 49.53 | 68.97 | 64.02 |
| LLaMA2-7b | 12.96 | 2.78 | 11.52 | 44.00 | 28.24 | 71.12 | 46.61 | 71.32 | 46.70 |

From these results, we find that:

  • Abel-7B-002 performs excellently on mathematical datasets (GSM8K, MATH, MathQA, SVAMP, SCQ5K-EN).
  • It is also competitive on out-of-domain reasoning datasets (ARC-E, ARC-C, HellaSwag), surpassing its base model Mistral-7b.
  • On MMLU, Abel-7B-002 shows only a marginal decrease of 3 points compared to Mistral-7b, whereas Abel-7B-001 drops 6 points compared to LLaMA2-7b.

Evaluation details:

  • Each reported result is the maximum of the few-shot and zero-shot scores.
  • GSM8K, MATH, MathQA, SVAMP, and SCQ5K-EN are evaluated with our scripts, while MMLU, ARC-E, ARC-C, and HellaSwag are evaluated with OpenCompass.
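
In other words, the reported number per benchmark follows this simple rule (a sketch of the aggregation stated above; the per-setting scores here are placeholders, not our raw results):

```python
# Reported score = max(few-shot, zero-shot) per benchmark, per the protocol above.
# The per-setting numbers below are illustrative placeholders only.
raw_scores = {
    "GSM8K": {"few_shot": 79.2, "zero_shot": 80.4},
    "MATH": {"few_shot": 29.5, "zero_shot": 27.9},
}
reported = {bench: max(settings.values()) for bench, settings in raw_scores.items()}
print(reported)  # {'GSM8K': 80.4, 'MATH': 29.5}
```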

Introduction

πŸ“ Abel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, at which our model is relatively better as well. There is still a long way for us to go, though πŸƒβ€β™‚οΈπŸƒβ€β™€οΈπŸπŸƒβ€β™‚οΈπŸƒβ€β™€οΈ.

We show that:

  • without tools
  • without continued pretraining
  • without a reward model
  • without RLHF
  • using ONLY SFT

we establish new state-of-the-art performance among open-source LLMs (that do not use external tools) on the GSM8K (83.62) and MATH (28.26) benchmarks. Specifically:

  • Our GSM8K performance of 83.62 surpasses top-tier models such as PaLM-1, Minerva (Google), Claude-Instant (Anthropic), and ChatGPT (OpenAI), lagging only about 1 percentage point behind Google's latest model, PaLM-2-Flan.
  • With an accuracy of 28.26% on highly challenging mathematical competition problems (compared to GPT-4's 42.5%), Abel maintains a significant lead over other open-source models, surpassing the previous best open-source model by 5.46 points.
  • The 7B and 13B models achieve historic milestones for open-source model performance on both GSM8K and MATH.
  • GAIRMath-Abel secures 3 positions in the top-10 rankings and is the only university-led project on the list (the others are star startups or big tech companies).
  • With our approach, we not only achieved excellent results on GSM8K and MATH; when given a new dataset (TALSCQ-EN), we quickly attained state-of-the-art performance without much effort, surpassing the commercial multi-billion-dollar model MathGPT as well as GPT-4.

We demonstrate that:

  • The capabilities of SFT are significantly underestimated, and researchers should approach SFT with due reverence and caution.
  • Exceptional mathematical problem-solving capability can be achieved solely through SFT, which opens up more imaginative possibilities for future exploration in this direction.

Leaderboard for Mathematical Reasoning

  • 🔒 marks a proprietary model, while 🌍 marks an open-source model
  • 🎓 indicates that model development is led by a university (rather than a company)
  • We only consider models that do not use any tools (e.g., Python)

| Ranking | Model | Param. | Leading Organization | GSM8K | MATH |
|---|---|---|---|---|---|
| 🔒 1 | GPT-4 | unknown | OpenAI | 92.0 | 42.5 |
| 🔒 2 | Claude-2 | unknown | Anthropic | 88.0 | - |
| 🔒 3 | PaLM-2-Flan | unknown | Google | 84.7 | 33.2 |
| 🌍 4 | GAIRMath-Abel | 70B | 🎓 GAIR Lab at Shanghai Jiao Tong University | 83.6 | 28.3 |
| 🌍 5 | WizardMath | 70B | Microsoft | 81.6 | 22.7 |
| 🔒 6 | Claude-Instant | unknown | Anthropic | 80.9 | - |
| 🔒 7 | ChatGPT | unknown | OpenAI | 80.8 | 34.1 |
| 🌍 4 | Abel-002 | 7B | 🎓 GAIR Lab at Shanghai Jiao Tong University | 80.4 | 29.5 |
| 🔒 8 | ChatGPT-0301 | unknown | OpenAI | 74.9 | - |
| 🌍 9 | GAIRMath-Abel | 13B | 🎓 GAIR Lab at Shanghai Jiao Tong University | 66.4 | 17.3 |
| 🌍 10 | GAIRMath-Abel | 7B | 🎓 GAIR Lab at Shanghai Jiao Tong University | 59.7 | 13.0 |
| 🔒 11 | Minerva | 540B | Google | 58.8 | 33.6 |
| 🔒 12 | PaLM | 540B | Google | 56.9 | 8.8 |
| 🌍 13 | Llama-2 | 70B | Meta | 56.8 | 13.5 |
| 🌍 14 | RFT | 33B | OFA | 56.5 | 7.4 |
| 🌍 15 | Baichuan2-13B | 13B | Baichuan | 52.8 | 10.1 |
| 🔒 16 | Minerva | 62B | Google | 52.4 | 27.6 |
| 🔒 17 | PaLM | 64B | Google | 52.4 | 4.4 |
| 🌍 18 | RFT | 13B | OFA | 52.1 | 5.1 |
| 🌍 19 | LLaMA | 65B | Meta | 50.9 | 10.6 |
| 🌍 20 | Qwen | 7B | Alibaba | 44.9 | 8.5 |
| 🔒 21 | Chinchilla | 70B | DeepMind | 43.7 | - |
| 🌍 22 | Llama-2 | 34B | Meta | 42.2 | 6.24 |
| 🔒 23 | Galactica | 30B | Meta | 41.7 | 12.7 |
| 🌍 24 | ChatGLM2 | 12B | Zhipu | 40.9 | - |
| 🔒 25 | Text-davinci-002 | 175B | OpenAI | 40.7 | 19.1 |
| 🌍 26 | LLaMA | 33B | Meta | 35.6 | 7.1 |
| 🔒 27 | GPT-3 | 175B | OpenAI | 34.0 | 5.2 |
| 🌍 28 | InternLM | 7B | Shanghai AI Lab | 31.2 | - |
| 🌍 29 | Llama-2 | 13B | Meta | 28.7 | 3.9 |
| 🌍 30 | Vicuna v1.3 | 13B | LMSys | 27.6 | - |
| 🌍 31 | Falcon | 40B | Technology Innovation Institute | 19.6 | 2.5 |
| 🌍 32 | LLaMA | 13B | Meta | 17.8 | 3.9 |
| 🌍 33 | MPT | 30B | MosaicML | 15.2 | 3.1 |
| 🔒 34 | Galactica | 6.7B | Meta | 10.2 | 2.2 |

Methodology

We propose Parental Oversight, a babysitting strategy for supervised fine-tuning.

Parental Oversight is not limited to any specific data processing method. Instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). We believe that in the era of GAI, data structure engineering has emerged as a new paradigm. Within this paradigm, the manner in which the fine-tuning data is processed significantly impacts the performance of the trained GAI. We expect a growing number of studies in the community to focus on this data processing philosophy.

The principle of Parental Oversight emphasizes treating supervised fine-tuning with care and prudence. This is analogous to the way parents are encouraged to educate their children. Different types of data, along with their presentation formats (e.g., step-by-step reasoning, iterative refinement), can be likened to varied educational methods. Just as parents cautiously select the most effective approach to instruct their children, GAI practitioners should cautiously select the most effective data processing approaches to better instruct their LLMs.

Furthermore, the "more data, the better" philosophy doesn't always hold true. The quality and relevance of annotated samples can often outweigh their quantity. Training samples used in SFT should not just present the right answer, but also instruct the model on how the correct answer is derived from the LLM's knowledge. Additionally, if the LLM's knowledge is not sufficient to answer a question, Parental Oversight should step in to address the knowledge gaps promptly.
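
To make the data structure engineering idea concrete, here is a hypothetical SFT sample in this spirit (the field names and step-by-step format are our illustrative assumptions, not Abel's released training data):

```python
# A hypothetical SFT sample: the target teaches *how* the answer is derived,
# step by step, instead of presenting the final answer alone.
# Field names and formatting are illustrative assumptions.
sample = {
    "question": (
        "A baker makes 24 muffins and packs them into boxes of 4. "
        "Each box sells for $6. How much money does the baker earn?"
    ),
    "answer": (
        "Step 1: 24 muffins / 4 muffins per box = 6 boxes.\n"
        "Step 2: 6 boxes * $6 per box = $36.\n"
        "The answer is 36."
    ),
}
```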

Evaluation

  • Create a conda environment: `conda create -n abel python=3.10`
  • Activate the environment: `conda activate abel`
  • Install the dependencies: `pip install -r requirements.txt`
  • Run `bash evaluation/eval.sh`. Part of the evaluation script is modified from Minerva.
  • Note: we observed some nondeterminism when conducting evaluation, which might be related to this vllm issue, so the results you obtain may differ slightly from ours. You can also check our evaluation outputs in the ./outputs directory.
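
For intuition, here is a minimal sketch of the kind of final-answer matching a GSM8K-style evaluation performs (an illustrative approximation of ours, not the repo's exact script, which adapts Minerva's evaluation):

```python
import re

def extract_final_number(completion: str) -> str | None:
    """Return the last number in a completion, e.g. from 'The answer is 36.'"""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None

def is_correct(completion: str, gold: str) -> bool:
    """Compare the extracted prediction against the gold answer numerically."""
    pred = extract_final_number(completion)
    if pred is None:
        return False
    try:
        return abs(float(pred) - float(gold)) < 1e-6
    except ValueError:
        return pred == gold

# Accuracy over (completion, gold) pairs:
pairs = [("Step 1: 24 / 4 = 6 boxes. The answer is 36.", "36")]
print(sum(is_correct(c, g) for c, g in pairs) / len(pairs))
```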

Robustness Analysis

Our robustness analysis consists of two parts: Adversarial Evaluation on the GSM8k_robust dataset and Supervised Transfer Learning on the TAL-SCQ5K-EN dataset. We perform a preliminary analysis to understand (1) whether Abel overfits the training dataset and is thus brittle to out-of-distribution testing samples and (2) whether our SFT approach can quickly transfer and generalize Abel to datasets from different distributions.

Adversarial Evaluation on the GSM8k_robust Dataset

GSM8k_robust is a dataset we built from GSM8k: using GPT-4, we randomly modified the numbers within GSM8k questions without altering any other information, and asked GPT-4 to generate 'golden answers' for the modified questions. After manually reviewing a subset of these samples, we found that all the generated answers for the altered questions were accurate. We use GSM8k_robust to evaluate whether models overfit the training data, which would make them susceptible to out-of-distribution testing samples. Our analysis indicates that Abel is more robust to out-of-distribution testing samples than other models.

| Model | GSM8k | GSM8k_robust | Delta |
|---|---|---|---|
| Abel-7B | 59.74 | 58.23 | -1.51 |
| Abel-13B | 66.41 | 66.57 | +0.16 |
| Abel-70B | 83.62 | 81.80 | -1.82 |
| WizardMath-70B | 81.60 | 74.91 | -6.70 |
| WizardMath-13B | 63.90 | 59.51 | -4.39 |
| RFT-7B | 41.70 | 37.98 | -3.72 |
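
For illustration, a minimal sketch of how the number perturbation behind GSM8k_robust could be scripted (our reconstruction: the prompt wording and the use of the openai client are assumptions, not the project's released tooling):

```python
# Illustrative sketch of the number-perturbation step behind GSM8k_robust.
# The prompt text and client usage are assumptions, not the project's actual tooling.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Rewrite the following math word problem, changing only the numbers at random "
    "while keeping every other detail identical. Then solve the rewritten problem, "
    "ending with 'The answer is <number>.'\n\nProblem: {question}"
)

def perturb(question: str) -> str:
    """Ask GPT-4 to produce a number-perturbed variant plus its golden answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content
```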

Supervised Transfer Learning on the TAL-SCQ5K-EN Dataset

We demonstrate that Abel-70B not only achieves SOTA on the GSM8k and MATH datasets but also generalizes well to TAL-SCQ5K-EN 2K, a newly released dataset from the Math LLM provider TAL (好未来). Our analysis indicates that our SFT approach can successfully generalize Abel to datasets from different distributions. We will conduct further analyses and experiments to explore and improve Abel's generalization capabilities.

| Model | TAL-SCQ5K-EN 2K Testing Benchmark |
|---|---|
| Abel-70B | 59.7 |
| MathGPT | 59.0 |
| GPT-4 | 51.0 |
| Llama-70B | 43.8 |

Limitation

  • Overfitting: Despite our robustness analysis, and given that generative AI for mathematics is inherently fragile (often requiring advanced decoding strategies such as majority voting), excessive reliance on constructing SFT samples to boost performance can inevitably push the model toward overfitting. (Overfitting is not the primary concern of the current project, however: even when overfitting various augmented training data, it remains challenging to achieve favorable test-set results on complex mathematical reasoning tasks such as the MATH dataset.) Nevertheless, we still need to perform more extensive robustness analysis (#1) and actively explore training methods that can turn the model into a mathematical polymath, along with a more comprehensive cross-domain generalization analysis.
  • Generalization: A good mathematical model should not be limited to solving problems only on the GSM8K and MATH datasets; it should handle various types of problems, including those that assess different knowledge domains and require different types of responses (e.g., multiple-choice, true/false, proofs, arithmetic, etc.). The current model's capabilities are insufficient to generalize to these diverse scenarios (#2).
  • Universality: Ultimately, we anticipate that the mathematical reasoning abilities enabled by large models can be integrated into chatbots for various domains such as medicine, law, physics, and chemistry. The key to achieving AGI is incorporating the power of a strong mathematical model into other models, which the current model lacks (#3).
  • Multilinguality: The current model's training data and base model constraints limit its ability to provide responses in languages other than English (#4).
  • Advanced techniques: The current model primarily focuses on SFT; advanced techniques such as reward models, RLHF (Reinforcement Learning from Human Feedback), and tool use have not yet been explored. (#5, #6)

We have created a list of issues to track these limitations and potential solutions. Your opinions and comments are always welcome.

Citation

Please cite this repo if the model, code, or conclusions in it are helpful to you.

@misc{abel,
  author = {Chern, Ethan and Zou, Haoyang and Li, Xuefeng and Hu, Jiewen and Feng, Kehua and Li, Junlong and Liu, Pengfei},
  title = {Generative AI for Math: Abel},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/GAIR-NLP/abel}},
}

Acknowledgement

  • We thank the Shanghai AI Lab for supporting a portion of the computing resources.
  • We thank Jiasheng Gu for the helpful discussions in the early stage of the project.

Outlook

We are continuously refining our models and will be releasing updates. Stay tuned!