📣 FLAN-T5 is also useful in text-to-audio generation. Find our work at https://github.com/declare-lab/tango if you are interested.
This repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. We aim to facilitate simple and convenient benchmarking across multiple tasks and models.
Instruction-tuned models such as Flan-T5 and Alpaca are a promising way to approximate the performance of large language models (LLMs) like ChatGPT at lower cost. However, it is difficult to compare different models through qualitative inspection alone. To evaluate how well the models generalize across a wide range of unseen and challenging tasks, we can use academic benchmarks such as MMLU and BBH. Compared to existing libraries such as evaluation-harness and HELM, this repo focuses on simple and convenient evaluation across multiple models. Notably, we support most models from HuggingFace Transformers 🤗:
- AutoModelForCausalLM (e.g., GPT-2, GPT-J, OPT-IML, BLOOMZ)
- AutoModelForSeq2SeqLM (e.g., Flan-T5, Flan-UL2, TK-Instruct)
- LlamaForCausalLM (e.g., LLaMA, Alpaca, Vicuna)
- ChatGLM
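The model families above can be pictured as a small dispatch table from a `--model_name` value to a Transformers class. The sketch below is illustrative only: `LOADERS`, `resolve_loader`, and `load_model` are hypothetical names and do not describe how this repository is actually organized. The Transformers class names themselves are real, and ChatGLM is assumed to load through `AutoModel` with `trust_remote_code=True` because its modeling code lives on the Hub.

```python
import importlib

# Hypothetical mapping from a --model_name value to a Transformers class name
# plus extra from_pretrained keyword arguments. Illustrative, not the repo's code.
LOADERS = {
    "causal": ("AutoModelForCausalLM", {}),
    "seq_to_seq": ("AutoModelForSeq2SeqLM", {}),
    "llama": ("LlamaForCausalLM", {}),
    # ChatGLM ships custom modeling code, so remote code must be trusted.
    "chatglm": ("AutoModel", {"trust_remote_code": True}),
}


def resolve_loader(model_name):
    """Return (class_name, kwargs) for a model family, or raise ValueError."""
    if model_name not in LOADERS:
        raise ValueError(
            f"unknown model_name {model_name!r}; expected one of {sorted(LOADERS)}"
        )
    class_name, kwargs = LOADERS[model_name]
    return class_name, dict(kwargs)  # copy so callers cannot mutate the table


def load_model(model_name, model_path):
    """Load weights with the resolved class (requires `transformers` installed)."""
    class_name, kwargs = resolve_loader(model_name)
    transformers = importlib.import_module("transformers")  # imported lazily
    return getattr(transformers, class_name).from_pretrained(model_path, **kwargs)
```

Keeping the mapping in one table means a new model family needs only one new entry, and an unsupported name fails early with a readable error.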
Model Name | Model Path | Paper | Size | MMLU | BBH | DROP | HumanEval |
---|---|---|---|---|---|---|---|
GPT-4 | | Link | ? | 86.4 | | 80.9 | 67.0 |
ChatGPT | | Link | ? | 70.0 | | 64.1 | 48.1 |
seq_to_seq | google/flan-t5-xxl | Link | 11B | 54.5 | 43.9 | ||
seq_to_seq | google/flan-t5-xl | Link | 3B | 49.2 | 40.2 | 56.3 | |
llama | eachadea/vicuna-13b | Link | 13B | 49.7 | 37.1 | 32.9 | 15.2 |
llama | decapoda-research/llama-13b-hf | Link | 13B | 46.2 | 37.1 | 35.3 | 13.4 |
seq_to_seq | declare-lab/flan-alpaca-gpt4-xl | Link | 3B | 45.6 | 34.8 | ||
llama | TheBloke/koala-13B-HF | Link | 13B | 44.6 | 34.6 | 28.3 | 11.0 |
llama | chavinlo/alpaca-native | Link | 7B | 41.6 | 33.3 | 26.3 | 10.3 |
llama | TheBloke/wizardLM-7B-HF | Link | 7B | 36.4 | 32.9 | 15.2 | |
chatglm | THUDM/chatglm-6b | Link | 6B | 36.1 | 31.3 | 44.2 | 3.1 |
llama | decapoda-research/llama-7b-hf | Link | 7B | 35.2 | 30.9 | 27.6 | 10.3 |
llama | wombat-7b-gpt4-delta | Link | 7B | 33.0 | 32.4 | 7.9 | |
seq_to_seq | bigscience/mt0-xl | Link | 3B | 30.4 | |||
causal | facebook/opt-iml-max-1.3b | Link | 1B | 27.5 | 1.8 | ||
causal | OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | Link | 12B | 27.0 | 30.0 | 9.1 | |
causal | stabilityai/stablelm-base-alpha-7b | Link | 7B | 26.2 | 1.8 | ||
causal | databricks/dolly-v2-12b | Link | 12B | 25.7 | 7.9 | ||
causal | Salesforce/codegen-6B-mono | Link | 6B | 27.4 | |||
causal | togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | Link | 7B | 38.1 | 31.3 | 24.7 | 5.5 |
Evaluate on Massive Multitask Language Understanding (MMLU), which includes exam questions from 57 tasks such as mathematics, history, law, and medicine. We use 5-shot direct prompting and measure the exact-match score.
python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
# 0.4163936761145136
python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl
# 0.49252243270189433
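To make "5-shot direct prompting" concrete, here is a minimal sketch of how a few-shot multiple-choice prompt could be assembled and scored. It assumes the template used by the original MMLU evaluation script (a subject header, lettered options, and an `Answer:` line); whether this repository uses exactly that template is an assumption, and the function names are hypothetical.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one question; leave the answer blank for the query example."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, question, options):
    """Join a header, the solved demonstrations, and the unsolved query."""
    header = (
        f"The following are multiple choice questions (with answers) about {subject}."
    )
    demos = [format_example(q, opts, ans) for q, opts, ans in shots]
    return "\n\n".join([header, *demos, format_example(question, options)])


def exact_match(predictions, references):
    """Fraction of predicted letters that equal the gold letter exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]  # 1 shot for brevity
    print(build_prompt("arithmetic", shots, "What is 3 + 3?", ["5", "6", "7", "8"]))
```

In "direct" prompting the model is expected to emit the answer letter immediately after `Answer:`, with no intermediate reasoning, so scoring reduces to a string comparison.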
Evaluate on BIG-Bench Hard (BBH), which includes 23 challenging tasks on which PaLM (540B) performs below the average human rater. We use 3-shot direct prompting and measure the exact-match score.
python main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit
# 0.3468942926723247
Evaluate on DROP, a reading comprehension benchmark that requires discrete reasoning over paragraphs, such as arithmetic and counting. We use 3-shot direct prompting and measure the exact-match score.
python main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl
# 0.5632458233890215
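For free-form answers like those in DROP, exact match is usually computed after light normalization so that differences in casing, articles, or trailing punctuation do not count as errors. The sketch below shows one such scheme; the normalization rules and all function names are assumptions for illustration and may differ from what this repository or the official DROP scorer does.

```python
ARTICLES = {"a", "an", "the"}
EDGE_PUNCT = ".,;:!?\"'()"  # stripped only at token edges, so "3.5" survives


def normalize(answer):
    """Lowercase, trim edge punctuation from each token, and drop articles."""
    tokens = [tok.strip(EDGE_PUNCT) for tok in answer.lower().split()]
    return " ".join(tok for tok in tokens if tok and tok not in ARTICLES)


def exact_match(prediction, references):
    """1.0 if the prediction matches any acceptable reference, else 0.0."""
    target = normalize(prediction)
    return float(any(target == normalize(ref) for ref in references))


def corpus_exact_match(predictions, reference_lists):
    """Average the per-example exact-match scores over the dataset."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, reference_lists)]
    return sum(scores) / len(scores)
```

Punctuation is stripped only at token edges so that decimal answers such as `3.5` are not collapsed into `35`, which matters for a benchmark with many numeric answers.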
Evaluate on HumanEval, which includes 164 coding problems in Python. We use 0-shot direct prompting and measure the pass@1 score.
python main.py humaneval --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit
# {'pass@1': 0.1524390243902439}
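The pass@k metric can be sketched with the standard unbiased estimator: for a problem with `n` sampled completions of which `c` pass the unit tests, pass@k is `1 - C(n - c, k) / C(n, k)`, averaged over problems. The helper names below are hypothetical and this is not necessarily the code path used by `main.py`.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # One sample per problem: 25 problems solved, 139 unsolved (164 in total).
    results = [(1, 1)] * 25 + [(1, 0)] * 139
    print(mean_pass_at_k(results, k=1))
```

With `--n_sample 1`, pass@1 reduces to the fraction of problems whose single completion passes its tests. The score reported above would correspond to 25 of the 164 problems passing.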
Install dependencies and download data.
conda create -n flan-eval python=3.8 -y
conda activate flan-eval
pip install -r requirements.txt
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
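After extraction, a quick sanity check can confirm that the data landed where expected. This sketch assumes the archive unpacks into per-split folders (for example `data/mmlu/test`) holding one `<subject>_<split>.csv` file per subject; that layout and the `list_subjects` helper are assumptions, so adjust the paths if your copy differs.

```python
from pathlib import Path


def list_subjects(root="data/mmlu", split="test"):
    """Return sorted subject names found as <subject>_<split>.csv under root/split."""
    suffix = f"_{split}.csv"
    folder = Path(root) / split
    return sorted(p.name[: -len(suffix)] for p in folder.glob(f"*{suffix}"))


if __name__ == "__main__":
    subjects = list_subjects()
    print(f"found {len(subjects)} subjects")
    print(subjects[:5])
```

If the count does not match the 57 tasks mentioned above, the download or the `mv` step most likely did not complete.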