This project provides a unified framework to test causal (GPT-2, GPT-3, GPTNeo, etc) and seq2seq (T5, T0) language models via prompt evaluation.
As of now, all prompts are provided via the promptsource
eval-hackathon branch; all datasets are from huggingface datasets
.
This fork is not backwards compatible with the original evaluation harness.
git clone https://github.com/bigscience-workshop/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"
To evaluate a model (e.g. GPT-2) on NLP tasks such as SuperGLUE WiC, you can run the following command:
python main.py \
--model_api_name 'hf-causal' \
--model_args pretrained='gpt2' \
--task_name 'wic' \
--template_names 'same_sense','polysemous' \
--device cpu
Additional arguments can be provided to the model constructor using the --model_args
flag. For larger models supported by HuggingFace transformers
, we provide parallelism and mixed-precision utilities through the accelerate
package. It can be activated for hf-causal
/hf-seq2seq
by passing use_accelerate=True
and dtype=half
to the --model_args
flag, respectively. For finer grained control over accelerate
options, see the constructor docstrings for HuggingFaceAutoLM
in huggingface.py
.
python main.py \
--model_api_name 'hf-causal' \
--model_args use_accelerate=True,pretrained='facebook/opt-13b' \
--task_name wnli
If you have access to the OpenAI API, you can also evaluate GPT-3 engines:
export OPENAI_API_SECRET_KEY={YOUR_KEY_HERE}
python main.py \
--model_api_name 'openai' \
--model_args engine='curie' \
--task_name hans
When reporting results from eval harness, please include the task versions (shown in results["versions"]
) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible.
usage: main.py [-h] --model_api_name MODEL_API_NAME [--model_args MODEL_ARGS] --task_name TASK_NAME
[--template_names TEMPLATE_NAMES] [--num_fewshot NUM_FEWSHOT] [--batch_size BATCH_SIZE]
[--device DEVICE] [--limit LIMIT] [--output_path OUTPUT_PATH] [--template_idx TEMPLATE_IDX]
[--bootstrap_iters BOOTSTRAP_ITERS] [--no_tracking] [--use_cache]
optional arguments:
-h, --help show this help message and exit
--model_api_name MODEL_API_NAME
Name of the model API to use. See `lm_eval.list_model_apis()` for available APIs
--model_args MODEL_ARGS
Model constructor args that you'd pass into a model of type `--model_api_name`. These must
be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces
--task_name TASK_NAME
Name of the task to use as found in the lm_eval registry. See: `lm_eval.list_tasks()`
--task_args TASK_ARGS
Optional task constructor args that you'd pass into a task class of kind " `--task_name`.
These must be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces.
WARNING: To avoid parsing errors, ensure your strings are quoted. For example,
`example_separator='\n+++\n'`
WARNING: Values must NOT contain commas.
--template_names TEMPLATE_NAMES
Comma-separated list of template names for the specified task. Example:
`> python main.py ... --task_name rte --template_names imply,mean`
- Default: `all_templates`
- General Selectors:
- `"all_templates"`: Selects all templates for the task
- `"original_templates"`: Selects only templates that are designed to match the original task
--num_fewshot NUM_FEWSHOT
--batch_size BATCH_SIZE
--seed SEED
--device DEVICE The device to place your model onto, e.g. cuda:0. For large models available through the
HuggingFace Hub you should use `accelerate` by passing `use_accelerate=True` to
`--model_args`
--limit LIMIT Limit the number of examples to evaluate on; ONLY USE THIS FOR DEBUGGING PURPOSES
--output_path OUTPUT_PATH
Use output_path as `output_filename`. For example:
`> python main.py ... --output_path blop`
# saves files into `outputs/blop.json` Warning: You currently cannot change/add folder
structure.
--template_idx TEMPLATE_IDX
Choose template by index from available templates
--bootstrap_iters BOOTSTRAP_ITERS
Iters for stderr computation
--no_tracking Skip carbon emission tracking
--use_cache Whether to cache your model's predictions or not
You can also use lm_eval
as a library:
import lm_eval
model = lm_eval.get_model("hf-causal", pretrained="gpt2", device="cpu")
tasks = lm_eval.get_task_list(
"superglue_rte",
template_names=["does this imply", "must be true"])
results = lm_eval.evaluate(model=model, tasks=tasks)
The main user-facing functions are:
lm_eval.get_model(model_api_name, **kwargs)
creates a model from a model APIlm_eval.get_task(task_name, template_name, **kwargs)
creates a task with the prompt templatelm_eval.get_task_list(task_name, template_names, **kwargs)
creates multiple instances of a task with different prompt templateslm_eval.evaluate(model, tasks, **kwargs)
evaluates a model on a list of tasks
Some high-level convenience functions are also made available:
lm_eval.list_model_apis()
lists all available model APIs you can evaluate fromlm_eval.list_tasks()
lists all available taskslm_eval.list_templates(task_name)
lists all available templates for a tasklm_eval.get_templates(task_name)
returns promptsource dataset templates for a task
-
You must pass templates to
PerplexityTask
s even though they will be ignored, as models will be scored from the raw text found in the task's dataset. -
Multi-lingual ROUGE is unsupported as general token splitting is absent from rouge-score. For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE:
English
works as intended. -
Task versioning is not fully integrated! If you're reporting your model's results, please include the package versions or commit IDs for this
lm-evaluation-harness
branch as well as the HuggingFacedatasets
andpromptsource
packages. -
promptsource
installation issue: Some prompts may be excluded from the installedpromptsource
branch due to git-based pip installation issues. If the latest commit on thepromptsource/eval-hackathon
branch contains a prompt you're looking for but was not included in the installed version from oursetup.py
, you should run the following from within your environment:pip uninstall promptsource git clone --single-branch --branch eval-hackathon https://github.com/bigscience-workshop/promptsource cd promptsource pip install -e .
-
Growing number of tasks integrated with
promptsource
(20+). -
Support for HuggingFace Causal language models, HuggingFace Seq2Seq models, and the OpenAI Completions API (GPT-3), with flexible tokenization-agnostic interfaces.
To implement a new task in eval harness, follow the PromptSourceTask
template.
You can use load_from_disk (convenient on Jean Zay supercomputer) by setting task_args download_mode='load_from_disk',data_dir=<data/path>