`lm-evaluation-harness` + `promptsource`

Overview

This project provides a unified framework to test causal (GPT-2, GPT-3, GPT-Neo, etc.) and seq2seq (T5, T0) language models via prompt evaluation.

As of now, all prompts are provided via the promptsource eval-hackathon branch; all datasets come from HuggingFace datasets.

This fork is not backwards compatible with the original evaluation harness.

Installation

git clone https://github.com/bigscience-workshop/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"

CLI Usage 🖥️

To evaluate a model (e.g. GPT-2) on NLP tasks such as SuperGLUE WiC, you can run the following command:

python main.py \
    --model_api_name 'hf-causal' \
    --model_args pretrained='gpt2' \
    --task_name 'wic' \
    --template_names 'same_sense','polysemous' \
    --device cpu

Additional arguments can be passed to the model constructor with the --model_args flag. For larger models supported by HuggingFace transformers, we provide parallelism and mixed-precision utilities through the accelerate package. For hf-causal and hf-seq2seq models, enable parallelism by passing use_accelerate=True and half precision by passing dtype=half to the --model_args flag. For finer-grained control over accelerate options, see the constructor docstrings for HuggingFaceAutoLM in huggingface.py.

python main.py \
    --model_api_name 'hf-causal' \
    --model_args use_accelerate=True,pretrained='facebook/opt-13b' \
    --task_name wnli
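
The --model_args value is a flat string of comma-separated keyword arguments. As a rough illustration of how such a string can be turned into constructor kwargs, here is a minimal sketch. The helper name parse_args_string is hypothetical and is not taken from this repository; the harness's own parsing may differ.

```python
import ast


def parse_args_string(args_string):
    """Parse 'key1=value1,key2=value2' into a dict (hypothetical helper)."""
    parsed = {}
    if not args_string:
        return parsed
    # Splitting on commas is why values themselves must not contain commas.
    for pair in args_string.split(","):
        key, _, raw_value = pair.partition("=")
        try:
            # Interpret Python literals such as True, 13, or 'gpt2'.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Fall back to the raw string, e.g. facebook/opt-13b or half.
            value = raw_value
        parsed[key.strip()] = value
    return parsed


model_kwargs = parse_args_string(
    "use_accelerate=True,pretrained=facebook/opt-13b,dtype=half"
)
print(model_kwargs)
```

Note that the shell removes the quotes around values such as 'gpt2' before the program sees them, which is why the sketch falls back to plain strings.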

If you have access to the OpenAI API, you can also evaluate GPT-3 engines:

export OPENAI_API_SECRET_KEY={YOUR_KEY_HERE}
python main.py \
    --model_api_name 'openai' \
    --model_args engine='curie' \
    --task_name hans

When reporting results from the eval harness, please include the task versions (shown in results["versions"]) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible.
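
One way to make the versions easy to paste into a report is sketched below. It assumes that results["versions"] maps each task name to a version number; that layout is an assumption for illustration, and the version values shown are placeholders, not real task versions.

```python
import json


def summarize_versions(results):
    """Return a sorted list of 'task=vN' strings from results['versions']."""
    # Assumed layout: {"versions": {task_name: version_number, ...}}
    versions = results.get("versions", {})
    return [f"{task}=v{version}" for task, version in sorted(versions.items())]


# Placeholder data: only the "versions" key comes from the text above.
example_results = {"versions": {"wic": 0, "wnli": 1}}
print(json.dumps(summarize_versions(example_results)))
```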

Detailed Usage

usage: main.py [-h] --model_api_name MODEL_API_NAME [--model_args MODEL_ARGS] --task_name TASK_NAME
               [--template_names TEMPLATE_NAMES] [--num_fewshot NUM_FEWSHOT] [--batch_size BATCH_SIZE]
               [--device DEVICE] [--limit LIMIT] [--output_path OUTPUT_PATH] [--template_idx TEMPLATE_IDX]
               [--bootstrap_iters BOOTSTRAP_ITERS] [--no_tracking] [--use_cache]

optional arguments:
  -h, --help            show this help message and exit
  --model_api_name MODEL_API_NAME
                        Name of the model API to use. See `lm_eval.list_model_apis()` for available APIs
  --model_args MODEL_ARGS
                        Model constructor args that you'd pass into a model of type `--model_api_name`. These must
                        be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces
  --task_name TASK_NAME
                        Name of the task to use as found in the lm_eval registry. See: `lm_eval.list_tasks()`
  --task_args TASK_ARGS
                        Optional task constructor args that you'd pass into a task class of kind `--task_name`.
                        These must be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces.
                        WARNING: To avoid parsing errors, ensure your strings are quoted. For example,
                            `example_separator='\n+++\n'`
                        WARNING: Values must NOT contain commas.
  --template_names TEMPLATE_NAMES
                        Comma-separated list of template names for the specified task. Example:
                        `> python main.py ... --task_name rte --template_names imply,mean`
                        - Default: `all_templates`
                        - General Selectors:
                            - `"all_templates"`: Selects all templates for the task
                            - `"original_templates"`: Selects only templates that are designed to match the original task
  --num_fewshot NUM_FEWSHOT
  --batch_size BATCH_SIZE
  --seed SEED
  --device DEVICE       The device to place your model onto, e.g. cuda:0. For large models available through the
                        HuggingFace Hub you should use `accelerate` by passing `use_accelerate=True` to
                        `--model_args`
  --limit LIMIT         Limit the number of examples to evaluate on; ONLY USE THIS FOR DEBUGGING PURPOSES
  --output_path OUTPUT_PATH
                        Use output_path as `output_filename`. For example:
                        `> python main.py ... --output_path blop`
                        # saves files into `outputs/blop.json` Warning: You currently cannot change/add folder
                        structure.
  --template_idx TEMPLATE_IDX
                        Choose template by index from available templates
  --bootstrap_iters BOOTSTRAP_ITERS
                        Iters for stderr computation
  --no_tracking         Skip carbon emission tracking
  --use_cache           Whether to cache your model's predictions or not
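
To give a feel for what --bootstrap_iters controls, here is a minimal sketch of a bootstrap standard-error estimate for a mean metric such as accuracy. It illustrates the general technique only; the function name, the seed, and the per-example data are assumptions, and the harness's own implementation may differ.

```python
import random
import statistics


def bootstrap_stderr(values, iters=1000, seed=1234):
    """Estimate the standard error of the mean by bootstrap resampling."""
    rng = random.Random(seed)
    n = len(values)
    resampled_means = []
    for _ in range(iters):
        # Draw n examples with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)  # requires iters >= 2


per_example_correct = [1, 0, 1, 1, 0, 1, 0, 1]  # illustrative 0/1 outcomes
print(bootstrap_stderr(per_example_correct, iters=1000))
```

More iterations give a more stable estimate at the cost of extra compute.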

Library Usage 📖

You can also use lm_eval as a library:

import lm_eval

model = lm_eval.get_model("hf-causal", pretrained="gpt2", device="cpu")
tasks = lm_eval.get_task_list(
    "superglue_rte",
    template_names=["does this imply", "must be true"])
results = lm_eval.evaluate(model=model, tasks=tasks)

The main user-facing functions are:

lm_eval.get_model(model_api_name, **kwargs) creates a model from a model API
lm_eval.get_task(task_name, template_name, **kwargs) creates a task with the prompt template
lm_eval.get_task_list(task_name, template_names, **kwargs) creates multiple instances of a task with different prompt templates
lm_eval.evaluate(model, tasks, **kwargs) evaluates a model on a list of tasks

Some high-level convenience functions are also made available:

lm_eval.list_model_apis() lists all available model APIs you can evaluate from
lm_eval.list_tasks() lists all available tasks
lm_eval.list_templates(task_name) lists all available templates for a task
lm_eval.get_templates(task_name) returns promptsource dataset templates for a task

Gotchas 🩹

You must pass templates to PerplexityTasks even though they will be ignored, as models will be scored from the raw text found in the task's dataset.
Multi-lingual ROUGE is unsupported as general token splitting is absent from rouge-score. For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.
Task versioning is not fully integrated! If you're reporting your model's results, please include the package versions or commit IDs for this lm-evaluation-harness branch as well as the HuggingFace datasets and promptsource packages.
promptsource installation issue: Some prompts may be missing from the installed promptsource package because of git-based pip installation issues. If the latest commit on the promptsource eval-hackathon branch contains a prompt that is missing from the version installed by our setup.py, run the following from within your environment:
```
pip uninstall promptsource
git clone --single-branch --branch eval-hackathon https://github.com/bigscience-workshop/promptsource
cd promptsource
pip install -e .
```
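
To illustrate the multi-lingual ROUGE gotcha above, the sketch below shows what happens when a tokenizer keeps only ASCII letters and digits. This is an assumed simplification used for illustration, not the exact tokenizer shipped in rouge-score, and the function name is hypothetical.

```python
import re


def ascii_alnum_tokenize(text):
    """Lowercase, then keep only runs of ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


# English text splits into the expected tokens.
print(ascii_alnum_tokenize("The cat sat."))
# Non-Latin text yields no tokens at all, so n-gram overlap is meaningless.
print(ascii_alnum_tokenize("猫が座った。"))
```

With a tokenizer like this, any overlap-based score on non-English text reflects the tokenizer rather than the model.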

Features

Growing number of tasks integrated with promptsource (20+).
Support for HuggingFace Causal language models, HuggingFace Seq2Seq models, and the OpenAI Completions API (GPT-3), with flexible tokenization-agnostic interfaces.

Implementing new tasks

To implement a new task in the eval harness, follow the PromptSourceTask template.

Using load_from_disk instead of load_dataset

You can use load_from_disk instead of load_dataset (convenient on the Jean Zay supercomputer) by passing download_mode='load_from_disk',data_dir=<data/path> to --task_args.
