📄 Paper | 🤗 Leaderboard | 🤗 Development Set | 🤗 Human Agreement Set
- 11/07/2024: 🎉 We officially publish the HREF paper, along with this codebase, the HREF leaderboard, the development set, and the human agreement set! 🎉
@article{lyu2024href,
title={HREF: Human Response-Guided Evaluation of Instruction Following in Language Models},
author={Xinxi Lyu and Yizhong Wang and Hannaneh Hajishirzi and Pradeep Dasigi},
journal={arXiv preprint arXiv:2412.15524},
year={2024}
}
Make a new Python 3.10 environment and install the href
package.
# (optional) create the conda environment
conda create -n href python=3.10
conda activate href
# install the href package
pip install -e .
To evaluate a supported model on the HREF development set (see the list under `href/generation/configs`), run:
href evaluate \
--model_name Llama-3.1-8B-Instruct \
--annotator href
To evaluate a supported model (see the list under `href/generation/configs`), starting from generating its outputs and using a specific supported annotator, run:
href evaluate \
--model_name Llama-3.1-8B-Instruct \
--annotator llama3.1-70b_basic_w_reference \
--use_human_reference
General arguments
- `--model_name`: the model name, which corresponds to the name of the yaml configuration file under `generation_config_dir` (excluding `.yaml`).
- `--generation_config_dir`: the directory that contains the model generation configuration files.
- `--dataset`: the huggingface dataset name or the path to a local file to use for evaluation. Defaults to the development set of HREF.
- `--split`: the split to use in `dataset`. Defaults to `dev`.
- `--nr_category`: categories in HREF to include. Defaults to all 8 categories.
- `--seed`: random seed.
- `--save_dir`: directory to save all results.
- `--cache_dir`: the directory to store downloaded datasets, models, and intermediate annotation files.
Evaluation arguments
- `--annotator`: name of the evaluation method. It has to be one of the following three: (1) a basic annotator defined in `evaluation/evaluators.DEFINED_ANNOTATORS`; (2) a configuration name for llm_as_a_judge that corresponds to a directory in `llm_as_a_judge`; (3) a suite of the above two types of unit evaluators defined in `evaluation/evaluators.DEFINED_ANNOTATOR_SUITE_DICT`. Defaults to the suite `href` that we defined in our paper.
- `--config_dir`: the directory that contains the configurations for llm_as_a_judge evaluators.
- `--use_human_reference`: whether or not `annotator` needs to use the human responses. No need to specify it if `annotator` specifies an evaluator suite.
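For example, to evaluate on a local dataset file and write results to a custom directory, the optional arguments above can be combined as follows (the paths are placeholders):

```bash
href evaluate \
    --model_name Llama-3.1-8B-Instruct \
    --annotator href \
    --dataset <path to a local dataset file> \
    --split dev \
    --save_dir <directory to save results>
```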
There are several ways to evaluate your own model/tokenizer on the development set:
Option 1: if your model supports the huggingface transformers interface:
Create a generation configuration for your model (examples can be found under `generation/configs`); see the dropdown below for a description of each configurable argument in the yaml file:
Generation arguments
- `model_name_or_path`: the huggingface model name or the path to a local directory that contains the model.
- `tokenizer_name_or_path`: the huggingface tokenizer name or the path to a local directory that contains the tokenizer.
- `use_vllm`: if given, we will use vLLM to generate the responses.
- `use_slow_tokenizer`: if given, we will use the slow tokenizer.
- `max_new_tokens`: maximum number of new tokens to generate.
- `temperature`: the temperature we use for model generation.
- `batch_size`: batch size for generation.
- `load_in_8bit`: load the model in 8-bit mode, which will reduce memory and speed up inference.
- `gptq`: if given, we're evaluating a 4-bit quantized GPTQ model.
- `format`: a string that must contain the placeholder `{prompt}`; it will be applied to every input for model generation and is designed for applying a chat format. If set to the string `default`, the tokenizer's `apply_chat_template` is called instead. Remove this parameter if no template needs to be applied.
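For reference, a minimal sketch of such a yaml configuration is shown below; the field values are illustrative assumptions rather than a shipped config, so adjust them for your model:

```yaml
# Illustrative generation config sketch (values are assumptions, not defaults)
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
tokenizer_name_or_path: meta-llama/Llama-3.1-8B-Instruct
use_vllm: true
max_new_tokens: 2048
temperature: 0.0
batch_size: 16
format: default  # call the tokenizer's apply_chat_template
```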
Now run:
href evaluate \
--model_name <your configuration file name> \
--annotator href \
--generation_config_dir <directory containing your file>
- Generate model responses to the `instruction` field of the data in your own way.
- Make sure to save the responses in the following structure:
📂 <your response directory>
┣ 📂 Brainstorm
┃ ┗ 📄 responses.jsonl
┣ 📂 Open QA
┃ ┗ 📄 responses.jsonl
┣ ...
┃ ...
┗ 📂 Multi-Document Synthesis
  ┗ 📄 responses.jsonl
where each data point in `responses.jsonl` contains the fields: `instruction`, `output`, and `generator`.
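For illustration, a single line in `responses.jsonl` might look like the following (the values are made-up placeholders):

```json
{"instruction": "Write a haiku about autumn.", "output": "Leaves drift on cool wind...", "generator": "my-custom-model"}
```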
- Now run with `--response_dir` specified:
href evaluate \
--model_name <any custom model name> \
--response_dir <your response directory> \
--annotator href
To add your own API model, please follow the logic of how we implement the OpenAI API in `href/generation/generate.py`; it is relatively straightforward.
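As a rough, hypothetical sketch of what such an API-backed generator boils down to (this is not the actual interface in `generate.py`; the function name and arguments here are assumptions), you mainly need to map a list of instructions to a list of output strings:

```python
import os

from openai import OpenAI  # or your provider's SDK


def generate_with_api(prompts, model="gpt-4o-mini", max_new_tokens=1024, temperature=0.0):
    """Hypothetical helper: return one completion string per input prompt."""
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    outputs = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_new_tokens,
            temperature=temperature,
        )
        outputs.append(response.choices[0].message.content)
    return outputs
```

The generated strings can then be saved as `responses.jsonl` files in the directory structure above and passed in via `--response_dir`.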
To build the leaderboard using the evaluation results, run:
python scripts/build_leaderboard.py \
--models <model_1> <model_2> ... <model_n> \
--result_dir <path to directory that contains results>
Full arguments
- `--models`: names of the models to include in the leaderboard.
- `--annotator`: name of the evaluator that generated the annotation results.
- `--nr_category`: categories in HREF to include.
- `--result_dir`: path to the directory that contains the annotation files.
- `--save_dir`: directory to save all results.
To submit your custom model, or to change the configuration of your model, to be evaluated on HREF's evaluation set and posted on the leaderboard, create a GitHub issue or directly email us at xxxATallenaiDOTorg with the model generation configuration you created in Option 1 of Evaluate a custom model.
To calculate the human agreement rate of an evaluation method on the HREF human agreement set (Sections 3 and 4 in the paper), run:
href calculate_agreement \
--annotator llama3.1-70b_basic_w_reference \
--use_human_reference
General arguments
- `--dataset`: the huggingface dataset name or the path to a local file to use for analysis. Defaults to the human agreement set of HREF.
- `--split`: the split to use in `dataset`.
- `--nr_category`: categories in HREF to include. Defaults to all 8 categories.
- `--seed`: random seed.
- `--save_dir`: directory to save all results.
- `--cache_dir`: the directory to store downloaded datasets, models, and intermediate annotation files.
Evaluation arguments
- `--annotator`: name of the evaluation method. It has to be one of the following three: (1) a basic annotator defined in `evaluation/evaluators.DEFINED_ANNOTATORS`; (2) a configuration name for llm_as_a_judge that corresponds to a directory in `llm_as_a_judge`; (3) a suite of the above two types of unit evaluators defined in `evaluation/evaluators.DEFINED_ANNOTATOR_SUITE_DICT`. Defaults to the suite `href` that we defined in our paper.
- `--config_dir`: the directory that contains the configurations for llm_as_a_judge evaluators.
- `--use_human_reference`: whether or not `annotator` needs to use the human responses. No need to specify it if `annotator` specifies an evaluator suite.
This section gives instructions on how to add a new evaluator `<new_evaluator>` that can be passed as the argument following `--annotator` for all commands.
- Create a function for your evaluator in `href/evaluation/evaluators.py` (see the sketch below).
- Add the name `<new_evaluator>` to `href.evaluation.evaluators.DEFINED_ANNOTATORS`.
- Run:
href calculate_agreement \
--annotator <new_evaluator> \
--use_human_reference
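The exact function signature expected by `evaluators.py` is not reproduced here, so the sketch below is purely illustrative; it shows a hypothetical basic annotator that prefers whichever output has higher word overlap with the human reference:

```python
# Hypothetical unit evaluator sketch; adapt the signature to match the
# existing functions in href/evaluation/evaluators.py.
def overlap_annotator(instruction, output_1, output_2, human_reference=None, **kwargs):
    """Return 1 if output_1 wins, 2 if output_2 wins, and 0 for a tie."""

    def overlap(text):
        if not human_reference:
            return 0.0
        ref_tokens = set(human_reference.lower().split())
        out_tokens = set(text.lower().split())
        return len(ref_tokens & out_tokens) / max(len(ref_tokens), 1)

    score_1, score_2 = overlap(output_1), overlap(output_2)
    if score_1 == score_2:
        return 0
    return 1 if score_1 > score_2 else 2
```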
For llm_as_a_judge, we rely on an external package: a modified version of AlpacaEval. To create a new llm_as_a_judge evaluator, modify the configuration with the following steps:
- Create a new prompt template `<new_template>.txt` under `href/llm_as_a_judge/prompt_templates`.
  - Note that, besides the text, the template contains many placeholders for models' special tokens so that different models can fit in; do not change their names.
  - Refer to the existing templates when writing new ones.
- Add the configuration for `<new_evaluator>` under `href/llm_as_a_judge/model_settings.json`. This step specifies the configuration of your evaluator both for generation and for modifying the base prompt template (different models use different special tokens and system prompts).
  - The dictionary `template_kwargs` contains keys that correspond to the placeholders in the prompt templates; fill in the values with the corresponding special tokens of your model.
  - `fn_completions` and `completions_kwargs` are the configurations for the judge model. You can refer to the existing configurations for most of the desired settings; please refer to AlpacaEval for more advanced settings.
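As a rough illustration only, an entry in `model_settings.json` might look like the sketch below; the placeholder key names inside `template_kwargs` and the exact nesting are assumptions, so mirror the existing entries in the file for the authoritative structure:

```json
{
  "<new_evaluator>": {
    "template_kwargs": {
      "system_start": "<|start_header_id|>system<|end_header_id|>",
      "user_start": "<|start_header_id|>user<|end_header_id|>",
      "assistant_start": "<|start_header_id|>assistant<|end_header_id|>",
      "end_of_turn": "<|eot_id|>"
    },
    "fn_completions": "huggingface_local_completions",
    "completions_kwargs": {
      "model_name": "meta-llama/Llama-3.1-70B-Instruct",
      "max_new_tokens": 100,
      "temperature": 0.0
    }
  }
}
```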
To create the configuration file using the configurations from the previous two steps, run:
href create_config \
--model_config_name <new_evaluator> \
--template_name <new_template>
Required Arguments
- `--model_config_name`: the name of the model configuration used as the judge, defined in `href/llm_as_a_judge/model_settings.json`.
- `--template_name`: the name of the template file in `href/llm_as_a_judge/prompt_templates` (without the suffix).
Optional Arguments
- `--config_dir`: the directory to save the resulting configuration.
- `--no_exmple`: if specified, remove the demonstration examples in the prompt.
- `--temperature`: the temperature for the judge model.
This will create a configuration directory named `<new_evaluator>_<new_template>` under `config_dir` (defaults to `href/llm_as_a_judge/configs`), which contains a configuration yaml file and the resulting prompt template. Now run:
href calculate_agreement \
--annotator <new_evaluator>_<new_template> \
--use_human_reference
To create an evaluator suite where different unit evaluators are used for different categories, append an entry to `ANNOTATOR_SUITE_DICT` in `href/evaluation/evaluators.py`, specifying for each category the unit annotator (`annotator`) and whether it uses the human responses (`use_human_reference`); a sketch is shown after the command below. Then run:
href calculate_agreement --annotator <new_evaluator_suite>
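For illustration only (the two category names come from HREF, but the exact nesting of `ANNOTATOR_SUITE_DICT` is an assumption; mirror the existing suites in `evaluators.py`):

```python
# Hypothetical suite entry in href/evaluation/evaluators.py; one entry per HREF category.
ANNOTATOR_SUITE_DICT["<new_evaluator_suite>"] = {
    "Brainstorm": {
        "annotator": "llama3.1-70b_basic_w_reference",  # an llm_as_a_judge config
        "use_human_reference": True,
    },
    "Open QA": {
        "annotator": "<new_evaluator>",  # a basic annotator
        "use_human_reference": False,
    },
    # ... remaining categories
}
```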
Note that you should pass in `--use_human_reference` according to whether your evaluator needs to utilize the human responses, unless you are specifying an evaluator suite.
To compare the human agreement rates among different annotators, run:
python scripts/compare_annotators.py \
--annotators <evaluator_1> <evaluator_2> ... <evaluator_n> \
--result_dir <path to directory that contains results>
Full arguments
- `--annotators`: names of the evaluation methods. Each has to be one of the following three: (1) a basic annotator defined in `evaluation/evaluators.DEFINED_ANNOTATORS`; (2) a configuration name for llm_as_a_judge that corresponds to a directory in `llm_as_a_judge`; (3) a suite of the above two types of unit evaluators defined in `evaluation/evaluators.DEFINED_ANNOTATOR_SUITE_DICT`.
- `--dataset`: the huggingface dataset name or the path to a local file to use for analysis. Defaults to the human agreement set of HREF.
- `--split`: the split to use in `dataset`.
- `--nr_category`: categories in HREF to include. Defaults to all 8 categories.
- `--seed`: random seed.
- `--save_dir`: directory to save all results.
- `--cache_dir`: the directory to store downloaded datasets, models, and intermediate annotation files.