
ConsisEval: A Hard-to-Easy Consistency Evaluation
Benchmark for Large Language Models

Overview

ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Here, hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, while capable of solving hard problems, paradoxically fail at easier ones.

ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Note that ConsisEval contains only pairwise data: each datum comprises two questions (an easy one and a harder one), with a strict order of difficulty between them.

Data

  • Easy data is collected from GSM8K, IFEval and HumanEval.
  • Hard data is derived from the easy data via automatic generation and human annotation.
  • ConsisEval (the combination of easy and hard data) is in directory data.

Evaluation Metric

  • Consistency Score (CS): the conditional probability that a model correctly answers an easy question given that it has correctly answered the corresponding harder one.

For more details about metrics, please refer to our paper.
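
As a rough sketch (not the repository's implementation in analysis.py), CS can be estimated empirically from per-pair correctness records; the function name and record format below are assumptions for illustration only:

def consistency_score(pairs):
    # pairs: iterable of (easy_correct, hard_correct) booleans, one entry per question pair
    easy_given_hard = [easy for easy, hard in pairs if hard]
    if not easy_given_hard:
        return float("nan")  # undefined when no hard question is answered correctly
    # empirical P(easy correct | hard correct)
    return sum(easy_given_hard) / len(easy_given_hard)

# illustrative usage with made-up results for three pairs:
print(consistency_score([(True, True), (False, True), (True, False)]))  # 0.5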

Environments

All Python packages required to run the code are listed in requirements.txt and can be installed with:

pip install -r requirements.txt

For evaluation in the instruction-following domain, run the following Python code to download the punkt tokenizer:

>>> import nltk
>>> nltk.download('punkt')

Evaluation

Answer Generation

The code and script for answer generation are in eval.py and eval.sh; set the following arguments before launching:

  • --model_name: the name of the evaluated model
  • --model_path: the path to the evaluated model (if not specified, the model will be downloaded from HuggingFace)
  • --task: the evaluated domain (only code, math and instruction_following are supported)
  • --sampling_times: the number of repeated samples per question

After running bash eval.sh, the answers generated by the evaluated model will be stored in ./log.
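
For example, a direct invocation of eval.py with these arguments might look like the following (the model name, path and sampling count are illustrative):

python eval.py --model_name my-model --model_path /path/to/model --task math --sampling_times 5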

Metric Computation

The code and script for metric computation are in analysis.py and analysis.sh.

--model_name and --task should be specified for metric computation.
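
For example (the model name is illustrative and should match the one used for answer generation):

python analysis.py --model_name my-model --task math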

Citation

@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?}, 
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
