[eval] Add ScienceAgentBench. #4645

Merged (20 commits), Oct 31, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -174,6 +174,7 @@ evaluation/bird/data
evaluation/gaia/data
evaluation/gorilla/data
evaluation/toolqa/data
evaluation/scienceagentbench/benchmark

# frontend

17 changes: 17 additions & 0 deletions evaluation/scienceagentbench/Dockerfile
@@ -0,0 +1,17 @@
FROM python:3.11-bookworm


# For OpenHands agents to explore the dataset directories, please download the full benchmark [here](https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/chen_8336_buckeyemail_osu_edu/EQuA6uJ3CtRHvRfZ2GiN1tYBRVJE4DSUD10MW61fr7HuSQ?e=sCBegG) and unzip it with the password `scienceagentbench`.
# **Please DO NOT redistribute the unzipped data files online.**
# The download produces a benchmark.zip file in the current directory.
# Unzip it and place the benchmark folder under evaluation/scienceagentbench/.
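# A possible unzip invocation (a sketch assuming Info-ZIP's `unzip` is installed and
# benchmark.zip sits in the repository root; adjust paths to your checkout):
#   unzip -P scienceagentbench benchmark.zip -d evaluation/scienceagentbench/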

RUN mkdir -p /benchmark
COPY benchmark /benchmark

RUN mkdir -p /workspace
WORKDIR /workspace

# pushd evaluation/scienceagentbench
# docker build -t xingyaoww/openhands-eval-scienceagentbench .
# popd
53 changes: 53 additions & 0 deletions evaluation/scienceagentbench/README.md
@@ -0,0 +1,53 @@
# ScienceAgentBench Evaluation with OpenHands

This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: https://arxiv.org/abs/2410.05080).

## Setup Environment and LLM Configuration

Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.

## Setup ScienceAgentBench

To prevent benchmark data contamination, we only provide the annotation sheet on [Huggingface](https://huggingface.co/datasets/osunlp/ScienceAgentBench), which includes all necessary *inputs* to run an agent.

## Run Inference on ScienceAgentBench

```bash
./evaluation/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
```

where `model_config` is mandatory, and the rest are optional; a fuller example invocation is sketched after the list below.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would
like to evaluate. It could also be a release tag like `0.6.2`.
- `use_knowledge`, e.g. `true`, specifies whether the agent is allowed to use expert-provided knowledge as additional input. By default, it is set to `false`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the full ScienceAgentBench dataset. Note:
in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
default, it is set to 1.
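
For reference, a fuller invocation that fills in the optional arguments described above might look like the sketch below; the values are illustrative rather than recommended settings, and the trailing `dataset` and `dataset_split` arguments are left at their defaults:

```bash
# Hypothetical run: GPT-4o config, current checkout, expert knowledge enabled,
# CodeActAgent, first 10 instances, up to 30 iterations, 1 worker.
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o HEAD true CodeActAgent 10 30 1
```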

## Evaluate Generated Programs

### Extract Necessary Information from OpenHands Log

After inference is complete, use the following command to extract the necessary information from the output log for evaluation:

```bash
python post_proc.py --log_fname [log_fname] --out_fname [out_fname]
```
- `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.
- `out_fname`, e.g. `gpt_4o_openhands.jsonl`, is the name you choose for the simplified log, which contains the final versions of the generated programs and the agent costs; a quick sanity check of this file is sketched below.
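
Each line of the simplified log is a JSON object with the `instance_id`, `instruction`, `test_result`, and `cost` fields written by `post_proc.py`. A minimal check, assuming the output file name from the example above, is to pretty-print the first record:

```bash
# Pretty-print the first simplified record to confirm the expected fields are present.
head -n 1 gpt_4o_openhands.jsonl | python -m json.tool
```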

### Run Evaluation

Please follow the steps [here](https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#evaluation-of-generated-code) to evaluate the generated programs.
34 changes: 34 additions & 0 deletions evaluation/scienceagentbench/post_proc.py
@@ -0,0 +1,34 @@
import json
from argparse import ArgumentParser

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument(
        '--log_fname',
        type=str,
        help='Path to the OpenHands output log (JSONL), e.g. evaluation/.../output.jsonl',
    )
    parser.add_argument(
        '--out_fname',
        type=str,
        help='Path of the simplified JSONL log to write',
    )
    args = parser.parse_args()

    fname = args.log_fname
    out_fname = args.out_fname

    # Each line of the OpenHands log is a JSON record for one benchmark instance.
    with open(fname, encoding='utf-8') as f:
        log = [json.loads(line) for line in f]

    # Keep only the fields needed for ScienceAgentBench evaluation: the instance id,
    # the instruction, the final test result, and the accumulated agent cost.
    simple_log = [
        json.dumps(
            {
                'instance_id': ex['instance_id'],
                'instruction': ex['instruction'],
                'test_result': ex['test_result'],
                'cost': ex['metrics']['accumulated_cost'],
            }
        )
        for ex in log
    ]

    with open(out_fname, 'w+', encoding='utf-8') as f:
        f.write('\n'.join(simple_log))