Merge branch 'main' into xw/file-editor-linter
xingyaoww authored Oct 31, 2024
2 parents 03fd2d3 + db4e1db commit 80db8f8
Showing 18 changed files with 593 additions and 259 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ghcr-build.yml
@@ -401,7 +401,7 @@ jobs:
exit 1
update_pr_description:
name: Update PR Description
- if: github.event_name == 'pull_request'
+ if: github.event_name == 'pull_request' && !github.event.pull_request.head.repo.fork
needs: [ghcr_build_runtime]
runs-on: ubuntu-latest
steps:
1 change: 1 addition & 0 deletions .gitignore
@@ -174,6 +174,7 @@ evaluation/bird/data
evaluation/gaia/data
evaluation/gorilla/data
evaluation/toolqa/data
evaluation/scienceagentbench/benchmark

# frontend

17 changes: 17 additions & 0 deletions evaluation/scienceagentbench/Dockerfile
@@ -0,0 +1,17 @@
FROM python:3.11-bookworm


# For OpenHands agents to explore the dataset directories, please download the full benchmark [here](https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/chen_8336_buckeyemail_osu_edu/EQuA6uJ3CtRHvRfZ2GiN1tYBRVJE4DSUD10MW61fr7HuSQ?e=sCBegG) and unzip it with password `scienceagentbench`.
# **Please DO NOT redistribute the unzipped data files online.**
# The link above downloads a benchmark.zip file to the current directory.
# Unzip it and place the benchmark folder under evaluation/scienceagentbench/.
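#
# For example (illustrative commands, run from the repository root after the manual download):
#   unzip -P scienceagentbench benchmark.zip
#   mv benchmark evaluation/scienceagentbench/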

RUN mkdir -p /benchmark
COPY benchmark /benchmark

RUN mkdir -p /workspace
WORKDIR /workspace

# pushd evaluation/scienceagentbench
# docker build -t xingyaoww/openhands-eval-scienceagentbench .
# popd
25 changes: 25 additions & 0 deletions evaluation/scienceagentbench/Dockerfile.evaluator
@@ -0,0 +1,25 @@
FROM mambaorg/micromamba:debian12

USER root
# For https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#code-generation-with-agents

RUN micromamba create -n sci-agent-eval python=3.10 pip setuptools wheel
RUN micromamba run -n sci-agent-eval pip install pip-tools

RUN mkdir -p /workspace
WORKDIR /workspace

RUN apt-get update && apt-get install -y git

RUN git clone https://github.com/OSU-NLP-Group/ScienceAgentBench.git /workspace/
RUN git checkout 4eddc7db6449a5ade3e37285747c8b208cd54ce7

RUN micromamba create -n sci-agent python=3.10 pip setuptools wheel
RUN micromamba run -n sci-agent pip install -r requirements.txt

# Replace all occurrences of conda with micromamba under /workspace
RUN find ./ -type f -exec sed -i 's/conda/micromamba/g' {} \;

# pushd evaluation/scienceagentbench
# docker build -t xingyaoww/openhands-eval-scienceagentbench-evaluator -f Dockerfile.evaluator .
# popd
54 changes: 54 additions & 0 deletions evaluation/scienceagentbench/README.md
@@ -0,0 +1,54 @@
# ScienceAgentBench Evaluation with OpenHands

This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: https://arxiv.org/abs/2410.05080).

## Setup Environment and LLM Configuration

Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.

## Setup ScienceAgentBench

To prevent benchmark data contamination, we only provide the annotation sheet on [Huggingface](https://huggingface.co/datasets/osunlp/ScienceAgentBench), which includes all necessary *inputs* to run an agent.
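
If you want to sanity-check the annotation sheet locally, a minimal sketch using the `datasets` library is shown below; the split name `validation` is an assumption, so check the dataset page for the actual splits.

```python
# Sketch: inspect the ScienceAgentBench annotation sheet from Hugging Face.
# Assumes `pip install datasets`; the split name below is an assumption.
from datasets import load_dataset

sheet = load_dataset('osunlp/ScienceAgentBench', split='validation')
print(sheet)      # column names and number of rows
print(sheet[0])   # inputs for the first task
```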

## Run Inference on ScienceAgentBench

```bash
./evaluation/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
```

where `model_config` is mandatory, and the rest are optional; a fully specified example invocation is shown after the list below.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would
like to evaluate. It could also be a release tag like `0.6.2`.
- `use_knowledge`, e.g. `true`, specifies whether the agent is allowed to use expert-provided knowledge as additional input. By default, it is set to `false`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the entire ScienceAgentBench dataset. Note: in order to
use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
default, it is set to 1.
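
For reference, a fully specified invocation with every optional argument filled in might look like the following (values are illustrative):

```bash
# Illustrative: evaluate the first 10 instances with expert-provided knowledge enabled,
# capping the agent at 30 iterations and running 5 workers in parallel.
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o HEAD true CodeActAgent 10 30 5
```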

## Evaluate Generated Programs

### Extract Necessary Information from OpenHands Log

After inference completes, you can use the following command to extract the information needed for evaluation from the output log:

```bash
python post_proc.py [log_fname]
```
- `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.

The output will be written to, e.g., `evaluation/.../output.converted.jsonl`.
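
Each line of the converted file is a JSON object with the fields kept by `post_proc.py` (`instance_id`, `instruction`, `test_result`, `cost`). As a quick check, a minimal sketch for summarizing the accumulated cost is shown below; the path is a placeholder, so point it at your own converted file.

```python
# Sketch: summarize the converted log; replace the placeholder path with your own file.
import json

path = 'evaluation/.../output.converted.jsonl'  # placeholder path
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

total_cost = sum(r['cost'] for r in records)
print(f'{len(records)} instances, total accumulated cost: ${total_cost:.2f}')
```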

### Run Evaluation

Please follow the steps [here](https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#evaluation-of-generated-code) to evaluate the generated programs.
30 changes: 30 additions & 0 deletions evaluation/scienceagentbench/post_proc.py
@@ -0,0 +1,30 @@
import json
from argparse import ArgumentParser

if __name__ == '__main__':
    # The only argument is the path to the OpenHands output log (JSONL).
    parser = ArgumentParser()
    parser.add_argument(
        'log_fname',
        type=str,
    )
    args = parser.parse_args()

    fname = args.log_fname
    out_fname = args.log_fname.replace('.jsonl', '.converted.jsonl')

    # Load every trajectory record from the raw output log.
    with open(fname, encoding='utf-8') as f:
        log = [json.loads(line) for line in f]

    # Keep only the fields needed for evaluation.
    simple_log = [
        json.dumps(
            {
                'instance_id': ex['instance_id'],
                'instruction': ex['instruction'],
                'test_result': ex['test_result'],
                'cost': ex['metrics']['accumulated_cost'],
            }
        )
        for ex in log
    ]

    # Write one converted record per line.
    with open(out_fname, 'w', encoding='utf-8') as f:
        f.write('\n'.join(simple_log))