Commit: Merge branch 'main' into xw/file-editor-linter
Showing 18 changed files with 593 additions and 259 deletions.
evaluation/scienceagentbench/Dockerfile (new file, +17 lines):
FROM python:3.11-bookworm

# For OpenHands agents to explore the dataset directories, please download the full benchmark [here](https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/chen_8336_buckeyemail_osu_edu/EQuA6uJ3CtRHvRfZ2GiN1tYBRVJE4DSUD10MW61fr7HuSQ?e=sCBegG) and unzip it with password `scienceagentbench`.
# **Please DO NOT redistribute the unzipped data files online.**
# It will download a benchmark.zip file to the current directory.
# Unzip it and put the benchmark folder under evaluation/scienceagentbench/.

RUN mkdir -p /benchmark
COPY benchmark /benchmark

RUN mkdir -p /workspace
WORKDIR /workspace

# pushd evaluation/scienceagentbench
# docker build -t xingyaoww/openhands-eval-scienceagentbench .
# popd
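To build the agent image, the commands from the comments above can be run as-is; a quick `docker run` is one way to spot-check that the benchmark data was copied in (a sketch: the image tag comes from the comments, and the `ls` check is illustrative):

```bash
# Build the agent-side image (same commands as the Dockerfile comments above).
pushd evaluation/scienceagentbench
docker build -t xingyaoww/openhands-eval-scienceagentbench .
popd

# Illustrative spot check: confirm the benchmark data landed in the image.
docker run --rm xingyaoww/openhands-eval-scienceagentbench ls /benchmark
```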
evaluation/scienceagentbench/Dockerfile.evaluator (new file, +25 lines):
FROM mambaorg/micromamba:debian12

USER root
# For https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#code-generation-with-agents

RUN micromamba create -n sci-agent-eval python=3.10 pip setuptools wheel
RUN micromamba run -n sci-agent-eval pip install pip-tools

RUN mkdir -p /workspace
WORKDIR /workspace

RUN apt-get update && apt-get install -y git

RUN git clone https://github.com/OSU-NLP-Group/ScienceAgentBench.git /workspace/
RUN git checkout 4eddc7db6449a5ade3e37285747c8b208cd54ce7

RUN micromamba create -n sci-agent python=3.10 pip setuptools wheel
RUN micromamba run -n sci-agent pip install -r requirements.txt

# Replace all occurrences of conda with micromamba under /workspace
RUN find ./ -type f -exec sed -i 's/conda/micromamba/g' {} \;

# pushd evaluation/scienceagentbench
# docker build -t xingyaoww/openhands-eval-scienceagentbench-evaluator -f Dockerfile.evaluator .
# popd
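Likewise for the evaluator image; listing the micromamba environments is one way to verify the build (a sketch: the image tag and `-f Dockerfile.evaluator` flag come from the comments above, and the environment names from the `micromamba create` lines):

```bash
# Build the evaluator image (same commands as the comments above).
pushd evaluation/scienceagentbench
docker build -t xingyaoww/openhands-eval-scienceagentbench-evaluator -f Dockerfile.evaluator .
popd

# Illustrative check: both sci-agent and sci-agent-eval environments should exist.
docker run --rm xingyaoww/openhands-eval-scienceagentbench-evaluator micromamba env list
```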
evaluation/scienceagentbench/README.md (new file, +54 lines):
# ScienceAgentBench Evaluation with OpenHands

This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: https://arxiv.org/abs/2410.05080).

## Setup Environment and LLM Configuration

Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.

## Setup ScienceAgentBench

To prevent benchmark data contamination, we only provide the annotation sheet on [Huggingface](https://huggingface.co/datasets/osunlp/ScienceAgentBench), which includes all necessary *inputs* to run an agent.
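For reference, one way to fetch the annotation sheet locally is with `huggingface-cli` (a sketch: the dataset ID comes from the link above; the output directory is an arbitrary example):

```bash
# Download the ScienceAgentBench annotation sheet from the Hugging Face Hub.
# Requires the huggingface_hub CLI; the --local-dir path is just an example.
huggingface-cli download osunlp/ScienceAgentBench --repo-type dataset --local-dir ./scienceagentbench_data
```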
## Run Inference on ScienceAgentBench

```bash
./evaluation/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
```

where `model_config` is mandatory, and the rest are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `use_knowledge`, e.g. `true`, specifies whether the agent may use expert-provided knowledge as additional input. By default, it is set to `false`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire ScienceAgentBench test set. Note: in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
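Putting the arguments together, a fully specified run might look like this (illustrative values only, following the positional order documented above):

```bash
# First 10 instances, expert knowledge enabled, CodeActAgent,
# at most 30 iterations, 3 parallel workers.
./evaluation/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3 true CodeActAgent 10 30 3
```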
## Evaluate Generated Programs

### Extract Necessary Information from OpenHands Log

After the inference is completed, you may use the following command to extract the necessary information from the output log for evaluation:

```bash
python post_proc.py [log_fname]
```

- `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.

Output will be written to, e.g., `evaluation/.../output.converted.jsonl`.
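Each line of the converted file is a standalone JSON record. To sanity-check the conversion, you can pretty-print the first record (a sketch; substitute your actual output path for the placeholder):

```bash
# Pretty-print the first converted record; it should contain the keys
# instance_id, instruction, test_result, and cost.
head -n 1 evaluation/.../output.converted.jsonl | python -m json.tool
```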
### Run Evaluation

Please follow the steps [here](https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#evaluation-of-generated-code) to evaluate the generated programs.
evaluation/scienceagentbench/post_proc.py (new file, +30 lines):
import json
from argparse import ArgumentParser

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument(
        'log_fname',
        type=str,
    )
    args = parser.parse_args()

    fname = args.log_fname
    out_fname = args.log_fname.replace('.jsonl', '.converted.jsonl')

    # Load the full OpenHands trajectory log (one JSON object per line).
    with open(fname, encoding='utf-8') as f:
        log = [json.loads(line) for line in f]

    # Keep only the fields needed for ScienceAgentBench evaluation.
    simple_log = [
        json.dumps(
            {
                'instance_id': ex['instance_id'],
                'instruction': ex['instruction'],
                'test_result': ex['test_result'],
                'cost': ex['metrics']['accumulated_cost'],
            }
        )
        for ex in log
    ]

    # End with a newline so the output is valid JSON Lines.
    with open(out_fname, 'w', encoding='utf-8') as f:
        f.write('\n'.join(simple_log) + '\n')