
[Evaluation] DiscoveryBench OpenHands Integration #4562

Open · wants to merge 36 commits into main

Conversation

Ethan0456 (Contributor)

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench Benchmark into OpenHands, enabling it to evaluate the agent's capability for multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.
https://github.com/allenai/discoverybench/
https://x.com/mbodhisattwa/status/1811524569410531333

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success in generating hypotheses, analyzing data, and reasoning through complex workflows.


Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by incorporating a structured flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of loading the dataset from Hugging Face, we clone the repository so that we always pick up the latest version and updates from upstream.
    • process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis (a minimal sketch of the parsing step follows below).
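
A minimal, hypothetical sketch of the hypothesis-parsing step mentioned above. The `Hypothesis:` marker, the fallback behaviour, and the function name are assumptions for illustration only, not the actual code in run_infer.py:

```python
# Hypothetical sketch of the hypothesis-parsing step inside process_instance.
# The "Hypothesis:" marker and the fallback behaviour are assumptions for
# illustration; the actual parsing logic lives in run_infer.py.
import re


def parse_hypothesis(final_message: str) -> str:
    """Extract the hypothesis text from the agent's final message."""
    match = re.search(r'Hypothesis:\s*(.+)', final_message, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to the whole message if no explicit marker is present.
    return final_message.strip()


if __name__ == '__main__':
    message = (
        'After analysing the data, my final answer is below.\n'
        'Hypothesis: Higher education levels are associated with lower incarceration rates.'
    )
    print(parse_hypothesis(message))
```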

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured:
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review (see the sketch after this list).
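
For orientation, here is a high-level, hypothetical sketch of the flow the list above describes. The repository layout (`metadata_*.json` files), field names, and the stubbed per-instance step are assumptions for illustration; the real run_infer.py plugs these steps into the OpenHands evaluation harness and a per-task Docker runtime:

```python
# Hypothetical end-to-end sketch of the run_infer.py flow: clone DiscoveryBench,
# load task metadata into a pandas DataFrame, iterate over instances, and write
# one JSON line per result. Paths and field names are assumptions.
import json
import subprocess
from pathlib import Path

import pandas as pd

REPO_URL = 'https://github.com/allenai/discoverybench.git'
WORKDIR = Path('discoverybench')
OUTPUT_FILE = Path('output.jsonl')


def clone_benchmark() -> None:
    """Clone the upstream repo so evaluation always uses the latest tasks."""
    if not WORKDIR.exists():
        subprocess.run(['git', 'clone', REPO_URL, str(WORKDIR)], check=True)


def load_instances() -> pd.DataFrame:
    """Collect task metadata files into a DataFrame (file layout assumed here)."""
    records = []
    for meta_path in sorted(WORKDIR.rglob('metadata_*.json')):
        with open(meta_path) as f:
            records.append({'metadata_path': str(meta_path), **json.load(f)})
    return pd.DataFrame(records)


def main() -> None:
    clone_benchmark()
    instances = load_instances()
    with open(OUTPUT_FILE, 'w') as out:
        for _, instance in instances.iterrows():
            # The real integration spins up a Docker runtime here, invokes the
            # OpenHands agent, parses its hypothesis, and scores it against the
            # gold hypothesis; this sketch only records which task was seen.
            test_result = {'metadata_path': instance['metadata_path']}
            out.write(json.dumps(test_result) + '\n')


if __name__ == '__main__':
    main()
```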

Link of any specific issues this addresses

@neubig self-requested a review on October 25, 2024

neubig (Contributor) left a comment:

Hey @Ethan0456 , thanks so much for this, this is exciting!

One question: did you run any of the agents against discoverybench yet? If so, it'd be good to know the results.

We just made a major improvement to the agents, so we'll probably want to run it again, but I just wanted to check for now: https://x.com/gneubig/status/1849874034810618180

suranah (Contributor) commented Oct 28, 2024

Thanks for updating us about the agent improvements, @neubig. We added code to accommodate the runtime changes.

We have run our agents on a subset of DiscoveryBench (nls_incarceration) with gpt-4o. Here are the results for the 28 instances:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.321 |
| Average Mean Accuracy Score | 0.195 |
| Average Final Score | 0.195 |

We are also running the evals for the rest and will keep you posted!

cc: @majumderb @pclark425

tobitege (Collaborator) commented:

There seems to be a merge mishap in this PR?

suranah (Contributor) commented Oct 29, 2024

@neubig here are the results for the entire DiscoveryBench with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |

suranah (Contributor) commented Oct 29, 2024

Do you think it would be easier to start with a fresh branch to resolve the merge mishaps, @neubig @xingyaoww @tobitege?

NB: We are trying to wrap up this PR by 31st Oct, as @Ethan0456, who is leading the integration, is tied up next week with other deadlines.

neubig (Contributor) commented Oct 29, 2024

Hi @Ethan0456 and @suranah , thanks a lot!

I looked at the code and it does seem that a lot of unrelated sections have been changed, so it'd be good if you could either revert those or open up a new branch. Sorry about the hassle!

Ethan0456 (Contributor, Author) commented:

Hi @neubig,

Thank you so much for your prompt feedback! I think it would be best to open a fresh PR.

suranah (Contributor) commented Oct 30, 2024

Hey @neubig, we have opened a fresh PR here: #4627
