[Evaluation] DiscoveryBench OpenHands Integration #4562
Hey @Ethan0456, thanks so much for this, this is exciting!
One question: have you run any of the agents against DiscoveryBench yet? If so, it'd be good to know the results.
We just made a major improvement to the agents, so we'll probably want to run it again, but I just wanted to check for now: https://x.com/gneubig/status/1849874034810618180
…ates (All-Hands-AI#4564) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… in the version name (All-Hands-AI#4580)
Thanks for updating us about the agent improvements, @neubig. We added code to accommodate the We have run our agents for a subset of discoverybench (
We are also running the evals for the rest and will keep you posted! cc: @majumderb @pclark425
There seems to be a merge mishap in this PR?
@neubig here are the results for the entire DiscoveryBench with
Do you think it would be easier to start with a fresh branch to manage the merge mishaps, @neubig @xingyaoww @tobitege? NB: We are trying to wrap up this PR by 31st Oct, as @Ethan0456, who is leading the integration, is tied up next week with other deadlines.
Hi @Ethan0456 and @suranah, thanks a lot! I looked at the code and it does seem that a lot of unrelated sections have been changed, so it'd be good if you could either revert those or open up a new branch. Sorry about the hassle!
Hi @neubig, Thank you so much for your prompt feedback! I think it would be best to open a fresh PR.
End-user friendly description of the problem this fixes or functionality that this introduces
This PR integrates the DiscoveryBench Benchmark into OpenHands, enabling it to evaluate the agent's capability for multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.
https://github.com/allenai/discoverybench/
https://x.com/mbodhisattwa/status/1811524569410531333
With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success in generating hypotheses, analyzing data, and reasoning through complex workflows.
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR integrates DiscoveryBench into OpenHands by incorporating a structured flow that allows the OpenHands agent to interact with DiscoveryBench tasks.
Non-trivial design decisions:
How we structured everything in run_infer.py
test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
Link of any specific issues this addresses