
[Evaluation] DiscoveryBench OpenHands Integration #4562

Open · wants to merge 36 commits into main

Conversation

Ethan0456 (Contributor)

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench Benchmark into OpenHands, enabling it to evaluate the agent's capability for multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.
https://github.com/allenai/discoverybench/
https://x.com/mbodhisattwa/status/1811524569410531333

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success in generating hypotheses, analyzing data, and reasoning through complex workflows.


Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by incorporating a structured flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of loading the dataset from Hugging Face, we clone the repository so that we always pick up the latest version and updates from upstream.
    • process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis (a minimal sketch of the parsing step follows below).
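
A minimal, hypothetical sketch of the hypothesis-parsing step mentioned above. The `Hypothesis:` marker, the fallback behaviour, and the function name are assumptions for illustration only, not the actual code in run_infer.py:

```python
# Hypothetical sketch of the hypothesis-parsing step inside process_instance.
# The "Hypothesis:" marker and the fallback behaviour are assumptions for
# illustration; the actual parsing logic lives in run_infer.py.
import re


def parse_hypothesis(final_message: str) -> str:
    """Extract the hypothesis text from the agent's final message."""
    match = re.search(r'Hypothesis:\s*(.+)', final_message, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to the whole message if no explicit marker is present.
    return final_message.strip()


if __name__ == '__main__':
    message = (
        'After analysing the data, my final answer is below.\n'
        'Hypothesis: Higher education levels are associated with lower incarceration rates.'
    )
    print(parse_hypothesis(message))
```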

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured:
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review (see the sketch after this list).
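
For orientation, here is a high-level, hypothetical sketch of the flow the list above describes. The repository layout (`metadata_*.json` files), field names, and the stubbed per-instance step are assumptions for illustration; the real run_infer.py plugs these steps into the OpenHands evaluation harness and a per-task Docker runtime:

```python
# Hypothetical end-to-end sketch of the run_infer.py flow: clone DiscoveryBench,
# load task metadata into a pandas DataFrame, iterate over instances, and write
# one JSON line per result. Paths and field names are assumptions.
import json
import subprocess
from pathlib import Path

import pandas as pd

REPO_URL = 'https://github.com/allenai/discoverybench.git'
WORKDIR = Path('discoverybench')
OUTPUT_FILE = Path('output.jsonl')


def clone_benchmark() -> None:
    """Clone the upstream repo so evaluation always uses the latest tasks."""
    if not WORKDIR.exists():
        subprocess.run(['git', 'clone', REPO_URL, str(WORKDIR)], check=True)


def load_instances() -> pd.DataFrame:
    """Collect task metadata files into a DataFrame (file layout assumed here)."""
    records = []
    for meta_path in sorted(WORKDIR.rglob('metadata_*.json')):
        with open(meta_path) as f:
            records.append({'metadata_path': str(meta_path), **json.load(f)})
    return pd.DataFrame(records)


def main() -> None:
    clone_benchmark()
    instances = load_instances()
    with open(OUTPUT_FILE, 'w') as out:
        for _, instance in instances.iterrows():
            # The real integration spins up a Docker runtime here, invokes the
            # OpenHands agent, parses its hypothesis, and scores it against the
            # gold hypothesis; this sketch only records which task was seen.
            test_result = {'metadata_path': instance['metadata_path']}
            out.write(json.dumps(test_result) + '\n')


if __name__ == '__main__':
    main()
```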

Link of any specific issues this addresses

@neubig self-requested a review on October 25, 2024

neubig (Contributor) left a comment:

Hey @Ethan0456 , thanks so much for this, this is exciting!

One question: did you run any of the agents against discoverybench yet? If so, it'd be good to know the results.

We just made a major improvement to the agents, so we'll probably want to run it again, but I just wanted to check for now: https://x.com/gneubig/status/1849874034810618180

suranah (Contributor) commented Oct 28, 2024

Thanks for updating us about the agent improvements, @neubig. We added code to accommodate the runtime changes.

We have run our agents on a subset of DiscoveryBench (nls_incarceration) with gpt-4o. Here are the results for the 28 instances:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.321 |
| Average Mean Accuracy Score | 0.195 |
| Average Final Score | 0.195 |

We are also running the evals for the rest and will keep you posted!

cc: @majumderb @pclark425

tobitege (Collaborator) commented:

There seems to be a merge mishap in this PR?

suranah (Contributor) commented Oct 29, 2024

@neubig here are the results for the entire DiscoveryBench with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |

suranah (Contributor) commented Oct 29, 2024

Do you think it would be easier to start with a fresh branch to resolve the merge mishaps, @neubig @xingyaoww @tobitege?

NB: We are trying to wrap up this PR by 31st Oct, as @Ethan0456, who is leading the integration, is tied up next week with other deadlines.

neubig (Contributor) commented Oct 29, 2024

Hi @Ethan0456 and @suranah , thanks a lot!

I looked at the code and it does seem that a lot of unrelated sections have been changed, so it'd be good if you could either revert those or open up a new branch. Sorry about the hassle!

Ethan0456 (Contributor, Author) commented:

Hi @neubig,

Thank you so much for your prompt feedback! I think it would be best to open a fresh PR.

suranah (Contributor) commented Oct 30, 2024

Hey @neubig, we have opened a fresh PR here: #4627
