[Eval] DiscoveryBench OpenHands Integration #7

Open · wants to merge 16 commits into main
Conversation

@Ethan0456 (Member) commented Oct 30, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench benchmark into OpenHands, enabling evaluation of the agent's ability to perform multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success in generating hypotheses, analyzing data, and reasoning through complex workflows. Here are the results for the DiscoveryBench test split with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |

Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by adding a structured evaluation flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of loading the dataset from Hugging Face, we clone the repository so the evaluation always uses the latest version and updates from the upstream repository.
    • process_instance function: This function encapsulates the per-instance logic: executing the instance, parsing the agent's hypothesis, and evaluating it against the gold hypothesis. (A sketch of both pieces follows this list.)
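
A minimal sketch of these two pieces, to make the flow concrete. The repository URL, the metadata column names, and the simple last-line hypothesis parser are illustrative assumptions, not the exact code in run_infer.py:

```python
import subprocess
from pathlib import Path

import pandas as pd

# Assumed upstream location; the PR may pin a specific fork or commit.
DISCOVERYBENCH_REPO = 'https://github.com/allenai/discoverybench.git'


def clone_discoverybench(workspace: Path) -> Path:
    """Clone the DiscoveryBench repository (or reuse an existing checkout)
    so the evaluation always tracks the latest upstream tasks."""
    repo_dir = workspace / 'discoverybench'
    if not repo_dir.exists():
        subprocess.run(
            ['git', 'clone', '--depth', '1', DISCOVERYBENCH_REPO, str(repo_dir)],
            check=True,
        )
    return repo_dir


def parse_hypothesis(agent_output: str) -> str:
    """Toy parser: treat the last non-empty line of the agent's final message
    as its hypothesis. The parsing in the PR is more structured."""
    lines = [line.strip() for line in agent_output.splitlines() if line.strip()]
    return lines[-1] if lines else ''


def process_instance(instance: pd.Series, agent_output: str) -> dict:
    """Per-instance logic: parse the agent's hypothesis and package it into the
    test_result dictionary. Scoring against the gold hypothesis is delegated to
    the DiscoveryBench evaluator (not shown here); column names are assumptions."""
    return {
        'instance_id': instance.get('instance_id'),
        'test_result': {
            'hypothesis': parse_hypothesis(agent_output),
            'gold_hypothesis': instance.get('gold_hypothesis'),
            # scoring fields (e.g. recall/accuracy) are added by the
            # DiscoveryBench evaluator
        },
    }
```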

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured (a sketch of the overall loop follows this list):
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent configuration: Function calling is disabled, while Jupyter and the browsing delegate are enabled in CodeActAgent's configuration.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
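
A condensed sketch of that loop, reusing the hypothetical process_instance helper from the sketch above. The metadata layout under the cloned repository is an approximation, and the Docker setup and agent invocation are elided into comments:

```python
import json
from pathlib import Path

import pandas as pd


def load_dataset(repo_dir: Path) -> pd.DataFrame:
    """Flatten DiscoveryBench task metadata into a pandas DataFrame.
    The glob pattern approximates the repository layout."""
    records = []
    for metadata_file in sorted(repo_dir.glob('discoverybench/real/test/**/metadata_*.json')):
        with open(metadata_file) as f:
            records.append({'metadata_path': str(metadata_file), **json.load(f)})
    return pd.DataFrame(records)


def run_evaluation(dataset: pd.DataFrame, output_file: Path) -> None:
    """Outer loop of the evaluation: run each instance in a fresh sandbox and
    append its result to output.jsonl."""
    with open(output_file, 'w') as out:
        for _, instance in dataset.iterrows():
            # 1. spin up a clean Docker-based runtime with the task's data files
            # 2. run CodeActAgent on the task to produce a hypothesis
            # 3. parse and score the hypothesis against the gold hypothesis
            agent_output = '...'  # final agent message from the OpenHands run (elided)
            result = process_instance(instance, agent_output)  # helper from the sketch above
            out.write(json.dumps(result) + '\n')
```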
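For completeness, a small sketch of how the averages in the table above could be computed from output.jsonl. The per-instance keys 'recall_context', 'mean_accuracy', and 'final_score' are assumed names for the values the DiscoveryBench evaluator writes into each test_result:

```python
import json


def aggregate_metrics(output_file: str = 'output.jsonl') -> dict:
    """Average per-instance scores into the summary metrics reported above."""
    totals = {'recall_context': 0.0, 'mean_accuracy': 0.0, 'final_score': 0.0}
    count = 0
    with open(output_file) as f:
        for line in f:
            test_result = json.loads(line).get('test_result', {})
            for key in totals:
                totals[key] += float(test_result.get(key, 0.0))
            count += 1
    return {key: (value / count if count else 0.0) for key, value in totals.items()}
```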

Link to any specific issues this addresses

Link to the older PR this addresses
