Proposal: Multi-turn support #3059

Open

yifanmai (Collaborator) opened this issue Oct 10, 2024 · 0 comments

Background

Currently, HELM only supports single-turn scenarios. The main evaluation flow in Runner.run_one() works as follows (see the sketch after this list):

  1. Get instances from the scenario
  2. Get prompts for the instances from the adapter
  3. Run prompts through the model to get outputs
  4. Score outputs on metrics
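
As a rough illustration, this single-turn flow looks roughly like the sketch below; the helper names are hypothetical placeholders, not the actual HELM internals:

def run_one(adapter, executor, scenario, parallelism):
    # Step 1: get instances from the scenario.
    instances = scenario.get_instances()
    # Step 2: get prompts (request states) for the instances from the adapter.
    scenario_state = adapter.adapt(instances, parallelism)
    # Step 3: run the prompts through the model to get outputs.
    scenario_state = executor.execute(scenario_state)
    # Step 4: score the outputs on metrics.
    return compute_metrics(scenario_state)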

However, some scenarios need to look at outputs from the model, and then send more prompts to the model. Examples:

  • In MT-Bench, the model is given an initial prompt, and then it is given a follow-up prompt that includes the context of the first interaction.
  • In SOTOPIA, two models converse for several turns, and a third model grades the conversation.

Proposal

Add a loop to Runner.run_one() around steps 2 and 3. After running prompts through the model, call the adapter again to get additional prompts, and then run those through the model. Repeat until the adapter stops producing new prompts.

Concretely, the Adapter will have a new method for generating more request states:

from abc import ABC
from typing import List

# Instance, ScenarioState, and RequestState are existing HELM types
# (from the helm.benchmark package).
class Adapter(ABC):
    def adapt_next_turn(self, instances: List[Instance], scenario_state: ScenarioState, parallelism: int) -> List[RequestState]:
        """
        Takes `Instance`s and all previous `RequestState`s, and returns new `RequestState`s.
        Returning an empty list means the adapter has no further turns to add.
        """
        return []
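
For illustration, the loop around steps 2 and 3 might look roughly like the sketch below; apart from adapt_next_turn(), the names are illustrative pseudocode rather than the actual Runner.run_one() code:

import dataclasses

def run_turns(adapter, executor, instances, parallelism):
    # Turn 1: adapt instances into prompts and run them through the model.
    scenario_state = adapter.adapt(instances, parallelism)
    scenario_state = executor.execute(scenario_state)
    while True:
        # Ask the adapter for the next turn, given all previous request states.
        next_request_states = adapter.adapt_next_turn(instances, scenario_state, parallelism)
        if not next_request_states:
            break  # the adapter produced no new prompts, so the conversations are done
        # Append the new turn's request states and run them through the model
        # (assumes ScenarioState is a dataclass).
        scenario_state = dataclasses.replace(
            scenario_state,
            request_states=list(scenario_state.request_states) + next_request_states,
        )
        scenario_state = executor.execute(scenario_state)
    return scenario_state  # step 4 (metrics) then proceeds as before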

In the frontend, all requests belonging to the same conversation will be grouped under their instance and displayed together.

Pros

Because the next turn for all instances is generated and processed simultaneously in each iteration of the loop, we get thread-level parallelism when sending requests to the models.

Cons

The API is somewhat unnatural if the user is thinking in terms of the chronology of a single conversation, since the method operates on all conversations at once. The user thus needs to perform some internal bookkeeping. We could alleviate this by providing bookkeeping utilities as utility classes or functions.
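
For example, one such (hypothetical) utility could group the accumulated request states into per-conversation histories, keyed by instance id:

from collections import defaultdict
from typing import Dict, List

def group_request_states_by_instance(scenario_state) -> Dict[str, List]:
    """Group all previous RequestStates by instance id, one list per conversation."""
    conversations: Dict[str, List] = defaultdict(list)
    for request_state in scenario_state.request_states:
        conversations[request_state.instance.id].append(request_state)
    return conversations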

This doesn't address the high storage cost of multi-turn conversations, which is O(N^2) in the number of turns N: we keep all N requests, and the length of the Nth request grows with N because it includes all previous context.
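For example, a 10-turn conversation with turns of roughly equal length stores about 1 + 2 + … + 10 = 55 turns' worth of text across its requests, i.e. N(N+1)/2 rather than N.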
