Formalizations around grading/reward model #162

jamesbraza · 2024-12-19T23:03:36Z

As of v0.14.0, we have a few different techniques for grading:

GSM8k is graded via string processing in its submit_answer tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/gsm8k/src/aviary/envs/gsm8k/env.py#L123-L146
HotPotQA is graded via string processing in its submit_answer tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/hotpotqa/src/aviary/envs/hotpotqa/env.py#L353-L367
paper-qa, as of paper-qa#768 ("Moved to MultipleChoiceQuestion/MultipleChoiceEvaluation from aviary"), is graded inside GradablePaperQAEnvironment.step via LLM extraction of the MC option, then string processing

In summary, we rely on Environment.step or a tool call to invoke custom grading behavior. This works fine when doing entire rollouts.
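For illustration, here is a minimal sketch of that pattern, where grading is only reachable through a tool call. It is not the actual GSM8k or HotPotQA code: `ToyQAEnv`, `toy_normalize`, and the 1.0/0.0 reward values are hypothetical stand-ins chosen for this example.

```python
import re
import string


def toy_normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


class ToyQAEnv:
    """Hypothetical environment whose only grading path is its submit tool."""

    def __init__(self, correct_answer: str) -> None:
        self.normalized_correct_answer = toy_normalize(correct_answer)
        self.reward = 0.0
        self.done = False

    def submit_answer(self, answer: str) -> str:
        """Tool the agent calls to end the episode; grading happens here."""
        correct = toy_normalize(answer) == self.normalized_correct_answer
        self.reward = 1.0 if correct else 0.0
        self.done = True
        return "Correct" if correct else "Incorrect"


env = ToyQAEnv("The Eiffel Tower")
result = env.submit_answer("eiffel tower!")
```

Because the comparison lives inside `submit_answer`, a caller that never runs a rollout has no direct way to grade a bare completion.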

However, for patterns like zero-shot evaluation (e.g. no agent or rollout involved, just an LLM prompt followed by grading), we have no standard interface to use for something like a ZeroShotEvaluator. It would be nice to build something like this, possibly:

class Environment(ABC, Generic[TEnvState]):
    ...

    # Reward to use as a placeholder without a reward model
    PLACEHOLDER_REWARD: ClassVar[float] = 0.0

    async def get_reward(self, obs: list[Message]) -> float:
        """Compute a reward given the input messages."""
        return self.PLACEHOLDER_REWARD


class HotPotQAEnv(Environment[HotPotQAEnvState]):
    ...

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume answer is in last message
        if answer is None:
            return self.incorrect_reward
        return (
            self.correct_reward
            if (
                await eval_answer(
                    normalize_answer(answer),
                    self.normalized_correct_answer,
                    self.evaluation_mode,
                )
            )
            else self.incorrect_reward
        )
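To make the intended usage concrete, here is a rough sketch of how a ZeroShotEvaluator might consume a get_reward hook like the one proposed above. Everything in it is an assumption for illustration: `Message` is a stand-in dataclass rather than aviary's class, `ExactMatchEnv` and `fake_llm` are hypothetical, and the `evaluate` method name is not an existing API.

```python
import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Stand-in for aviary's Message; only the content field is modeled."""

    content: Optional[str] = None


class ExactMatchEnv:
    """Hypothetical environment exposing the proposed get_reward hook."""

    def __init__(self, question: str, correct_answer: str) -> None:
        self.question = question
        self.correct_answer = correct_answer.strip().lower()

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume answer is in last message
        if answer is None:
            return 0.0
        return 1.0 if answer.strip().lower() == self.correct_answer else 0.0


class ZeroShotEvaluator:
    """Hypothetical evaluator: one LLM prompt per environment, then grading."""

    def __init__(self, llm: Callable[[str], Awaitable[str]]) -> None:
        self.llm = llm

    async def evaluate(self, env: ExactMatchEnv) -> float:
        completion = await self.llm(env.question)
        return await env.get_reward([Message(content=completion)])


async def fake_llm(prompt: str) -> str:
    """Canned model so the sketch runs without network access."""
    return " Paris " if "France" in prompt else "unknown"


async def main() -> list[float]:
    evaluator = ZeroShotEvaluator(fake_llm)
    envs = [
        ExactMatchEnv("What is the capital of France?", "Paris"),
        ExactMatchEnv("What is the capital of Peru?", "Lima"),
    ]
    return list(await asyncio.gather(*(evaluator.evaluate(e) for e in envs)))


rewards = asyncio.run(main())
```

The point of the sketch is that the evaluator never calls step or any tool: it only needs the question and get_reward, so the same grading logic could serve both rollouts and zero-shot runs.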

jamesbraza added the enhancement (New feature or request) label on Dec 19, 2024
