Currently, as of v0.14.0, we have a few different techniques for grading:

- `submit_answer` tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/gsm8k/src/aviary/envs/gsm8k/env.py#L123-L146
- `submit_answer` tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/hotpotqa/src/aviary/envs/hotpotqa/env.py#L353-L367
- paper-qa: as of "Moved to MultipleChoiceQuestion/MultipleChoiceEvaluation from aviary" (paper-qa#768), grading happens inside `GradablePaperQAEnvironment.step` via LLM extraction of the MC option, then string processing

In summary, we rely on `Environment.step` or a tool call to invoke a custom grading behavior. This works fine when doing entire rollouts.
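For concreteness, here is a minimal toy sketch of that pattern; the class and method names below are illustrative stand-ins, not aviary's actual `Environment`/`Tool` API. The point is that the reward only materializes when the agent's tool call reaches the environment:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins only -- not aviary's real classes or signatures.
@dataclass
class ToyGradedEnv:
    """Toy environment where grading lives inside a submit_answer tool call."""

    correct_answer: str
    correct_reward: float = 1.0
    incorrect_reward: float = 0.0
    done: bool = field(default=False, init=False)

    def submit_answer(self, answer: str) -> float:
        """Tool body: grade the proposed answer and end the episode."""
        self.done = True
        is_correct = answer.strip().lower() == self.correct_answer.strip().lower()
        return self.correct_reward if is_correct else self.incorrect_reward

    def step(self, tool_name: str, **kwargs) -> tuple[float, bool]:
        """Dispatch a tool call; reward is only produced via the grading tool."""
        if tool_name == "submit_answer":
            return self.submit_answer(**kwargs), self.done
        return 0.0, self.done  # Non-grading tools yield no reward


# A full rollout (ending in the tool call) is required before any reward exists.
env = ToyGradedEnv(correct_answer="Paris")
reward, done = env.step("submit_answer", answer="paris")
assert done and reward == 1.0
```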
However, when trying to do patterns like zero-shot evaluation (e.g. no agent/rollout involved, just an LLM prompt then grading), we have no standard interface to use for something like a `ZeroShotEvaluator`. It would be nice to build something like this, possibly:
```python
class Environment(ABC, Generic[TEnvState]):
    ...

    # Reward to use as a placeholder without a reward model
    PLACEHOLDER_REWARD: ClassVar[float] = 0.0

    async def get_reward(self, obs: list[Message]) -> float:
        """Compute a reward given the input messages."""
        return self.PLACEHOLDER_REWARD


class HotPotQAEnv(Environment[HotPotQAEnvState]):
    ...

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume answer is in last message
        if answer is None:
            return self.incorrect_reward
        return (
            self.correct_reward
            if await eval_answer(
                normalize_answer(answer),
                self.normalized_correct_answer,
                self.evaluation_mode,
            )
            else self.incorrect_reward
        )
```
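With a hook like this in place, a `ZeroShotEvaluator` could grade a plain prompt-then-answer exchange without any agent or rollout. Below is a rough sketch of what that might look like, assuming `get_reward` exists as proposed above; the `llm_call` callable is a placeholder for whatever LLM client is in use (not an aviary API), and the `Message` construction assumes aviary's role/content fields.

```python
from collections.abc import Awaitable, Callable

from aviary.core import Message  # Assumes Message(role=..., content=...) is available


class ZeroShotEvaluator:
    """Sketch: grade a single prompt -> completion exchange via Environment.get_reward."""

    def __init__(self, llm_call: Callable[[str], Awaitable[str]]):
        # Placeholder LLM client: prompt string in, completion string out
        self.llm_call = llm_call

    async def evaluate(self, env: "Environment", prompt: str) -> float:
        """Prompt the LLM once (no agent, no rollout), then grade the exchange."""
        completion = await self.llm_call(prompt)
        obs = [
            Message(role="user", content=prompt),
            Message(role="assistant", content=completion),
        ]
        return await env.get_reward(obs)


# Usage idea: rewards = [await evaluator.evaluate(env, q) for q in prompts]
```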