
[GRPO] add reward weight in multi-reward settings #2676

Open
wants to merge 1 commit into main

Conversation

hesamsheikh

What does this PR do?

As stated in the documentation, in multi-reward-function settings the final reward is the sum of the individual rewards. This PR adds the ability to specify reward weights in multi-reward settings, which provides much more control and flexibility over which rewards receive more emphasis. The reward weights can be floats (summing to 1 or not) or ints.

from trl import GRPOTrainer

trainer = GRPOTrainer(
    reward_funcs=[reward_func1, reward_func2],
    reward_weights=[1, 2],
    ...,
)

Before submitting

  • [✅] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [✅] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [✅] Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • [✅] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec
Member

Thanks @hesamsheikh! Do you have any references that show this method can be useful?

@Superskyyy
Contributor

What about cases where a verifiable reward should be nulled if a primary reward turns out to be unsatisfactory? For example, if you rate a code snippet by executing it and the code is not runnable at all, then all other rewards should be nulled, right? That would need more than a weight: a notion of a primary reward, if that generalizes to more use cases. A hypothetical sketch of that gating idea, written as a single custom reward function, is below.
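The helpers runs_ok and style_score are illustrative placeholders, not part of TRL; this is only a sketch of the gating behaviour, not a proposed API:

def runs_ok(code):
    """Placeholder primary check: does the snippet at least compile?"""
    try:
        compile(code, "<completion>", "exec")
        return True
    except SyntaxError:
        return False

def style_score(code):
    """Placeholder secondary reward, e.g. favouring shorter snippets."""
    return 1.0 / (1.0 + len(code))

def gated_reward(completions, **kwargs):
    """Single reward function: secondary rewards only count if the primary check passes."""
    rewards = []
    for completion in completions:
        if not runs_ok(completion):
            rewards.append(0.0)  # primary reward unsatisfied: null everything
        else:
            rewards.append(1.0 + style_score(completion))  # primary + secondary
    return rewards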

@hesamsheikh
Author

Thanks @hesamsheikh! Do you have any references that show this method can be useful?

The paper actually provides a sneak peek at how their rewards are aggregated:

Finally, we combine the accuracy of
reasoning tasks and the reward for language consistency by directly summing them to form the
final reward. We then apply RL training on the fine-tuned model until it achieves convergence
on reasoning tasks.

However, they only specify summing in the case of the accuracy reward and the language consistency reward. In the current implementation, we can provide multiple rewards in the same scope (e.g. format of the output, or accuracy), so it makes sense that a weighted rewarding system can be beneficial. In the example provided in the test file:

def reward_func1(completions, **kwargs):
    """Reward function that rewards longer completions."""
    return [float(len(completion)) for completion in completions]

def reward_func2(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    return [float(len(set(completion))) for completion in completions]

Both rewards are related to the output format, but a simple summation doesn't give us control over which is more important (e.g. when making the completions longer is much more important than having more unique letters).

The implementation of the weighted sum is straightforward:

# Sum the rewards from all reward functions
rewards = rewards_per_func.sum(dim=1)

is replaced by

rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).sum(dim=1)

This applies custom weights instead of the implicit equal weights (replacing 1 x r1 + 1 x r2 with w1 x r1 + w2 x r2), and only at the final stage (the summation of the rewards). It doesn't break the advantages or the loss function, since the weighting is applied uniformly to all the rewards. This weighted sum makes the reward aggregation much more flexible when multiple rewards are used.
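A minimal standalone sketch of that change, assuming rewards_per_func has shape (num_samples, num_reward_funcs) and reward_weights is the list passed to the trainer (the numbers are made up for illustration):

import torch

# Two samples, two reward functions.
rewards_per_func = torch.tensor([[10.0, 4.0],
                                 [20.0, 3.0]])
reward_weights = torch.tensor([1.0, 2.0])

# Current behaviour: plain sum over reward functions -> tensor([14., 23.])
rewards_unweighted = rewards_per_func.sum(dim=1)

# Proposed behaviour: w1 * r1 + w2 * r2 per sample -> tensor([18., 26.])
rewards_weighted = (rewards_per_func * reward_weights.unsqueeze(0)).sum(dim=1)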

@qgallouedec
Member

Thanks for this detailed explanation. I understand the underlying motivation. I'm wondering if it really helps to get better results, or whether a naive sum, as is done in the paper, is actually enough to get similar results.

@hesamsheikh
Author

Thanks for this detailed explanation. I understand the underlying motivation. I'm wondering if it really helps to get better results, or whether a naive sum, as is done in the paper, is actually enough to get similar results.

In cases where multiple rewards with different priorities need to be tuned, the weighted reward should be handier. I'm happy to run some experiments if you suggest some.

@Benjoyo

Benjoyo commented Jan 31, 2025

I mean, you can just pass a single aggregate reward function and do arbitrary weighting there, no?
I don’t quite understand the need for a list of functions anyway, except that it is slightly more self-documenting. But you can do the same and more with a single function, and I don’t think we should add additional parameters to fix the problems with separate reward functions. What do you think? A rough sketch of what I mean is below.
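This reuses the two example reward functions quoted earlier in the thread; the weights 1.0 and 2.0 are arbitrary and only illustrate the idea:

def aggregate_reward(completions, **kwargs):
    """Single aggregate reward: weight and combine the sub-rewards internally."""
    r1 = reward_func1(completions, **kwargs)  # the length-based reward above
    r2 = reward_func2(completions, **kwargs)  # the unique-letters reward above
    return [1.0 * a + 2.0 * b for a, b in zip(r1, r2)]

# trainer = GRPOTrainer(reward_funcs=aggregate_reward, ...)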

@Superskyyy
Contributor

Superskyyy commented Jan 31, 2025

I mean, you can just pass a single aggregate reward function and do arbitrary weighting there, no?

I don’t quite understand the need for a list of functions anyway, except that it is slightly more self-documenting. But you can do the same and more with a single function, and I don’t think we should add additional parameters to fix the problems with separate reward functions. What do you think?

Right, a custom aggregating function to rule them all, passed into the trainer, sounds like the best way to abstract this need for weighting however many functions. The trainer doesn't really need to know about so many reward functions.

@qgallouedec
Member

I don’t quite understand the need for a list of functions anyway

It allows

  1. to be compatible with reward models (you can mix functions and models)
  2. to log each reward separately
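For illustration, a rough sketch of mixing the two, in the placeholder style of the example at the top of this PR ("some-org/some-reward-model" is an illustrative reward-model id, not a real checkpoint):

from trl import GRPOTrainer

def length_reward(completions, **kwargs):
    """Plain Python reward function, logged separately from the reward model."""
    return [float(len(completion)) for completion in completions]

trainer = GRPOTrainer(
    # a pretrained reward model (by id) mixed with a custom function
    reward_funcs=["some-org/some-reward-model", length_reward],
    ...,
)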

@Benjoyo

Benjoyo commented Feb 1, 2025

I don’t quite understand the need for a list of functions anyway

It allows

  1. to be compatible with reward models (you can mix functions and models)
  2. to log each reward separately

Ok, fair points!
