A question about rl training function #49

ZefanW · 2019-05-05T01:06:11Z

    for action, p, r, b in zip(indices, probs, reward, baseline):
        advantage = r - b
        avg_advantage += advantage
        losses.append(-p.log_prob(action) * (advantage/len(indices)))  # divide by T*B

I have a question about this piece of code.
If I understand correctly, the variable b here is a tensor with gradients enabled, so minimizing the entries of losses does two things at once: it increases the reward by updating the policy weights, and it shrinks the advantage by pushing the baseline up. I don't understand why the baseline is optimized here, because as far as I know it should only be updated when training the critic.
I also used this training function on a different summarization task and found that avg_advantage keeps dropping.
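A minimal sketch of this concern, assuming PyTorch and a single decision step. The names `logits`, `baseline`, and `reward` are hypothetical stand-ins for the policy output, the critic's prediction b, and the reward r; they are not taken from the repository.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical stand-ins (assumptions, not the repository's variables).
logits = torch.zeros(3, requires_grad=True)       # policy parameters
baseline = torch.tensor(0.2, requires_grad=True)  # critic prediction b
reward = torch.tensor(1.0)                        # observed reward r

dist = Categorical(logits=logits)
action = dist.sample()

advantage = reward - baseline               # still attached to the critic's graph
loss = -dist.log_prob(action) * advantage
loss.backward()

# d(loss)/d(b) equals log_prob(action), which is negative for a uniform policy,
# so a gradient-descent step on this loss raises b and lowers the advantage.
leaks_into_baseline = baseline.grad is not None and baseline.grad.item() != 0.0
print(baseline.grad)
```

Because the gradient with respect to the baseline is non-zero, any optimizer that holds the critic's parameters would also update them through the policy loss.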
Thank you very much.

ZefanW · 2019-05-05T01:06:54Z

I changed r-b to (r-b).item(), and it seems alright.
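A small sketch of that change, assuming PyTorch. `.detach()` is used here as an equivalent alternative to `.item()` that keeps the advantage as a tensor; the variable names are hypothetical and only illustrate the idea.

```python
import torch
from torch.distributions import Categorical

# Hypothetical stand-ins (assumptions, not the repository's variables).
logits = torch.zeros(3, requires_grad=True)       # policy parameters
baseline = torch.tensor(0.2, requires_grad=True)  # critic prediction b
reward = torch.tensor(1.0)                        # observed reward r

dist = Categorical(logits=logits)
action = torch.tensor(1)

# Cut the graph so the advantage acts as a constant weight on the log-probability.
advantage = (reward - baseline).detach()    # (reward - baseline).item() also works
policy_loss = -dist.log_prob(action) * advantage
policy_loss.backward()

policy_got_grad = logits.grad is not None   # the policy is still updated
baseline_untouched = baseline.grad is None  # no gradient reaches the critic
print(policy_got_grad, baseline_untouched)
```

With the advantage detached, the baseline is only updated by whatever separate loss is used to train the critic.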

ChenRocks · 2019-05-08T07:15:42Z

Thanks for pointing this out! I think your solution should work as intended. I will test how this affects the results when I have time.
