Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete relabelling of trajectories in HER #96

Closed
rstrudel opened this issue Feb 5, 2020 · 5 comments
Closed

Incomplete relabelling of trajectories in HER #96

rstrudel opened this issue Feb 5, 2020 · 5 comments

Comments

@rstrudel
Copy link

rstrudel commented Feb 5, 2020

Hi,

First, thanks a lot for releasing such a nice repository. I have been using it for a few months now and I appreciate that it is quite well written and most of the code is self explanatory. I learned a lot just by using it.

I am using SAC-HER and got a lot of divergence issues which I fixed in the end. One of the main problem came from the relabelling of samples in the buffer: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/data_management/obs_dict_replay_buffer.py#L228-L239
Given a batch, a set of new rewards is computed according to the updated set of goals. However the terminals are not updated, whereas some states might be terminal given the new goal.

And this has its importance in the Bellman update where the terminals variable appear https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/torch/sac/sac.py#L128

If the reward if of the form -distance(state, goal) and an episode is only finished because of the maximum path length, then not updating the terminals will have little impact. It may be why this bug passed silently. However I am working with a spare reward which is 1 if distance(state, goal) < epsilon. And in this case, if terminals are not updated then the Q-function blows up. Indeed, if we assume that target_q_values = 1 at the goal, if terminals = 0 then q_target = 2, at the next iteration q_target = 3 and so on. If terminals = 1, i.e. if the state is terminal according to the resampled goal then q_target = 1.

So in my fork of your repository, I replaced:

new_rewards = self.env.compute_rewards(
                new_actions,
                new_next_obs_dict,
            )

by

new_rewards, new_terminals = self.env.compute_rewards(
                new_actions,
                new_next_obs_dict,
            )

where terminals is 1 if distance(state, goal) < epsilon in my case. This fixed the Q-function blow-up issue.

@rstrudel
Copy link
Author

rstrudel commented Feb 5, 2020

I hope this makes sense, I did not propose a PR because it will break compatibility with existing environments.

@vitchyr
Copy link
Collaborator

vitchyr commented Feb 11, 2020

That's a good point! That also makes a lot of sense as to why updating the termination conditioned would help.

One thing that's ambiguous is whether or not the episode should end when the state is reached. Some authors assume that it should while others assume that it shouldn't. For example, the HER code released by the authors don't seem to update the termination condition.

At this point, I'm not sure if I'd feel comfortable making this change for everyone, since it may not be the intended behavior and it requires changing the compute_rewards interface. Perhaps if you make it an option inside of the HER code in a PR, that could be nice a way to make it optional without requiring the interface to change too drastically.

For now I'll pin this issue so that it has more visibility, and feel free to make a PR as suggested above.

@vitchyr vitchyr closed this as completed Feb 11, 2020
@vitchyr vitchyr pinned this issue Feb 11, 2020
@rstrudel
Copy link
Author

rstrudel commented Mar 5, 2020

Thanks @vitchyr for this insightful answer.

One thing that's ambiguous is whether or not the episode should end when the state is reached.

Agreed. Indeed, HER authors have tested the algorithm on environments like gym fetch where an episode terminates only once the horizon is reached. Also, whether it is sparse or dense, they use rewards that are negative and converge to 0 when the state is getting closer to the goal, which avoid blow up issues such as the one I pointed out when the reward is positive near the goal.

@bycn
Copy link

bycn commented Jun 9, 2020

@rstrudel Did you find that the horizon termination made a difference in your experiments? (Only that part, not the -1 and 0 part)
Or if anyone knows if this is a big factor/ point me towards some papers that address this, please let me know

@rstrudel
Copy link
Author

rstrudel commented Sep 1, 2020

@bycn for my application which is neural motion planning, it did make a difference. In my experiments I got the best results setting the termination when the goal is reached and providing a positive reward in this case. It might be quite experiment dependent however.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants