
Why does SB3's DQN fail on a custom environment but SB2's DQN does not? #223

nbro opened this issue Nov 16, 2020 · 5 comments

nbro commented Nov 16, 2020

There are several issues related to the performance of SB2 and SB3, such as this one. Here, I am specifically focusing on DQN's behavior. I am using a custom environment (simple 4x4 grid world where the goal is to get from one cell to another). I am using the equivalent code in SB2 and SB3 to train and evaluate the RL model/algorithm.
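For context, the environment looks roughly like this. This is a simplified, illustrative sketch using the old gym API; my actual implementation differs in the details (reward shaping, invalid-move handling, etc.):

import gym
import numpy as np
from gym import spaces

class GridWorldEnv(gym.Env):
    # Illustrative 4x4 grid world: start at (0, 0), reach (3, 3).
    def __init__(self, max_steps=50):
        super().__init__()
        self.action_space = spaces.Discrete(4)  # up, down, left, right
        self.observation_space = spaces.Box(low=0, high=3, shape=(2,), dtype=np.float32)
        self.max_steps = max_steps
        self.pos = np.zeros(2, dtype=np.int64)
        self.steps = 0

    def reset(self):
        self.pos = np.zeros(2, dtype=np.int64)
        self.steps = 0
        return self.pos.astype(np.float32)

    def step(self, action):
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.pos = np.clip(self.pos + moves[int(action)], 0, 3)
        self.steps += 1
        reached_goal = bool((self.pos == 3).all())
        done = reached_goal or self.steps >= self.max_steps  # episodes also time out
        reward = 1.0 if reached_goal else 0.0
        return self.pos.astype(np.float32), reward, done, {}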

Specifically, this is the code I am using with SB2:

from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.bench.monitor import Monitor
from stable_baselines.results_plotter import X_TIMESTEPS, plot_results
from stable_baselines.deepq.dqn import DQN
from stable_baselines.deepq.policies import MlpPolicy

...

model = DQN(MlpPolicy, env, verbose=1, exploration_fraction=0.1)
model.learn(total_timesteps=20000)

And the (supposedly) equivalent SB3 code differs only in the imports, which are:

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import X_TIMESTEPS, plot_results
from stable_baselines3.dqn.dqn import DQN
from stable_baselines3.dqn.policies import MlpPolicy

With SB2, after training, my model regularly achieves the best performance (it reaches the goal location and gets the highest amount of reward). With SB3, on the other hand, the model is never able to reach the goal during evaluation (with the same number of time steps, or even if I increase it). I am not sure why. Clearly, there are big differences between SB2 and SB3, apart from the fact that SB2 uses TF 1 and SB3 uses PyTorch.

However, it is also true that, during training, the SB3 implementation does eventually reach the goal location (according to the reward received, which I am plotting with plot_results after having kept track of it with Monitor). Yet, as I just said, during evaluation it sometimes just gets stuck, repeatedly taking the same apparently invalid action.
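For reference, this is roughly how I set up the reward logging (the log folder name is just an example, and GridWorldEnv is the sketch from above):

import os
import matplotlib.pyplot as plt

log_dir = "/tmp/gridworld_dqn/"
os.makedirs(log_dir, exist_ok=True)

# Monitor records per-episode rewards and lengths to a CSV file in log_dir.
env = Monitor(GridWorldEnv(), log_dir)

model = DQN(MlpPolicy, env, verbose=1, exploration_fraction=0.1)
model.learn(total_timesteps=20000)

# Plot the episode reward against the number of timesteps.
plot_results([log_dir], 20000, X_TIMESTEPS, "DQN on a 4x4 grid world")
plt.show()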

Here's the code used during evaluation to take actions (for both SB2 and SB3):

action, hidden_states = model.predict(next_observation, deterministic=True)
next_observation, reward, done, info = env.step(action)

(Also, weirdly enough, sometimes done = True but the final reward is zero, although it should be 1 in that case; that is another issue, though.)
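For completeness, the same check can also be done with the imported evaluate_policy helper instead of the manual loop (10 episodes is an arbitrary choice):

# Average return over a few deterministic evaluation episodes.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")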


Update 1

Actually, the SB2 version now also fails. This happened after I installed SB3 as well, but I have since created a new environment without SB3, so SB3 itself is not the issue. I know that these algorithms are stochastic, but it seems strange that the results can be completely different from one run to the next, in such a simple environment, after so many time steps (i.e. 20000).

Now, I increased the number of time steps to 30000, and the SB2 version seems to work again (but maybe it will fail again in a moment, lol). Btw, the SB3 version still fails.


araffin commented Nov 16, 2020

Hello,

Short answer: as mentioned several times, we do not do tech support; please read the RL Tips and Tricks and the migration guide carefully. Next time, please also use the issue template ;)

Long answer:

There are several issues related to the performance of SB2 and SB3, such as this one.

Performance was checked.

I am using the equivalent code in SB2 and SB3 to train and evaluate the RL model/algorithm.

Please read the migration guide

SB2 and SB3 DQN are quite different if you use the default hyperparameters.

Now, I increased the number of time steps to 30000

30k steps is not much, and you will probably need to do some hyperparameter tuning.

Last but not least, take your time. You have opened many issues in the last few days. As mentioned in the doc, you should probably do some hyperparameter tuning and not expect everything to work out of the box without any tuning.

So, if you think there is an issue with SB3, please fill in the issue template completely (so we can reproduce the potential problem), but take your time; we do not do tech support.


nbro commented Nov 16, 2020

@araffin I knew about that migration guide and the "tips and tricks" (though I have not read them yet), but when someone who uses SB2 switches to SB3 (thinking SB3 is the better version because of the 3 instead of the 2) and nothing works, that is an issue, even if you think it is not.

30k steps is not much, and you will probably need to do some hyperparameter tuning.

It's a 4x4 grid, as I wrote above. I think that's a lot of time steps for almost the simplest environment that you can imagine.

SB2 and SB3 DQN are quite different if you use the default hyperparameters.

Why? Maybe you should call SB3 by a different name, just saying.

Last but not least, take your time. You have opened many issues in the last few days.

Yes, and none of them could be answered just by googling. That's why I opened them, so that they can also be useful to future users. So I don't really see the problem here. If you do not have time to answer, just don't answer.

@nbro nbro closed this as completed Nov 16, 2020

araffin commented Nov 16, 2020

Why? Maybe you should call SB3 by a different name, just saying.

The issue comes from SB2 in fact. It should have been called "double-dueling-prioritized" DQN.
As for the why, the answer is in the migration guide that you are aware of ;) (and in my previous message, too).

It's a 4x4 grid, as I wrote above. I think that's a lot of time steps for almost the simplest environment that you can imagine.

Well, if you use the default hyperparams tuned for Atari, there is no guarantee that it will work with a small number of timesteps.
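For a 4x4 grid world, that usually means shrinking the Atari-scale defaults considerably. A rough sketch of the direction to go (the values below are illustrative, not tuned for your task):

from stable_baselines3 import DQN

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,          # default is 1e-4
    buffer_size=10_000,          # a small replay buffer is enough here
    learning_starts=1_000,       # start learning after a short warm-up
    train_freq=1,                # update after every environment step
    target_update_interval=500,  # the default 10_000 is far too slow for 30k steps
    exploration_fraction=0.3,    # explore for a larger fraction of training
    exploration_final_eps=0.05,
    verbose=1,
)
model.learn(total_timesteps=30_000)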


nbro commented Nov 16, 2020

The issue comes from SB2 in fact. It should have been called "double-dueling-prioritized" DQN.

I am not sure why SB3 only provides vanilla DQN, while double and dueling DQN were proposed precisely as improvements over the vanilla version (though I am not claiming that these variants perform better than DQN in all cases). Moreover, SB2 has an option to disable double DQN, but apparently no option to disable dueling (or is there such an option and I missed it?). Anyway, I disabled double DQN (i.e. the double-Q target described in the paper, I suppose), and that makes the SB2 version fail, so I suppose that double DQN performs better than vanilla DQN on this simple task. I would also note that DQN was originally proposed with experience replay, so any implementation of DQN is expected to provide ER anyway.
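For reference, this is how I disabled double DQN in SB2; the dueling part is my assumption about the API (policy_kwargs forwarded to MlpPolicy), which I have not verified:

# SB2: DQN with the double-Q target disabled.
# dueling=False is an assumption about how to switch off the dueling head;
# policy_kwargs is forwarded to MlpPolicy.
model = DQN(
    MlpPolicy,
    env,
    double_q=False,
    policy_kwargs=dict(dueling=False),
    verbose=1,
    exploration_fraction=0.1,
)
model.learn(total_timesteps=20000)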

Well, if you use the default hyperparams tuned for Atari, there is no guarantee that it will work with a small number of timesteps.

Ok. I suppose I will need to do some tuning to achieve the same thing in SB3, but not providing an implementation of double DQN or dueling DQN seems like a drawback of SB3. Double DQN just uses the online network to select the next action and the target network to evaluate it when computing the target, so I am not sure why at least DDQN was not implemented.
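To make the difference concrete, here is a rough PyTorch sketch of the two targets; the helper and its arguments are hypothetical stand-ins, not SB3's actual code:

import torch

def td_targets(q_net, q_net_target, next_obs, rewards, dones, gamma=0.99):
    # Compute both the vanilla-DQN and the double-DQN TD targets.
    with torch.no_grad():
        # Vanilla DQN: the target network both selects and evaluates the next action.
        dqn_next_q = q_net_target(next_obs).max(dim=1).values
        dqn_target = rewards + gamma * (1.0 - dones) * dqn_next_q

        # Double DQN: the online network selects the action,
        # the target network evaluates it.
        next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
        ddqn_next_q = q_net_target(next_obs).gather(1, next_actions).squeeze(1)
        ddqn_target = rewards + gamma * (1.0 - dones) * ddqn_next_q
    return dqn_target, ddqn_target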


araffin commented Nov 16, 2020

, so I am not sure why at least DDQN was not implemented.

#1
