SAC implementation is 2x slower than in stable-baselines #122
Comments
Hi, this was discussed in several threads: the maintainers are aware of the wall-clock time discrepancy between SB2 and SB3, and it will be addressed as we go towards v1.0.
I was not aware of that, thank you!
The first bottleneck is in #112. I have some ideas about further optimisations, but I'd prefer to investigate them first before drawing any conclusions.
Related issue: #90

Trying to reproduce your results on CPU (8-core laptop), I had similar speed with SB2 vs SB3.

SB3 results:

```python
import torch as th
from stable_baselines3 import SAC

th.set_num_threads(2)
th.set_num_interop_threads(2)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256]},
            learning_starts=0)
model.learn(total_timesteps=1000000, log_interval=4)

# 5k steps
# num_threads=1    55 FPS (interop=None)
# num_threads=1    53 FPS (interop=2)
# num_threads=2    58 FPS (interop=2)
# num_threads=2    51 FPS (interop=None)
# num_threads=3    48 FPS (interop=3)
# num_threads=3    47 FPS (interop=None)
# num_threads=None 52 FPS (interop=None)
```

SB2 results:

```python
from stable_baselines import SAC

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256]},
            learning_starts=0,
            n_cpu_tf_sess=4)
model.learn(total_timesteps=1000000, log_interval=4)

# 5k steps
# n_cpu_tf_sess=1    32 FPS
# n_cpu_tf_sess=2    46 FPS
# n_cpu_tf_sess=4    58 FPS
# n_cpu_tf_sess=None 60 FPS
```

Note: I did not need the …
Here is the Colab notebook: https://colab.research.google.com/drive/1GWEmTGczzOZjMIZuqNBysjVPMlyGXSx1?usp=sharing

The result: "SB2 is 1.28x faster than SB3". So there is a slowdown, but not as bad as mentioned. I will try with a GPU-enabled Colab and a smaller network architecture (to see if there is an influence). EDIT: I got the same slowdown magnitude on GPU and with a smaller network (SB2 is 1.2x faster).
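For reference, the comparison in the notebook boils down to timing `model.learn` for both libraries with identical hyperparameters. A minimal sketch of such a harness (the helper `time_learn` is hypothetical, not the notebook's actual code):

```python
import time

def time_learn(model, total_timesteps=50000):
    # Wall-clock a training run and return throughput in steps per second.
    start = time.perf_counter()
    model.learn(total_timesteps=total_timesteps, log_interval=4)
    return total_timesteps / (time.perf_counter() - start)

# Hypothetical usage: build SB2 and SB3 models with identical
# hyperparameters (as in the snippets above), then compare:
# fps_sb3 = time_learn(sb3_model)
# fps_sb2 = time_learn(sb2_model)
# print(f"SB2 is {fps_sb2 / fps_sb3:.2f}x faster than SB3")
```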
Update: after upgrading to PyTorch 1.6, the gap seems to be closed; I updated the notebook accordingly. @Miffyli that may interest you too ;) EDIT: apparently on CPU only.
@zajaczajac I will close this issue, as the slowdown is not as bad as mentioned (1.02x slower on CPU and 1.2x slower on GPU, see the Colab notebooks) and apparently comes mostly from PyTorch.
With PyTorch 1.7.1 this became an issue again. I see slowdowns of over 50% with SB3 vs. SB2, both for GPU and CPU training on equivalent model definitions.
Using the Colab notebook (linked above), I have similar results on my laptop (CPU) and on GPU Colab.
PyTorch 1.8.0 seems to improve things: "SB2 is 1.40x faster than SB3" (GPU, on Colab).
PyTorch 1.11 (with longer training for better comparison):
Hello,
First of all, thanks for working on this awesome project!
I've tried to use the SAC implementation and noticed that it works much slower than the TF1 version from stable-baselines.
Here is the code for the minimal stable-baselines3 example:
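(The snippet itself did not survive the formatting; a minimal sketch of what it looked like, assuming the same hyperparameters as the reproduction script in the comments above:)

```python
from stable_baselines3 import SAC

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256]},
            learning_starts=0)
model.learn(total_timesteps=1000000, log_interval=4)
```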
Here is the corresponding stable-baselines (TF1) example:
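(Likewise lost in formatting; a minimal sketch, assuming the settings from the SB2 reproduction above. Note the TF1 API takes `layers` instead of `net_arch` in `policy_kwargs`:)

```python
from stable_baselines import SAC

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256]},
            learning_starts=0)
model.learn(total_timesteps=1000000, log_interval=4)
```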
I set the same architecture, number of updates, and batch size, so all relevant settings appear identical. However, for the PyTorch version I get ~45 FPS, and for the TF1 one ~90 FPS.
System Info
Libraries are installed from pip: the newest stable-baselines and stable-baselines3, PyTorch 1.5.1, TensorFlow 1.15.0. I run on CPU. This was run on a MacBook Pro; I also got similar results on another Linux machine.
Note that I also tried manipulating the number of CPU cores, but even the best setting for PyTorch is still 2x slower.