SAC implementation is 2x slower than in stable-baselines #122

Closed
michalzajac-ml opened this issue Jul 24, 2020 · 11 comments

Comments

@michalzajac-ml commented Jul 24, 2020

Hello,
First of all, thanks for working on this awesome project!
I've tried the SAC implementation and noticed that it runs much slower than the TF1 version in stable-baselines.
Here is the code for the minimal stable-baselines3 example:

import os

import gym
import torch
from stable_baselines3 import SAC
from stable_baselines3.sac.policies import MlpPolicy

os.environ['CUDA_VISIBLE_DEVICES'] = ''

torch.set_num_threads(2)

env = gym.make('Pendulum-v0')

model = SAC(MlpPolicy, env, verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256],
                           'activation_fn': torch.nn.ReLU})
model.learn(total_timesteps=1000000, log_interval=10)

Here is the corresponding stable-baselines (TF1) example:

import os

import gym
import tensorflow as tf
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

os.environ['CUDA_VISIBLE_DEVICES'] = ''

env = gym.make('Pendulum-v0')

model = SAC(MlpPolicy, env, verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256], 'act_fun': tf.nn.relu},
            n_cpu_tf_sess=2)
model.learn(total_timesteps=1000000, log_interval=10)

I set the same architecture, number of updates, and batch size, so all relevant settings appear identical. However, the PyTorch version gives ~45 FPS, while the TF1 version gives ~90 FPS.

System Info
Libraries are installed from pip: the newest stable-baselines and stable-baselines3, PyTorch 1.5.1, TensorFlow 1.15.0. Everything runs on CPU. The numbers above are from a MacBook Pro; I got similar results on another Linux machine.
Note that I also tried varying the number of CPU threads, but even the best setting for PyTorch is still 2x slower.
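
For anyone reproducing this: the FPS numbers can be cross-checked independently of the built-in logger with plain wall-clock timing. A minimal sketch, reusing the model from the script above (the 10k-step count is just an assumption for a quick run):

import time

# Minimal wall-clock FPS check, reusing `model` from the SB3 script above.
# The logger's reported FPS should roughly match this.
n_steps = 10_000
start = time.perf_counter()
model.learn(total_timesteps=n_steps, log_interval=10)
print(f"{n_steps / (time.perf_counter() - start):.1f} FPS")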

@m-rph (Contributor) commented Jul 25, 2020

Hi,

This has been discussed in several threads; the maintainers are aware of the wall-clock time discrepancy between SB2 and SB3, and it will be addressed as we move towards v1.0.

@michalzajac-ml (Author)

I was not aware of that, thank you!
Btw, do you have any insight into where the bottleneck could be?

@m-rph (Contributor) commented Jul 25, 2020

The first bottleneck is in #112. I have some ideas about further optimisations, but I'd prefer to investigate them before drawing any conclusions.
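
For anyone who wants to poke at it in the meantime, the standard-library profiler on a short run is a quick way to locate hot spots. A minimal sketch (generic profiling, not tied to any specific bottleneck):

import cProfile
import pstats

from stable_baselines3 import SAC

# Profile a short training run to see where time is spent
# (replay sampling, forward passes, optimizer steps, ...).
model = SAC('MlpPolicy', 'Pendulum-v0', learning_starts=0)
profiler = cProfile.Profile()
profiler.enable()
model.learn(total_timesteps=2_000)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)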

@araffin (Member) commented Jul 25, 2020

Btw, do you have any insight into where the bottleneck could be?

Related issue: #90
Also note that this SAC implementation is a bit different from the one in SB2: we use two target Q-networks instead of a separate value network, to match the latest version of the original implementation.
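
For reference, a minimal sketch of that target computation (illustrative only, not SB3's actual code; all tensors are stand-ins for a replay-buffer batch, and the fixed entropy coefficient is an assumption, since SB3 tunes it automatically by default):

import torch
import torch.nn as nn

obs_dim, act_dim, batch_size = 3, 1, 256
gamma, ent_coef = 0.99, 0.2  # fixed ent_coef for illustration only

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

# Two target Q-networks instead of SB2's separate value network.
q1_target = mlp(obs_dim + act_dim, 1)
q2_target = mlp(obs_dim + act_dim, 1)

# Stand-ins for a sampled batch and the current policy's outputs.
rewards = torch.randn(batch_size, 1)
dones = torch.zeros(batch_size, 1)
next_obs = torch.randn(batch_size, obs_dim)
next_actions = torch.randn(batch_size, act_dim)  # a' ~ pi(.|s')
next_log_prob = torch.randn(batch_size, 1)       # log pi(a'|s')

with torch.no_grad():
    q_in = torch.cat([next_obs, next_actions], dim=1)
    # Clipped double-Q: take the min of the two target critics,
    # then add the entropy term in place of a value-network estimate.
    min_q = torch.min(q1_target(q_in), q2_target(q_in))
    target_q = rewards + gamma * (1 - dones) * (min_q - ent_coef * next_log_prob)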

Trying to reproduce your results on CPU (an 8-core laptop), I got similar speeds for SB2 and SB3.
I did not experience the dramatic slowdown you mention (SB2 is even slower when n_cpu_tf_sess is not the default value).

SB3 results:

import torch as th
from stable_baselines3 import SAC

th.set_num_threads(2)
th.set_num_interop_threads(2)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256]},
            learning_starts=0)
model.learn(total_timesteps=1000000, log_interval=4)

# 5k steps
# num_threads=1 55 FPS (interop=None)
# num_threads=1 53 FPS (interop=2)
# num_threads=2 58 FPS (interop=2)
# num_threads=2 51 FPS (interop=None)
# num_threads=3 48 FPS (interop=3)
# num_threads=3 47 FPS (interop=None)
# num_threads=None 52 FPS (interop=None)
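
(To double-check that the thread settings actually took effect, PyTorch exposes getters for both knobs:)

import torch as th

th.set_num_threads(2)          # intra-op parallelism
th.set_num_interop_threads(2)  # inter-op parallelism; must be set before any parallel work starts
print(th.get_num_threads(), th.get_num_interop_threads())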

SB2 results:

from stable_baselines import SAC

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256]},
            learning_starts=0,
            n_cpu_tf_sess=4)
model.learn(total_timesteps=1000000, log_interval=4)


# 5k steps
# n_cpu_tf_sess=1 32 FPS
# n_cpu_tf_sess=2 46 FPS
# n_cpu_tf_sess=4 58 FPS
# n_cpu_tf_sess=None 60 FPS

Note: I did not need the os.environ['CUDA_VISIBLE_DEVICES'] = '' because I don't have any GPU.
I will try again using colab notebook later.
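
Side note: instead of hiding the GPU with CUDA_VISIBLE_DEVICES, SB3 also accepts a device argument, which makes CPU-only benchmarking explicit:

from stable_baselines3 import SAC

# Force CPU explicitly rather than masking the GPU via the environment.
model = SAC('MlpPolicy', 'Pendulum-v0', device='cpu')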

@araffin (Member) commented Jul 25, 2020

Here is the colab notebook: https://colab.research.google.com/drive/1GWEmTGczzOZjMIZuqNBysjVPMlyGXSx1?usp=sharing

The result: "SB2 is 1.28x faster than SB3". So there is a slowdown, but not as bad as reported.

I will try with a GPU-enabled colab and a smaller network architecture (to see if there is any influence).

EDIT: I get the same slowdown magnitude on GPU and with a smaller network (SB2 is 1.2x faster)

@araffin (Member) commented Jul 30, 2020

Update: after upgrading to PyTorch 1.6, the gap is essentially closed: SB2 is only 1.02x faster than SB3.

I updated the notebook accordingly.

@Miffyli that may interest you too ;)

EDIT: apparently this holds on CPU only
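
Since the numbers are this sensitive to the PyTorch version, it is worth logging the exact build alongside any benchmark, e.g.:

import torch

print(torch.__version__)        # e.g. 1.6.0
print(torch.__config__.show())  # build info (MKL, OpenMP, ...) that affects CPU speed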

@araffin (Member) commented Aug 3, 2020

@zajaczajac I will close this issue, as the slowdown is not as bad as reported (1.02x slower on CPU and 1.2x slower on GPU, see the colab notebook) and apparently comes mostly from PyTorch.

@araffin closed this as completed Aug 3, 2020
@adam515 commented Feb 28, 2021

With PyTorch 1.7.1 this has become an issue again: I see slowdowns of over 50% with SB3 vs. SB2, for both GPU and CPU training on equivalent model definitions.

@araffin (Member) commented Feb 28, 2021

With PyTorch 1.7.1 this has become an issue again: I see slowdowns of over 50% with SB3 vs. SB2, for both GPU and CPU training on equivalent model definitions.

Using the Colab notebook (linked above), I get:
"SB2 is 1.27x faster than SB3" (CPU)

and I see similar results on my laptop (CPU).

On GPU colab:
"SB2 is 1.57x faster than SB3"

@araffin (Member) commented Mar 5, 2021

PyTorch 1.8.0 seems to improve things:
"SB2 is 1.16x faster than SB3" (CPU, on colab)

"SB2 is 1.40x faster than SB3" (GPU, on colab)

@araffin (Member) commented Mar 11, 2022

PyTorch 1.11 (with longer training for a better comparison):
"SB2 is 1.07x faster than SB3" (CPU, on colab)
"SB2 is 1.52x faster than SB3" (GPU, on colab)
