SAC implementation is 2x slower than in stable-baselines #122

Closed
michalzajac-ml opened this issue Jul 24, 2020 · 11 comments

Comments

@michalzajac-ml commented Jul 24, 2020

Hello,
First of all, thanks for working on this awesome project!
I've tried the SAC implementation and noticed that it runs much slower than the TF1 version in stable-baselines.
Here is the code for the minimal stable-baselines3 example:

import os

import gym
import torch
from stable_baselines3 import SAC
from stable_baselines3.sac.policies import MlpPolicy

os.environ['CUDA_VISIBLE_DEVICES'] = ''

torch.set_num_threads(2)

env = gym.make('Pendulum-v0')

model = SAC(MlpPolicy, env, verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256],
                           'activation_fn': torch.nn.ReLU})
model.learn(total_timesteps=1000000, log_interval=10)

Here is the corresponding stable-baselines (TF1) example:

import os

import gym
import tensorflow as tf
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

os.environ['CUDA_VISIBLE_DEVICES'] = ''

env = gym.make('Pendulum-v0')

model = SAC(MlpPolicy, env, verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256], 'act_fun': tf.nn.relu},
            n_cpu_tf_sess=2)
model.learn(total_timesteps=1000000, log_interval=10)

I set the same architecture, number of updates, and batch size, so all relevant settings appear identical. However, the PyTorch version gives ~45 FPS, while the TF1 version gives ~90 FPS.

System Info
Libraries are installed from pip: the newest stable-baselines and stable-baselines3, PyTorch 1.5.1, TensorFlow 1.15.0. Everything runs on CPU. The numbers above are from a MacBook Pro; I got similar results on another Linux machine.
Note that I also tried varying the number of CPU threads, but even the best setting for PyTorch is still 2x slower.
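
For anyone reproducing this: the FPS numbers can be cross-checked independently of the built-in logger with plain wall-clock timing. A minimal sketch, reusing the model from the script above (the 10k-step count is just an assumption for a quick run):

import time

# Minimal wall-clock FPS check, reusing `model` from the SB3 script above.
# The logger's reported FPS should roughly match this.
n_steps = 10_000
start = time.perf_counter()
model.learn(total_timesteps=n_steps, log_interval=10)
print(f"{n_steps / (time.perf_counter() - start):.1f} FPS")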

@m-rph (Contributor) commented Jul 25, 2020

Hi,

This has been discussed in several threads; the maintainers are aware of the wall-clock time discrepancy between SB2 and SB3, and it will be addressed as we move towards v1.0.

@michalzajac-ml (Author)

I was not aware of that, thank you!
Btw, do you have any insight into where the bottleneck could be?

@m-rph (Contributor) commented Jul 25, 2020

The first bottleneck is in #112. I have some ideas about further optimisations, but I'd prefer to investigate them before drawing any conclusions.
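
For anyone who wants to poke at it in the meantime, the standard-library profiler on a short run is a quick way to locate hot spots. A minimal sketch (generic profiling, not tied to any specific bottleneck):

import cProfile
import pstats

from stable_baselines3 import SAC

# Profile a short training run to see where time is spent
# (replay sampling, forward passes, optimizer steps, ...).
model = SAC('MlpPolicy', 'Pendulum-v0', learning_starts=0)
profiler = cProfile.Profile()
profiler.enable()
model.learn(total_timesteps=2_000)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)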

@araffin (Member) commented Jul 25, 2020

Btw, do you have any insight into where the bottleneck could be?

Related issue: #90
Also note that this SAC implementation is a bit different from the one in SB2: we use two target Q-networks instead of a separate value network, to match the latest version of the original implementation.
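
For reference, a minimal sketch of that target computation (illustrative only, not SB3's actual code; all tensors are stand-ins for a replay-buffer batch, and the fixed entropy coefficient is an assumption, since SB3 tunes it automatically by default):

import torch
import torch.nn as nn

obs_dim, act_dim, batch_size = 3, 1, 256
gamma, ent_coef = 0.99, 0.2  # fixed ent_coef for illustration only

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

# Two target Q-networks instead of SB2's separate value network.
q1_target = mlp(obs_dim + act_dim, 1)
q2_target = mlp(obs_dim + act_dim, 1)

# Stand-ins for a sampled batch and the current policy's outputs.
rewards = torch.randn(batch_size, 1)
dones = torch.zeros(batch_size, 1)
next_obs = torch.randn(batch_size, obs_dim)
next_actions = torch.randn(batch_size, act_dim)  # a' ~ pi(.|s')
next_log_prob = torch.randn(batch_size, 1)       # log pi(a'|s')

with torch.no_grad():
    q_in = torch.cat([next_obs, next_actions], dim=1)
    # Clipped double-Q: take the min of the two target critics,
    # then add the entropy term in place of a value-network estimate.
    min_q = torch.min(q1_target(q_in), q2_target(q_in))
    target_q = rewards + gamma * (1 - dones) * (min_q - ent_coef * next_log_prob)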

Trying to reproduce your results on CPU (an 8-core laptop), I got similar speeds for SB2 and SB3.
I did not experience the dramatic slowdown you mention (SB2 is even slower when n_cpu_tf_sess is not the default value).

SB3 results:

import torch as th
from stable_baselines3 import SAC

th.set_num_threads(2)
th.set_num_interop_threads(2)

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'net_arch': [256, 256]},
            learning_starts=0)
model.learn(total_timesteps=1000000, log_interval=4)

# 5k steps
# num_threads=1 55 FPS (interop=None)
# num_threads=1 53 FPS (interop=2)
# num_threads=2 58 FPS (interop=2)
# num_threads=2 51 FPS (interop=None)
# num_threads=3 48 FPS (interop=3)
# num_threads=3 47 FPS (interop=None)
# num_threads=None 52 FPS (interop=None)
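
(To double-check that the thread settings actually took effect, PyTorch exposes getters for both knobs:)

import torch as th

th.set_num_threads(2)          # intra-op parallelism
th.set_num_interop_threads(2)  # inter-op parallelism; must be set before any parallel work starts
print(th.get_num_threads(), th.get_num_interop_threads())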

SB2 results:

from stable_baselines import SAC

model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            policy_kwargs={'layers': [256, 256]},
            learning_starts=0,
            n_cpu_tf_sess=4)
model.learn(total_timesteps=1000000, log_interval=4)


# 5k steps
# n_cpu_tf_sess=1 32 FPS
# n_cpu_tf_sess=2 46 FPS
# n_cpu_tf_sess=4 58 FPS
# n_cpu_tf_sess=None 60 FPS

Note: I did not need the os.environ['CUDA_VISIBLE_DEVICES'] = '' because I don't have any GPU.
I will try again using colab notebook later.
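
Side note: instead of hiding the GPU with CUDA_VISIBLE_DEVICES, SB3 also accepts a device argument, which makes CPU-only benchmarking explicit:

from stable_baselines3 import SAC

# Force CPU explicitly rather than masking the GPU via the environment.
model = SAC('MlpPolicy', 'Pendulum-v0', device='cpu')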

@araffin (Member) commented Jul 25, 2020

Here is the colab notebook: https://colab.research.google.com/drive/1GWEmTGczzOZjMIZuqNBysjVPMlyGXSx1?usp=sharing

The result: "SB2 is 1.28x faster than SB3". So there is a slowdown, but not as bad as reported.

I will try with a GPU-enabled colab and a smaller network architecture (to see if there is any influence).

EDIT: I get the same slowdown magnitude on GPU and with a smaller network (SB2 is 1.2x faster)

@araffin (Member) commented Jul 30, 2020

Update: after upgrading to PyTorch 1.6, the gap is essentially closed: SB2 is only 1.02x faster than SB3.

I updated the notebook accordingly.

@Miffyli that may interest you too ;)

EDIT: apparently this holds on CPU only
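
Since the numbers are this sensitive to the PyTorch version, it is worth logging the exact build alongside any benchmark, e.g.:

import torch

print(torch.__version__)        # e.g. 1.6.0
print(torch.__config__.show())  # build info (MKL, OpenMP, ...) that affects CPU speed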

@araffin (Member) commented Aug 3, 2020

@zajaczajac I will close this issue, as the slowdown is not as bad as reported (1.02x slower on CPU and 1.2x slower on GPU, see the colab notebook) and apparently comes mostly from PyTorch.

@araffin closed this as completed Aug 3, 2020
@adam515 commented Feb 28, 2021

With PyTorch 1.7.1 this has become an issue again: I see slowdowns of over 50% with SB3 vs. SB2, for both GPU and CPU training on equivalent model definitions.

@araffin (Member) commented Feb 28, 2021

With PyTorch 1.7.1 this has become an issue again: I see slowdowns of over 50% with SB3 vs. SB2, for both GPU and CPU training on equivalent model definitions.

Using the Colab notebook (linked above), I get:
"SB2 is 1.27x faster than SB3" (CPU)

and I see similar results on my laptop (CPU).

On GPU colab:
"SB2 is 1.57x faster than SB3"

@araffin (Member) commented Mar 5, 2021

PyTorch 1.8.0 seems to improve things:
"SB2 is 1.16x faster than SB3" (CPU, on colab)

"SB2 is 1.40x faster than SB3" (GPU, on colab)

@araffin (Member) commented Mar 11, 2022

PyTorch 1.11 (with longer training for a better comparison):
"SB2 is 1.07x faster than SB3" (CPU, on colab)
"SB2 is 1.52x faster than SB3" (GPU, on colab)
