[Question] Manually Controlling Actions During PPO Training #2014

Open
4 tasks done
wayne-weiwei opened this issue Sep 25, 2024 · 2 comments
Labels
check the checklist: You have checked the required items in the checklist but you didn't do what is written...
custom gym env: Issue related to Custom Gym Env
more information needed: Please fill the issue template completely
question: Further information is requested

Comments

@wayne-weiwei

❓ Question

Thank you very much for creating such an excellent tool. I am currently using the PPO algorithm in Stable-Baselines3 (SB3) for training in a custom environment. During this process, I encountered an issue that I would appreciate your guidance on.

When I call model.learn(total_timesteps=10e6), the PPO model blocks the current thread and focuses entirely on learning, so the communication with the environment stops running during training. I would like to manually control the actions during training, similar to the following process:

action, _states = model.predict(obs)
obs, reward, terminated, truncated, info = env.step(action)

Is there a way to continue training the PPO model while allowing manual control over the action selection and keeping the environment’s communication running? Do you have any recommended solutions for this?
I greatly appreciate your time and any insights you can provide. Your work has been incredibly valuable, and I look forward to any suggestions you might have.
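
For reference, a minimal sketch of that manual loop (the environment id "CartPole-v1" is only a stand-in for the custom Webots environment, and the loop here runs after a call to .learn() rather than during it):

import gymnasium as gym

from stable_baselines3 import PPO

# "CartPole-v1" is a stand-in for the custom Webots environment.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=2_048)  # a short training call

# Manual control of the interaction loop, outside of .learn()
obs, info = env.reset()
for _ in range(200):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()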

Checklist

@wayne-weiwei wayne-weiwei added the question Further information is requested label Sep 25, 2024
@araffin araffin added the more information needed Please fill the issue template completely label Sep 25, 2024
@araffin (Member) commented Oct 4, 2024

Hello,
This is hard to answer without a minimal example that reproduces the behavior.
.learn() does two things (see the docs): it collects data and then trains the model (no data is collected while the model is being updated, which might be what you are seeing).
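
For reference, a minimal sketch of that collect/train split: a callback can observe each environment step that .learn() performs while collecting rollouts. This assumes SB3's BaseCallback API; the class name ActionLogger and the exact keys available in self.locals are assumptions:

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class ActionLogger(BaseCallback):
    """Hypothetical callback: called once per environment step during rollout collection."""

    def _on_step(self) -> bool:
        # self.locals exposes the local variables of the rollout loop,
        # e.g. the actions just sent to the env and the rewards returned.
        actions = self.locals.get("actions")
        rewards = self.locals.get("rewards")
        print(f"step {self.num_timesteps}: actions={actions}, rewards={rewards}")
        return True  # returning False would stop training early

model = PPO("MlpPolicy", "CartPole-v1", n_steps=64, verbose=0)
model.learn(total_timesteps=256, callback=ActionLogger())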

@wayne-weiwei (Author)

Thank you for the reply. When I set up a custom gym environment in Webots and used the following code for training:

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = Customer()  # custom Webots environment
check_env(env)

# Train
model = PPO('MlpPolicy', env, n_steps=2048, verbose=1)
model.learn(total_timesteps=10)

The algorithm did run, but it did not behave correctly in the Webots environment: the actions remained the same and the reward never changed, yet the run appeared to finish normally once the training step completed. I'm wondering whether I need to modify the learning process or whether there is something I missed in the environment setup.
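
For reference, a minimal skeleton of the Gymnasium interface such a custom environment is expected to implement (the class name Customer is taken from the snippet above; the Webots-specific parts are placeholders, not actual Webots API calls). A constant observation and unchanging reward often mean that step() never applies the action to the simulation or never recomputes the sensor readings:

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class Customer(gym.Env):
    """Illustrative skeleton only; the Webots-specific calls are placeholders."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self._state = np.zeros(4, dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Placeholder: reset the Webots simulation here and read the initial sensors.
        self._state = np.zeros(4, dtype=np.float32)
        return self._state, {}

    def step(self, action):
        # Placeholder: send `action` to the robot, advance the Webots simulation,
        # then read the new sensor values. If this part is missing, the observation
        # and reward never change.
        self._state = np.clip(self._state + 0.01 * np.resize(action, 4), -1.0, 1.0).astype(np.float32)
        reward = float(-np.linalg.norm(self._state))
        terminated = False
        truncated = False
        return self._state, reward, terminated, truncated, {}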

@araffin araffin added custom gym env Issue related to Custom Gym Env check the checklist You have checked the required items in the checklist but you didn't do what is written... labels Oct 5, 2024