[Question] Manually Controlling Actions During PPO Training #2014

Open
4 tasks done
wayne-weiwei opened this issue Sep 25, 2024 · 2 comments
Labels
check the checklist: You have checked the required items in the checklist but you didn't do what is written...
custom gym env: Issue related to Custom Gym Env
more information needed: Please fill the issue template completely
question: Further information is requested

Comments

@wayne-weiwei

❓ Question

Thank you very much for creating such an excellent tool. I am currently using the PPO algorithm in Stable-Baselines3 (SB3) for training in a custom environment. During this process, I encountered an issue that I would appreciate your guidance on.

When I call model.learn(total_timesteps=10e6), the PPO model blocks the current thread and focuses entirely on learning, so the communication with the environment stops running during training. I would like to manually control the actions during training, similar to the following process:

action, _states = model.predict(obs)
obs, reward, terminated, truncated, info = env.step(action)

Is there a way to continue training the PPO model while allowing manual control over the action selection and keeping the environment’s communication running? Do you have any recommended solutions for this?
I greatly appreciate your time and any insights you can provide. Your work has been incredibly valuable, and I look forward to any suggestions you might have.
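
For reference, a minimal sketch of that manual loop (the environment id "CartPole-v1" is only a stand-in for the custom Webots environment, and the loop here runs after a call to .learn() rather than during it):

import gymnasium as gym

from stable_baselines3 import PPO

# "CartPole-v1" is a stand-in for the custom Webots environment.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=2_048)  # a short training call

# Manual control of the interaction loop, outside of .learn()
obs, info = env.reset()
for _ in range(200):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()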

Checklist

@wayne-weiwei wayne-weiwei added the question Further information is requested label Sep 25, 2024
@araffin araffin added the more information needed Please fill the issue template completely label Sep 25, 2024
@araffin (Member) commented Oct 4, 2024

Hello,
This is hard to answer without a minimal example that reproduces the behavior.
.learn() does two things (see the docs): it collects data and then trains the model (no data is collected while the model is being updated, which might be what you are seeing).
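
For reference, a minimal sketch of that collect/train split: a callback can observe each environment step that .learn() performs while collecting rollouts. This assumes SB3's BaseCallback API; the class name ActionLogger and the exact keys available in self.locals are assumptions:

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class ActionLogger(BaseCallback):
    """Hypothetical callback: called once per environment step during rollout collection."""

    def _on_step(self) -> bool:
        # self.locals exposes the local variables of the rollout loop,
        # e.g. the actions just sent to the env and the rewards returned.
        actions = self.locals.get("actions")
        rewards = self.locals.get("rewards")
        print(f"step {self.num_timesteps}: actions={actions}, rewards={rewards}")
        return True  # returning False would stop training early

model = PPO("MlpPolicy", "CartPole-v1", n_steps=64, verbose=0)
model.learn(total_timesteps=256, callback=ActionLogger())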

@wayne-weiwei (Author)

Thank you for the reply. When I set up a custom gym environment in Webots and used the following code for training:

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = Customer()  # custom Webots environment
check_env(env)

# Train
model = PPO('MlpPolicy', env, n_steps=2048, verbose=1)
model.learn(total_timesteps=10)

The algorithm did run, but it did not behave correctly in the Webots environment: the actions remained the same and the reward never changed, yet the run appeared to finish normally once the training step completed. I'm wondering whether I need to modify the learning process or whether there is something I missed in the environment setup.
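
For reference, a minimal skeleton of the Gymnasium interface such a custom environment is expected to implement (the class name Customer is taken from the snippet above; the Webots-specific parts are placeholders, not actual Webots API calls). A constant observation and unchanging reward often mean that step() never applies the action to the simulation or never recomputes the sensor readings:

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class Customer(gym.Env):
    """Illustrative skeleton only; the Webots-specific calls are placeholders."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self._state = np.zeros(4, dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Placeholder: reset the Webots simulation here and read the initial sensors.
        self._state = np.zeros(4, dtype=np.float32)
        return self._state, {}

    def step(self, action):
        # Placeholder: send `action` to the robot, advance the Webots simulation,
        # then read the new sensor values. If this part is missing, the observation
        # and reward never change.
        self._state = np.clip(self._state + 0.01 * np.resize(action, 4), -1.0, 1.0).astype(np.float32)
        reward = float(-np.linalg.norm(self._state))
        terminated = False
        truncated = False
        return self._state, reward, terminated, truncated, {}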

@araffin araffin added custom gym env Issue related to Custom Gym Env check the checklist You have checked the required items in the checklist but you didn't do what is written... labels Oct 5, 2024