
Possibility to resume training #692

Closed
a-z-e-r-i-l-a opened this issue Feb 17, 2020 · 9 comments
Labels
question Further information is requested

Comments

@a-z-e-r-i-l-a

a-z-e-r-i-l-a commented Feb 17, 2020

Is there a way to resume training, for example if our PC crashes or we face memory issues?

Could saving the "model" object as a pickle file in every step and using the "learn" function a way to resume a training? (if so, can I make a pull request for it)

@Miffyli
Collaborator

Miffyli commented Feb 17, 2020

I am not sure if I follow here. Yes, you can save models at any point of training (via callbacks), load models and resume training, as shown in this example.
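The save / checkpoint / resume loop can be sketched in plain Python. Everything below (`ToyModel`, `checkpoint_callback`, the fake 0.1 "gradient step") is a hypothetical stand-in for illustration, not stable-baselines code:

```python
import os
import pickle
import tempfile


class ToyModel:
    """Hypothetical stand-in for an RL model with save/load/learn."""

    def __init__(self, weight=0.0, steps_done=0):
        self.weight = weight
        self.steps_done = steps_done

    def learn(self, total_timesteps, callback=None):
        for _ in range(total_timesteps):
            self.weight += 0.1          # pretend gradient step
            self.steps_done += 1
            if callback is not None:
                callback(self)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump({"weight": self.weight, "steps_done": self.steps_done}, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            return cls(**pickle.load(f))


checkpoint_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")


def checkpoint_callback(model, save_freq=100):
    # Save every `save_freq` steps, the way a periodic checkpoint callback would.
    if model.steps_done % save_freq == 0:
        model.save(checkpoint_path)


model = ToyModel()
model.learn(total_timesteps=250, callback=checkpoint_callback)  # last save at step 200

# Simulate a crash: reload the most recent checkpoint and keep training.
resumed = ToyModel.load(checkpoint_path)
resumed.learn(total_timesteps=50)  # resumed.steps_done is now 250
```

In stable-baselines itself the same shape is `model.save(path)` inside a callback during `learn()`, then `Model.load(path)` followed by another `learn()` call, as in the linked example.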

@Miffyli Miffyli added the question Further information is requested label Feb 17, 2020
@araffin araffin added the RTFM Answer is the documentation label Feb 17, 2020
@a-z-e-r-i-l-a
Author

So I figured out that with the callback function we can save the model parameters; I was just not completely sure whether training would resume exactly as if it had never been stopped in the first place. I resume training a model that was saved with the "save" function, like this:

model = SAC.load("sac_model")
model.set_env(some_physics_based_gym_environment())  # set_env is the supported way to attach an env
model.learn(total_timesteps=50000, callback=resume_callback)

Does this continue the training exactly as if it had not been stopped at the point where it was saved?
And my final question: does the tensorboard log also continue as it was, or does it get reinitialized without showing the history of the earlier training?

thanks.

@Miffyli
Collaborator

Miffyli commented Feb 17, 2020

Ah yes, this is a valid question.

Answer is no, it does not continue exactly as without saving and loading. Most notably, optimizer parameters are not stored along with the model, and schedulers for learning rates and such start from zero again upon a new call to learn.
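Why losing the optimizer state matters can be shown with a toy optimizer in plain Python (a minimal sketch, not stable-baselines code): resuming SGD-with-momentum from the parameters alone drifts away from the uninterrupted run, while resuming with the momentum buffer reproduces it exactly.

```python
# Toy illustration: SGD with momentum minimizing f(w) = w^2.

def sgd_momentum_steps(w, v, n, lr=0.1, beta=0.9):
    for _ in range(n):
        grad = 2.0 * w
        v = beta * v + grad   # momentum buffer
        w = w - lr * v
    return w, v


# Uninterrupted: 20 steps in one go.
w_full, v_full = sgd_momentum_steps(w=5.0, v=0.0, n=20)

# Checkpoint after 10 steps.
w_ckpt, v_ckpt = sgd_momentum_steps(w=5.0, v=0.0, n=10)

# Resume with the full state (params + optimizer buffer): identical result.
w_resume_full, _ = sgd_momentum_steps(w=w_ckpt, v=v_ckpt, n=10)

# Resume with parameters only (momentum reset to zero): a different trajectory.
w_resume_params, _ = sgd_momentum_steps(w=w_ckpt, v=0.0, n=10)

print(w_full == w_resume_full)    # True: exact continuation
print(w_full == w_resume_params)  # False: optimizer state was lost
```

The same effect applies to learning-rate schedules: if the schedule restarts from step zero on the second `learn` call, the continued run takes different step sizes than the uninterrupted one.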

As for Tensorboard being updated: I have not tried this, but others seem to have issues with it (e.g. #599). I am not sure how the code is supposed to function in this case when you re-use the same log name.

@a-z-e-r-i-l-a
Author

a-z-e-r-i-l-a commented Feb 17, 2020

I think the only way is to save the complete model object, right? Then, probably with some changes to the "learn" function, one could resume a previous learning process. Do you think this could be a pull request to make?

Edit:
Saving the complete model object doesn't seem very easy with pickle, it gives this error:
TypeError: can't pickle _thread.lock objects
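This error is easy to reproduce with any object that holds a `threading.Lock`. A common workaround, in the same spirit as serialization code that picks specific variables to store, is to define `__getstate__`/`__setstate__` so unpicklable members are dropped on save and recreated on load. `Trainer` below is a hypothetical example class, not from stable-baselines:

```python
import pickle
import threading


class Trainer:
    """Minimal reproduction: an object holding a lock is not picklable as-is."""

    def __init__(self):
        self.params = {"lr": 3e-4}
        self.lock = threading.Lock()    # un-picklable member

    # Workaround: serialize only the picklable state...
    def __getstate__(self):
        state = self.__dict__.copy()
        del state["lock"]
        return state

    # ...and recreate the rest on load.
    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.Lock()


# With __getstate__/__setstate__ the round trip works:
restored = pickle.loads(pickle.dumps(Trainer()))
print(restored.params)  # {'lr': 0.0003}

# Without them, pickling a raw lock fails like the traceback above:
try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    print(exc)  # wording varies by Python version, e.g. "cannot pickle '_thread.lock' object"
```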

@Miffyli
Collaborator

Miffyli commented Feb 17, 2020

It would not be as easy, as models contain a bunch of un-picklable objects, and Tensorflow variables are not included in the pickling process by default. Also, as mentioned earlier, the learn method does some initializations upon every call to it, which could also cause differences, not to mention all the non-determinism that could spring up.

We could design the next version of stable-baselines to support this "continue as if it was never stopped" behavior, where it should be easier with eager-style computation and graphs.

Edit: Yup, pickling or serialization in general picks specific variables to store for this reason.

@araffin
Collaborator

araffin commented Feb 17, 2020

Related #301

@araffin araffin removed the RTFM Answer is the documentation label Feb 17, 2020
@yjc765

yjc765 commented Feb 20, 2020

This example may help you.

@araffin
Collaborator

araffin commented Mar 7, 2020

Closing this one in favor of #301.

@rambo1111

#1192
