System Info
transformers version: 4.45.2

Who can help?
Trainer: @muellerzr @SunMarc
Currently, the _save_checkpoint() method saves the model, then (optionally) the optimizer, and finally the Trainer state.

The resume-from-checkpoint path gets the checkpoint directory from the get_last_checkpoint function and then loads the model and trainer state.

If training is stopped (or ends abruptly) in the middle of checkpointing, the checkpoint directory (checkpoint-xx) is created but some of its files are missing. Auto-resume still picks that directory for resuming, but loading its files can then throw an error. For example, if the trainer state was not yet written, TrainerState.load_from_json raises a FileNotFoundError and training cannot resume. The last checkpoint directory has to be deleted manually so that the second-to-last one is used (a PyTorchJob, for instance, will automatically restart the pod after a failure, but because of this issue it cannot resume from the failure without manual intervention).

We expect resume from checkpoint to pick the latest correct/complete checkpoint directory instead of throwing an error.
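For context, get_last_checkpoint (in transformers.trainer_utils) chooses the resume directory purely by the step number in its name; the logic is roughly the sketch below, which is why a half-written checkpoint-xx directory is still selected:

```python
import os
import re

PREFIX_CHECKPOINT_DIR = "checkpoint"
_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"\-(\d+)$")

def get_last_checkpoint(folder):
    # Pick the checkpoint-<step> subdirectory with the highest step number.
    # Only the directory name is inspected, not its contents, so a partially
    # written checkpoint is returned just like a complete one.
    checkpoints = [
        path
        for path in os.listdir(folder)
        if _re_checkpoint.search(path) is not None and os.path.isdir(os.path.join(folder, path))
    ]
    if len(checkpoints) == 0:
        return None
    return os.path.join(folder, max(checkpoints, key=lambda x: int(_re_checkpoint.search(x).group(1))))
```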
Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
Run accelerate launch (this is a sample command with a small run time and a high checkpointing time):
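The exact command is not captured above; a minimal script of the same shape (model, dataset, and hyperparameters below are illustrative placeholders, not the original ones) reproduces the failure when launched directly or via accelerate launch:

```python
import os

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from transformers.trainer_utils import get_last_checkpoint

model_name = "distilbert-base-uncased"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Illustrative small dataset so the run finishes quickly.
ds = load_dataset("glue", "sst2", split="train[:1%]")
ds = ds.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(
    output_dir="output_dir",
    max_steps=200,               # short run
    save_strategy="steps",
    save_steps=25,               # checkpoint often so an interrupt is easy to hit mid-write
    per_device_train_batch_size=8,
)

trainer = Trainer(model=model, args=args, train_dataset=ds)

# Auto-resume pattern: returns None on the first run, the latest checkpoint-<step> dir afterwards.
last_ckpt = get_last_checkpoint("output_dir") if os.path.isdir("output_dir") else None

# 1st run: press Ctrl-C while a checkpoint-* directory is being written.
# 2nd run: this call fails with FileNotFoundError if trainer_state.json was not yet written.
trainer.train(resume_from_checkpoint=last_ckpt)
```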
When the logs show that it is writing the checkpoint, end the process with Ctrl-C. Then run the same command again; it will try to resume from the checkpoint and will throw an error such as FileNotFoundError: [Errno 2] No such file or directory: 'output_dir/checkpoint-25/trainer_state.json', depending on which file is missing.

Expected behavior
If the last checkpoint is incomplete or not fully written, we expect training to resume from the previous checkpoint instead of throwing an error.

I have raised a PR with a fix, which checks that the model files and the trainer state are present before choosing the directory to resume from.
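As a sketch of the idea (not the actual code in the PR; the helper name and the exact file list are illustrative), the resume path could skip checkpoint directories that are missing the trainer state or the model weights:

```python
import os
import re

_re_checkpoint = re.compile(r"^checkpoint\-(\d+)$")

# Files whose presence we treat as "checkpoint is complete" (illustrative choice):
# the trainer state plus at least one recognised weights file or shard index.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin",
                "model.safetensors.index.json", "pytorch_model.bin.index.json")

def _is_complete(ckpt_dir):
    has_state = os.path.isfile(os.path.join(ckpt_dir, "trainer_state.json"))
    has_weights = any(os.path.isfile(os.path.join(ckpt_dir, f)) for f in WEIGHT_FILES)
    return has_state and has_weights

def get_last_complete_checkpoint(folder):
    # Hypothetical helper: like get_last_checkpoint, but walk the checkpoint-<step>
    # directories from newest to oldest and return the first complete one.
    checkpoints = [d for d in os.listdir(folder)
                   if _re_checkpoint.match(d) and os.path.isdir(os.path.join(folder, d))]
    for name in sorted(checkpoints, key=lambda d: int(_re_checkpoint.match(d).group(1)), reverse=True):
        candidate = os.path.join(folder, name)
        if _is_complete(candidate):
            return candidate
    return None
```

With something like this, an interrupted save simply falls back to the previous checkpoint instead of crashing the resumed job.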