Auto-resume from checkpoint throws error if last checkpoint is incomplete #35782

Open
SilverSoldier opened this issue Jan 20, 2025 · 0 comments · May be fixed by #35580

System Info

  • transformers version: 4.45.2
  • Platform: Linux-5.14.0-284.73.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: FSDP
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

Trainer: @muellerzr @SunMarc

Currently, the _save_checkpoint() method saves the model, optionally the optimizer, and finally the Trainer state.

When resuming, the Trainer gets the checkpoint directory from the get_last_checkpoint function and loads the model and trainer state from it.

If training was stopped (or ended abruptly) in the middle of checkpointing, the checkpoint directory (checkpoint-xx) is created but some of its files are missing. Auto-resume still picks that directory, and loading its files can then throw an error. For example, if the trainer state was not yet written, the TrainerState.load_from_json call raises a FileNotFoundError and training cannot resume. The last directory has to be deleted manually so that the second-to-last one is used (a PytorchJob, for instance, will automatically restart the pod after a failure, but because of this issue it cannot resume on its own and needs manual intervention).

We expect resume from checkpoint to pick the most recent complete checkpoint directory instead of throwing an error.
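
To make the failure concrete, here is a minimal sketch of the resume path as I understand it (the output_dir name is a placeholder matching the reproduction command below):

```python
# Minimal sketch of the failure, assuming output_dir already contains
# checkpoint-* directories and the newest one was interrupted mid-write.
import os

from transformers import TrainerState
from transformers.trainer_utils import get_last_checkpoint

output_dir = "output_dir"  # same --output_dir as in the training command

# get_last_checkpoint() only looks at directory names (checkpoint-<step>),
# so it happily returns a partially written checkpoint.
last_checkpoint = get_last_checkpoint(output_dir)
print("auto-resume would use:", last_checkpoint)

# This mirrors what the Trainer does on resume; if trainer_state.json was not
# flushed before the interruption, this raises FileNotFoundError.
state = TrainerState.load_from_json(os.path.join(last_checkpoint, "trainer_state.json"))
```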

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run accelerate launch (this is a sample command with a short run time and a relatively long checkpointing phase, which makes it easy to interrupt mid-checkpoint):

accelerate launch --use_fsdp --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP --fsdp_forward_prefetch=false --fsdp_offload_params=false --fsdp_sharding_strategy=FULL_SHARD --fsdp_state_dict_type=FULL_STATE_DICT --fsdp_cpu_ram_efficient_loading=true --fsdp_sync_module_states=true --rdzv_backend=static --same_network --num_processes=4 --num_machines=${WORLD_SIZE} --mixed_precision=no --dynamo_backend=no --machine_rank=${RANK} --main_process_ip=${MASTER_ADDR} --main_process_port=${MASTER_PORT} -m tuning.sft_trainer --model_name_or_path bigscience/bloom-560m --training_data_path input.json --output_dir output_dir --packing false --response_template '\n### Response:' --dataset_text_field output --num_train_epochs 6.0 --max_seq_length 4096 --per_device_train_batch_size 30 --save_strategy epoch --logging_steps 1 --learning_rate 1e-5 --use_flash_attn false --validation_data_path validation.json --metric_for_best_model "loss" --load_best_model_at_end True --logging_strategy "steps" --per_device_eval_batch_size 10 --evaluation_strategy "epoch"

When the logs show that a checkpoint is being written, end the process with Ctrl-C. Then run the same command again; it will try to resume from the checkpoint and throw an error such as FileNotFoundError: [Errno 2] No such file or directory: 'output_dir/checkpoint-25/trainer_state.json', depending on which file is missing.
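
As a stop-gap, deleting the incomplete checkpoint before relaunching makes auto-resume fall back to the previous one. A rough sketch of that manual step (checking only for trainer_state.json here, which is just one possible heuristic for "incomplete"):

```python
# Stop-gap only: drop the newest checkpoint if it is missing trainer_state.json,
# so that get_last_checkpoint() falls back to the previous, complete checkpoint.
import os
import re
import shutil

output_dir = "output_dir"
checkpoints = sorted(
    (d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)),
    key=lambda d: int(d.split("-")[1]),
)
if checkpoints:
    newest = os.path.join(output_dir, checkpoints[-1])
    if not os.path.isfile(os.path.join(newest, "trainer_state.json")):
        shutil.rmtree(newest)  # the manual intervention described above
```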

Expected behavior

If the last checkpoint is incomplete or not fully written, we expect training to resume from the previous checkpoint instead of throwing an error.

I have raised a PR (#35580) with a fix that checks whether the model files and the trainer state are present before choosing the checkpoint directory to resume from.
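
The exact check is in the linked PR; the general idea, sketched here with hypothetical helper names and an assumed list of weight file names, is roughly:

```python
# Illustrative sketch only; the helper names and the weight-file list below are
# assumptions, not the actual implementation in #35580.
import os
import re

WEIGHT_FILES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
)

def is_complete_checkpoint(checkpoint_dir):
    # A checkpoint is only usable if the trainer state and some model weights exist.
    has_state = os.path.isfile(os.path.join(checkpoint_dir, "trainer_state.json"))
    has_weights = any(os.path.isfile(os.path.join(checkpoint_dir, f)) for f in WEIGHT_FILES)
    return has_state and has_weights

def get_last_complete_checkpoint(output_dir):
    # Walk checkpoint-<step> directories from newest to oldest and return the
    # first one that passes the completeness check, or None if none do.
    candidates = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    for name in sorted(candidates, key=lambda d: int(d.split("-")[1]), reverse=True):
        path = os.path.join(output_dir, name)
        if is_complete_checkpoint(path):
            return path
    return None
```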
