Auto-resume from checkpoint throws error if last checkpoint is incomplete #35782

Open
SilverSoldier opened this issue Jan 20, 2025 · 0 comments · May be fixed by #35580

System Info

  • transformers version: 4.45.2
  • Platform: Linux-5.14.0-284.73.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: FSDP
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

Trainer: @muellerzr @SunMarc

Currently, the _save_checkpoint() method saves the model, optionally the optimizer, and finally the Trainer state.

When resuming, the Trainer gets the checkpoint directory from the get_last_checkpoint function and loads the model and trainer state from it.

If training was stopped (or ended abruptly) in the middle of checkpointing, the checkpoint directory (checkpoint-xx) is created but some of its files are missing. Auto-resume still picks that directory, and loading its files can then throw an error. For example, if the trainer state was not yet written, the TrainerState.load_from_json call raises a FileNotFoundError and training cannot resume. The last directory has to be deleted manually so that the second-to-last one is used (a PytorchJob, for instance, will automatically restart the pod after a failure, but because of this issue it cannot resume on its own and needs manual intervention).

We expect resume from checkpoint to pick the most recent complete checkpoint directory instead of throwing an error.
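
To make the failure concrete, here is a minimal sketch of the resume path as I understand it (the output_dir name is a placeholder matching the reproduction command below):

```python
# Minimal sketch of the failure, assuming output_dir already contains
# checkpoint-* directories and the newest one was interrupted mid-write.
import os

from transformers import TrainerState
from transformers.trainer_utils import get_last_checkpoint

output_dir = "output_dir"  # same --output_dir as in the training command

# get_last_checkpoint() only looks at directory names (checkpoint-<step>),
# so it happily returns a partially written checkpoint.
last_checkpoint = get_last_checkpoint(output_dir)
print("auto-resume would use:", last_checkpoint)

# This mirrors what the Trainer does on resume; if trainer_state.json was not
# flushed before the interruption, this raises FileNotFoundError.
state = TrainerState.load_from_json(os.path.join(last_checkpoint, "trainer_state.json"))
```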

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run accelerate launch (this is a sample command with a short run time and a relatively long checkpointing phase, which makes it easy to interrupt mid-checkpoint):

accelerate launch --use_fsdp --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP --fsdp_forward_prefetch=false --fsdp_offload_params=false --fsdp_sharding_strategy=FULL_SHARD --fsdp_state_dict_type=FULL_STATE_DICT --fsdp_cpu_ram_efficient_loading=true --fsdp_sync_module_states=true --rdzv_backend=static --same_network --num_processes=4 --num_machines=${WORLD_SIZE} --mixed_precision=no --dynamo_backend=no --machine_rank=${RANK} --main_process_ip=${MASTER_ADDR} --main_process_port=${MASTER_PORT} -m tuning.sft_trainer --model_name_or_path bigscience/bloom-560m --training_data_path input.json --output_dir output_dir --packing false --response_template '\n### Response:' --dataset_text_field output --num_train_epochs 6.0 --max_seq_length 4096 --per_device_train_batch_size 30 --save_strategy epoch --logging_steps 1 --learning_rate 1e-5 --use_flash_attn false --validation_data_path validation.json --metric_for_best_model "loss" --load_best_model_at_end True --logging_strategy "steps" --per_device_eval_batch_size 10 --evaluation_strategy "epoch"

When the logs show that a checkpoint is being written, end the process with Ctrl-C. Then run the same command again; it will try to resume from the checkpoint and throw an error such as FileNotFoundError: [Errno 2] No such file or directory: 'output_dir/checkpoint-25/trainer_state.json', depending on which file is missing.
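
As a stop-gap, deleting the incomplete checkpoint before relaunching makes auto-resume fall back to the previous one. A rough sketch of that manual step (checking only for trainer_state.json here, which is just one possible heuristic for "incomplete"):

```python
# Stop-gap only: drop the newest checkpoint if it is missing trainer_state.json,
# so that get_last_checkpoint() falls back to the previous, complete checkpoint.
import os
import re
import shutil

output_dir = "output_dir"
checkpoints = sorted(
    (d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)),
    key=lambda d: int(d.split("-")[1]),
)
if checkpoints:
    newest = os.path.join(output_dir, checkpoints[-1])
    if not os.path.isfile(os.path.join(newest, "trainer_state.json")):
        shutil.rmtree(newest)  # the manual intervention described above
```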

Expected behavior

If the last checkpoint is incomplete or not fully written, we expect training to resume from the previous checkpoint instead of throwing an error.

I have raised a PR (#35580) with a fix that checks whether the model files and the trainer state are present before choosing the checkpoint directory to resume from.
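
The exact check is in the linked PR; the general idea, sketched here with hypothetical helper names and an assumed list of weight file names, is roughly:

```python
# Illustrative sketch only; the helper names and the weight-file list below are
# assumptions, not the actual implementation in #35580.
import os
import re

WEIGHT_FILES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
)

def is_complete_checkpoint(checkpoint_dir):
    # A checkpoint is only usable if the trainer state and some model weights exist.
    has_state = os.path.isfile(os.path.join(checkpoint_dir, "trainer_state.json"))
    has_weights = any(os.path.isfile(os.path.join(checkpoint_dir, f)) for f in WEIGHT_FILES)
    return has_state and has_weights

def get_last_complete_checkpoint(output_dir):
    # Walk checkpoint-<step> directories from newest to oldest and return the
    # first one that passes the completeness check, or None if none do.
    candidates = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    for name in sorted(candidates, key=lambda d: int(d.split("-")[1]), reverse=True):
        path = os.path.join(output_dir, name)
        if is_complete_checkpoint(path):
            return path
    return None
```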
