Replies: 2 comments 2 replies
@iMountTai Hi, I was able to resolve it and no longer see the overflow issue. Thank you.
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```shell
torchrun --nnodes 1 --nproc_per_node 1 run_clm_llama_pretraining_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_type ${model_type} \
    --tokenizer_name_or_path ${tokenizer_path_lang_vac} \
    --dataset_dir ${dataset_dir_2} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.1 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed 42 \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir True \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True \
    --ddp_find_unused_parameters False
```
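For context, this RuntimeError commonly appears when `--gradient_checkpointing True` is combined with a LoRA/PEFT setup in which every base-model parameter is frozen: nothing in the checkpointed forward pass requires grad, so the loss has no `grad_fn`. A frequently used remedy in `transformers` is calling `model.enable_input_require_grads()` before training (it makes the embedding outputs require grad). The snippet below is a minimal sketch of the mechanism in plain PyTorch, not the script's actual code; the tensor shapes and names are illustrative:

```python
import torch

# A frozen weight, like the base model's parameters under LoRA.
w = torch.nn.Parameter(torch.randn(4), requires_grad=False)
x = torch.randn(4)  # input that also does not require grad

loss = (w * x).sum()
try:
    loss.backward()
    raised = False
except RuntimeError:
    # RuntimeError: element 0 of tensors does not require grad
    # and does not have a grad_fn
    raised = True

# Fix: force the inputs to require grad before the forward pass --
# analogous to what model.enable_input_require_grads() does in transformers.
x.requires_grad_(True)
loss = (w * x).sum()
loss.backward()  # succeeds; gradients now flow to x
```

If the frozen-everything diagnosis matches, also double-check that the `--trainable` LoRA modules actually match parameter names in the model, since a typo there leaves no trainable parameters at all.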