Thank you for providing the fine-tuning code and the script. I am facing the following issue while fine-tuning parler-tts-large on my custom dataset. Error message:
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank0]: main()
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank0]: optimizer.step()
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank0]: self.scaler.step(self.optimizer, closure)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank0]: self.unscale_(optimizer)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank1]: main()
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank1]: optimizer.step()
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank1]: self.unscale_(optimizer)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank1]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank1]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank1]: ValueError: Attempting to unscale FP16 gradients.
Train steps ... : 0%| | 0/52 [00:18<?, ?it/s]
E0114 02:59:22.646752 1129956 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1130038) of binary: /opt/conda/envs/new_audiomodel_env/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/new_audiomodel_env/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
optimizer = torch.optim.AdamW(
    # params=model.parameters(),
    params=[p for p in model.parameters() if p.requires_grad],  # changed to try to fix the optimizer fp16 issue, but not working
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)
I did try making the above change so that the optimizer only sees trainable parameters and the gradient unscaling is skipped for the rest, but it didn't work. When I tried bfloat16 instead, I got a CUDA out-of-memory error. Your help would be much appreciated.
./training/run_parler_tts_training.py FAILED
Failures:
[1]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1130039)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1130038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
My config details:
GCP with 2 T4 instances
torch dtype: float16
Define optimizer, LR scheduler, collator
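From reading the traceback, my understanding (which I'd appreciate someone confirming) is that GradScaler rejects any parameter whose gradient is already stored in fp16, because AMP expects master weights and gradients to stay in fp32, with only the forward/backward compute autocast to half precision. A torch-free sketch of the check, simplified from torch/amp/grad_scaler.py (illustrative only, not the real implementation):

```python
def unscale_grads(grads, inv_scale):
    """Sketch of GradScaler's unscale step.

    grads: list of (dtype_name, grad_value) pairs, one per parameter.
    inv_scale: 1 / loss scale, applied to each surviving gradient.
    """
    unscaled = []
    for dtype, grad in grads:
        if dtype == "float16":
            # Unscaling in half precision could silently overflow or
            # underflow, so GradScaler refuses fp16 gradients outright.
            raise ValueError("Attempting to unscale FP16 gradients.")
        unscaled.append(grad * inv_scale)
    return unscaled
```

If that reading is right, filtering the optimizer's parameter list (as in my snippet above) can't help: the surviving parameters still carry fp16 gradients whenever the model weights themselves were loaded in float16. The weights would need to stay in fp32, with fp16 applied only through autocast.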