Unable to do fine tuning of parler-tts-large on float16 #189

ysant77 opened this issue Jan 16, 2025 · 0 comments

ysant77 commented Jan 16, 2025

Hi there,

Thank you for providing the fine-tuning code and training script. I am facing the following issue when fine-tuning parler-tts-large on my custom dataset. Error message:

01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank0]: main()
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank0]: optimizer.step()
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank0]: self.scaler.step(self.optimizer, closure)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank0]: self.unscale_(optimizer)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank1]: main()
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank1]: optimizer.step()
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank1]: self.unscale_(optimizer)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank1]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank1]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank1]: ValueError: Attempting to unscale FP16 gradients.
Train steps ... : 0%| | 0/52 [00:18<?, ?it/s]
E0114 02:59:22.646752 1129956 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1130038) of binary: /opt/conda/envs/new_audiomodel_env/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/new_audiomodel_env/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./training/run_parler_tts_training.py FAILED

Failures:
[1]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1130039)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1130038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
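
Looking at the traceback, my understanding is that torch's GradScaler refuses to unscale gradients that are themselves stored in fp16: it expects the optimizer's (master) parameters and their gradients to be fp32, with only the forward/backward compute running in half precision. A toy sketch of the situation that triggers the same error (illustration only, not taken from the training script):

import torch

# Parameters stored in fp16 -> fp16 gradients, which GradScaler refuses to unscale
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.

So I suspect the problem is that the model weights themselves end up in fp16 when the torch dtype is set to float16, rather than anything specific to my dataset.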

My config details:

GCP instance with 2 × NVIDIA T4 GPUs
torch dtype: float16

# Define optimizer, LR scheduler, collator

optimizer = torch.optim.AdamW(
    # params=model.parameters(),
    params=[p for p in model.parameters() if p.requires_grad],  # changed to try to fix the fp16 optimizer issue, but it did not help
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)

I did try the change above (so that frozen parameters are skipped when gradients are unscaled/clipped), but it didn't work. Your help would be much appreciated. When I tried bfloat16 instead, I got a CUDA out-of-memory error.
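
For completeness, another change I'm considering (untested sketch, based on my understanding above, not something from the official script) is to cast the trainable parameters back to float32 before building the optimizer, so that GradScaler has fp32 gradients to unscale, and rely on mixed precision only for the forward/backward pass:

import torch

# Untested sketch: keep master weights in fp32 so GradScaler can unscale their gradients.
# `model` and `training_args` are the objects already defined in run_parler_tts_training.py.
for param in model.parameters():
    if param.requires_grad and param.dtype == torch.float16:
        param.data = param.data.float()

optimizer = torch.optim.AdamW(
    params=[p for p in model.parameters() if p.requires_grad],
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)

I realise keeping fp32 master weights would increase memory use, which may be a problem on the T4s, so any guidance on the recommended way to fine-tune the large checkpoint with float16 would be welcome.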