Unable to do fine tuning of parler-tts-large on float16 #189

ysant77 opened this issue Jan 16, 2025 · 0 comments

ysant77 commented Jan 16, 2025

Hi there,

Thank you for providing the fine-tuning code and training script. I am facing the following issue when fine-tuning parler-tts-large on my custom dataset. Error message:

01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank0]: main()
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank0]: optimizer.step()
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank0]: self.scaler.step(self.optimizer, closure)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank0]: self.unscale_(optimizer)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in
[rank1]: main()
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank1]: optimizer.step()
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank1]: self.unscale_(optimizer)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank1]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank1]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank1]: ValueError: Attempting to unscale FP16 gradients.
Train steps ... : 0%| | 0/52 [00:18<?, ?it/s]
E0114 02:59:22.646752 1129956 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1130038) of binary: /opt/conda/envs/new_audiomodel_env/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/new_audiomodel_env/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./training/run_parler_tts_training.py FAILED

Failures:
[1]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1130039)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1130038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
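
Looking at the traceback, my understanding is that torch's GradScaler refuses to unscale gradients that are themselves stored in fp16: it expects the optimizer's (master) parameters and their gradients to be fp32, with only the forward/backward compute running in half precision. A toy sketch of the situation that triggers the same error (illustration only, not taken from the training script):

import torch

# Parameters stored in fp16 -> fp16 gradients, which GradScaler refuses to unscale
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.

So I suspect the problem is that the model weights themselves end up in fp16 when the torch dtype is set to float16, rather than anything specific to my dataset.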

My config details:

GCP instance with 2 × NVIDIA T4 GPUs
torch dtype: float16

# Define optimizer, LR scheduler, collator

optimizer = torch.optim.AdamW(
    # params=model.parameters(),
    params=[p for p in model.parameters() if p.requires_grad],  # changed to try to fix the fp16 optimizer issue, but it did not help
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)

I did try the change above (so that frozen parameters are skipped when gradients are unscaled/clipped), but it didn't work. Your help would be much appreciated. When I tried bfloat16 instead, I got a CUDA out-of-memory error.
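
For completeness, another change I'm considering (untested sketch, based on my understanding above, not something from the official script) is to cast the trainable parameters back to float32 before building the optimizer, so that GradScaler has fp32 gradients to unscale, and rely on mixed precision only for the forward/backward pass:

import torch

# Untested sketch: keep master weights in fp32 so GradScaler can unscale their gradients.
# `model` and `training_args` are the objects already defined in run_parler_tts_training.py.
for param in model.parameters():
    if param.requires_grad and param.dtype == torch.float16:
        param.data = param.data.float()

optimizer = torch.optim.AdamW(
    params=[p for p in model.parameters() if p.requires_grad],
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)

I realise keeping fp32 master weights would increase memory use, which may be a problem on the T4s, so any guidance on the recommended way to fine-tune the large checkpoint with float16 would be welcome.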