Vindicator645 changed the title from "loss value seems wrong when after setting gradient_accumulate_step>1 in Deepspeed training" to "loss value and acc seems wrong when after setting gradient_accumulate_step>1 in Deepspeed training" on Oct 10, 2024.
System Info
Nvidia A100
Information
🐛 Describe the bug
When training a model with the asr_librispeech script, I get a loss of around 8 initially. With DDP the loss is also around 8, including when gradient accumulation is enabled. With DeepSpeed, however, gradient_accumulation_steps=1 gives an initial loss of 8 while gradient_accumulation_steps=10 gives 0.8. Setting gradient accumulation in ds_config has no effect, and gradient_accumulation_steps=10000 takes the same time as gradient_accumulation_steps=1.
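A plausible cause (an assumption on my part, not confirmed in this report): with gradient accumulation, the loss is divided by gradient_accumulation_steps before backward so that the summed micro-batch gradients average correctly, and DeepSpeed's engine applies this scaling internally inside backward(). If the training loop logs the value after that scaling, or divides the loss itself before logging, the reported number shrinks by exactly that factor (8 → 0.8 for 10 steps) while the true loss is unchanged. A minimal PyTorch sketch of the logging pitfall, using a toy model and hypothetical names rather than the actual asr_librispeech code:

```python
import torch

# Toy model and data to illustrate the suspected logging bug.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
gradient_accumulation_steps = 10

for step in range(20):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    # Dividing by gradient_accumulation_steps is correct for the
    # backward pass, since micro-batch gradients are summed...
    (loss / gradient_accumulation_steps).backward()

    # ...but logging the divided value shrinks the reported loss by
    # exactly that factor. Log the unscaled loss instead:
    print(f"step {step}: loss={loss.item():.4f}")

    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

If this is what is happening, it would also be consistent with the timing observation: gradient accumulation only changes how often optimizer.step() runs, so per-batch wall time is essentially the same for gradient_accumulation_steps=1 and 10000.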
Error logs
loss=8 with gradient_accumulation_steps=1 and loss=0.8 with gradient_accumulation_steps=10
Expected behavior
The loss should be on the same order of magnitude regardless of gradient_accumulation_steps.