
Loss value and acc seem wrong after setting gradient_accumulate_step>1 in DeepSpeed training #144

Open
Vindicator645 opened this issue Oct 10, 2024 · 3 comments

Comments

@Vindicator645

System Info

Nvidia A100

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When training a model with the asr_librispeech script, the initial loss is around 8. With DDP the initial loss is also around 8, even with gradient accumulation enabled. With DeepSpeed, however, the initial loss is 8 when gradient_accumulation_steps=1 but drops to 0.8 when gradient_accumulation_steps=10. In addition, setting gradient accumulation in ds_config does nothing: gradient_accumulation_steps=10000 takes the same time as gradient_accumulation_steps=1.
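For context, this is roughly how the accumulation setting is passed through ds_config (a minimal sketch with illustrative values, not the exact asr_librispeech configuration):

```python
import torch
import deepspeed

# Illustrative ds_config; in the real run this comes from a JSON file.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 10,  # the setting that appears to be ignored
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(80, 32)  # placeholder model for the sketch

# deepspeed.initialize reads gradient_accumulation_steps from the config and
# the returned engine accumulates gradients across micro-batches internally.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
print(model_engine.gradient_accumulation_steps())  # expected to print 10
```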

Error logs

loss=8 for gradient_accumulation_steps=1 and loss=0.8 for gradient_accumulation_steps=10

Expected behavior

The loss should be of the same magnitude regardless of gradient_accumulation_steps.

@Vindicator645
Author

I suspect the loss = loss / gradient_accumulation_steps and acc = acc / gradient_accumulation_steps lines should be removed in deepspeed_utils.
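To make the suspicion concrete, here is a rough sketch of the pattern being described (the function shape and names are assumptions for illustration, not the verified contents of deepspeed_utils). Since DeepSpeed's engine handles the gradient-side scaling for accumulation itself, the extra division mainly shrinks the reported metrics:

```python
# Hypothetical shape of the per-step logic in deepspeed_utils (names assumed).
def train_step(model_engine, batch, gradient_accumulation_steps):
    loss, acc = model_engine(**batch)  # assumed forward signature, for illustration

    # Suspected culprit: rescaling the *metrics* makes the logged loss drop from
    # ~8 to ~0.8 when gradient_accumulation_steps=10. These lines would be removed:
    # loss = loss / gradient_accumulation_steps
    # acc = acc / gradient_accumulation_steps

    # DeepSpeed's backward() already accounts for gradient accumulation when
    # gradient_accumulation_steps is set in ds_config, so the raw loss can be
    # passed in and logged as-is.
    model_engine.backward(loss)
    model_engine.step()
    return loss.item(), acc
```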

@fclearner

try this:

    model_engine.backward(loss)

    if (step + 1) % model_engine.gradient_accumulation_steps() == 0:
        model_engine.step()
        model_engine.zero_grad()

@fclearner


Sorry, removing gradient_accumulation_steps is enough.
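For reference, the loop pattern described in the DeepSpeed docs calls backward() and step() on every micro-batch and lets the engine decide when the optimizer actually runs; a minimal sketch (the loader and forward signature are illustrative):

```python
# Minimal DeepSpeed micro-batch loop; model_engine comes from deepspeed.initialize()
# with gradient_accumulation_steps set in ds_config.
for step, batch in enumerate(train_loader):
    loss = model_engine(batch)  # forward pass (signature is illustrative)

    # backward() scales for gradient accumulation internally.
    model_engine.backward(loss)

    # step() is called every micro-batch; the engine only runs the optimizer
    # (and zeroes gradients) on accumulation boundaries.
    model_engine.step()

    if step % 100 == 0:
        # Log the unscaled loss so the reported value does not depend on
        # gradient_accumulation_steps.
        print(f"step {step}: loss = {loss.item():.3f}")
```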
