
Loss value and acc seem wrong after setting gradient_accumulate_step>1 in DeepSpeed training #144

Open
Vindicator645 opened this issue Oct 10, 2024 · 3 comments

Comments

@Vindicator645

System Info

Nvidia A100

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When training a model with the asr_librispeech script, the initial loss is around 8. With DDP the initial loss is also around 8, even with gradient accumulation enabled. With DeepSpeed, however, the initial loss is 8 when gradient_accumulation_steps=1 but drops to 0.8 when gradient_accumulation_steps=10. In addition, setting gradient accumulation in ds_config does nothing: gradient_accumulation_steps=10000 takes the same time as gradient_accumulation_steps=1.
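For context, this is roughly how the accumulation setting is passed through ds_config (a minimal sketch with illustrative values, not the exact asr_librispeech configuration):

```python
import torch
import deepspeed

# Illustrative ds_config; in the real run this comes from a JSON file.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 10,  # the setting that appears to be ignored
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(80, 32)  # placeholder model for the sketch

# deepspeed.initialize reads gradient_accumulation_steps from the config and
# the returned engine accumulates gradients across micro-batches internally.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
print(model_engine.gradient_accumulation_steps())  # expected to print 10
```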

Error logs

loss=8 for gradient_accumulation_steps=1 and loss=0.8 for gradient_accumulation_steps=10

Expected behavior

The loss should be of the same magnitude regardless of gradient_accumulation_steps.

@Vindicator645
Author

I suspect the loss = loss / gradient_accumulation_steps and acc = acc / gradient_accumulation_steps lines should be removed in deepspeed_utils.
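To make the suspicion concrete, here is a rough sketch of the pattern being described (the function shape and names are assumptions for illustration, not the verified contents of deepspeed_utils). Since DeepSpeed's engine handles the gradient-side scaling for accumulation itself, the extra division mainly shrinks the reported metrics:

```python
# Hypothetical shape of the per-step logic in deepspeed_utils (names assumed).
def train_step(model_engine, batch, gradient_accumulation_steps):
    loss, acc = model_engine(**batch)  # assumed forward signature, for illustration

    # Suspected culprit: rescaling the *metrics* makes the logged loss drop from
    # ~8 to ~0.8 when gradient_accumulation_steps=10. These lines would be removed:
    # loss = loss / gradient_accumulation_steps
    # acc = acc / gradient_accumulation_steps

    # DeepSpeed's backward() already accounts for gradient accumulation when
    # gradient_accumulation_steps is set in ds_config, so the raw loss can be
    # passed in and logged as-is.
    model_engine.backward(loss)
    model_engine.step()
    return loss.item(), acc
```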

@fclearner

try this:

    model_engine.backward(loss)

    if (step + 1) % model_engine.gradient_accumulation_steps() == 0:
        model_engine.step()
        model_engine.zero_grad()

@fclearner


Sorry, removing gradient_accumulation_steps is enough.
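For reference, the loop pattern described in the DeepSpeed docs calls backward() and step() on every micro-batch and lets the engine decide when the optimizer actually runs; a minimal sketch (the loader and forward signature are illustrative):

```python
# Minimal DeepSpeed micro-batch loop; model_engine comes from deepspeed.initialize()
# with gradient_accumulation_steps set in ds_config.
for step, batch in enumerate(train_loader):
    loss = model_engine(batch)  # forward pass (signature is illustrative)

    # backward() scales for gradient accumulation internally.
    model_engine.backward(loss)

    # step() is called every micro-batch; the engine only runs the optimizer
    # (and zeroes gradients) on accumulation boundaries.
    model_engine.step()

    if step % 100 == 0:
        # Log the unscaled loss so the reported value does not depend on
        # gradient_accumulation_steps.
        print(f"step {step}: loss = {loss.item():.3f}")
```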
