
half grads don't necessarily get released before the next forward #272

Open

NouamaneTazi opened this issue Jan 22, 2025 · 0 comments
Labels: help wanted (Extra attention is needed), High Priority

NouamaneTazi (Member) commented Jan 22, 2025

Here's a plot summarizing the issue:
[plot]

In the second `train_batch_iter`: even though we set `half_param.grad = None` in `_accumulate_grad`, for some reason the half grads don't get released until the end of the forward pass (specifically, not until after the `column_linear` call in the last `cast_to_fp32` in `nanotron/models/llama.py:906:forward_with_hidden_states`).
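For context, here is a minimal sketch of the pattern being discussed (illustrative names, not the exact nanotron implementation): the accumulator adds each half-precision grad into an fp32 master buffer and then drops the half grad, which should make that memory reusable before the next forward.

```python
import torch

def _accumulate_grad(half_param: torch.nn.Parameter, fp32_grad: torch.Tensor) -> None:
    # Accumulate the low-precision grad into the fp32 master-grad buffer ...
    fp32_grad.add_(half_param.grad)
    # ... then drop the half grad so its memory can be reused before the next
    # forward. The point of this issue: in practice the block often isn't
    # reusable until late into the next forward pass.
    half_param.grad = None
```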

Possible culprits:

  • Something in `column_linear` keeps a reference to our half grads (see the sketch after this list)
  • torch does not reliably release grads before a new forward
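On the first culprit, a standalone sketch (not nanotron code) of how a lingering reference would produce exactly this behavior: setting `.grad = None` only drops one reference, so if any other object still points at the grad tensor, the CUDA caching allocator cannot reuse that block before the next forward.

```python
import torch

# Hypothetical tensors for illustration only.
param = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16))
param.grad = torch.randn_like(param)   # pretend backward produced a half grad

lingering_ref = param.grad             # e.g. a reference kept by an op or a closure
param.grad = None                      # does NOT free the grad block

print(torch.cuda.memory_allocated())   # still counts the grad's memory
del lingering_ref                      # only now can the allocator reuse the block
print(torch.cuda.memory_allocated())
```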

Update: found another case (with `batch_accum=3`) where the grads don't get cleared until the end of the forward, so the culprit may not be `cast_to_fp32` after all?
[plot]

Adding a `torch.cuda.empty_cache()` call in `PipelineEngine.forward`, right before the forward pass, solves the issue:
[plot]

but calling `empty_cache()` before every forward is expensive (freeing cached blocks back to the driver can synchronize the device), so this is more of a workaround than a proper fix.
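A minimal sketch of the workaround, assuming a `PipelineEngine.forward` roughly shaped like nanotron's (the `model` / `micro_batch` arguments are illustrative, not the real signature):

```python
import torch

class PipelineEngine:
    def forward(self, model, micro_batch):
        # Workaround: release unoccupied cached blocks right before the forward,
        # so memory from already-dropped half grads is actually available again.
        torch.cuda.empty_cache()
        return model(**micro_batch)
```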

NouamaneTazi added the help wanted (Extra attention is needed) and High Priority labels on Jan 22, 2025