DeepSpeed Stage 3 in lightning leads to Nan and Inf values in the model parameters. #20534

Open
LittleFlyingSheep opened this issue Jan 7, 2025 · 2 comments
Labels: bug (Something isn't working), ver: 2.5.x, waiting on author (Waiting on user action, correction, or update)

@LittleFlyingSheep

Bug description

I am trying to use Lightning with DeepSpeed stage 3 to train a model with precision "16-mixed". However, the model parameters contain NaN and Inf values at the first step. When I switch the strategy to DDP, the issue does not occur.

I initialize my trainer as:

trainer = Trainer(
        max_epochs=max_epochs,
        logger=logger,
        callbacks=[checkpoint_callback, lr_monitor],
        sync_batchnorm=sync_batchnorm,
        check_val_every_n_epoch=None,
        val_check_interval=every_n_train_steps * accumulate_grad_batches,
        devices="auto",
        accelerator="gpu",
        precision="16-mixed",
        strategy=deepspeed_stage_3,
        accumulate_grad_batches=accumulate_grad_batches, 
    )
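
For anyone trying to confirm the symptom described above, one option is to scan the model's parameters for non-finite values right after the first step. The following is only a minimal sketch in plain PyTorch: `find_nonfinite_params` is a hypothetical helper name, not part of Lightning or DeepSpeed. It also assumes the parameters are materialized on the current rank. Under ZeRO stage 3 the parameters are partitioned and can appear as empty placeholders, so they would likely need to be gathered first (for example inside `deepspeed.zero.GatheredParameters`), which is not shown here.

```python
import torch
from torch import nn


def find_nonfinite_params(model: nn.Module) -> dict[str, tuple[int, int]]:
    """Map parameter name -> (nan_count, inf_count) for parameters holding NaN/Inf."""
    report = {}
    for name, param in model.named_parameters():
        data = param.detach()
        if data.numel() == 0:
            # Assumption: an ungathered ZeRO stage 3 parameter may be an empty
            # placeholder on this rank, so there is nothing to inspect.
            continue
        nan_count = int(torch.isnan(data).sum())
        inf_count = int(torch.isinf(data).sum())
        if nan_count or inf_count:
            report[name] = (nan_count, inf_count)
    return report


# Tiny CPU-only demonstration with deliberately corrupted weights.
demo = nn.Linear(2, 2)
with torch.no_grad():
    demo.weight[0, 0] = float("nan")
    demo.bias[1] = float("inf")
demo_report = find_nonfinite_params(demo)
```

Calling such a helper from a hook like `on_train_batch_end` and printing the result for both the DeepSpeed and DDP runs could make the difference between the two strategies easier to show in a reproduction.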

What version are you seeing the problem on?

v2.5

How to reproduce the bug

trainer = Trainer(
        max_epochs=max_epochs,
        logger=logger,
        callbacks=[checkpoint_callback, lr_monitor],
        sync_batchnorm=sync_batchnorm,
        check_val_every_n_epoch=None,
        val_check_interval=every_n_train_steps * accumulate_grad_batches,
        devices="auto",
        accelerator="gpu",
        precision="16-mixed",
        strategy=deepspeed_stage_3,
        accumulate_grad_batches=accumulate_grad_batches,
    )

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

LittleFlyingSheep added the bug and needs triage labels on Jan 7, 2025
@lantiga
Collaborator

lantiga commented Jan 7, 2025

Hi @LittleFlyingSheep, can you provide a minimal reproduction? It will help the investigation.

lantiga added the waiting on author label and removed the needs triage label on Jan 7, 2025
@LittleFlyingSheep
Author

LittleFlyingSheep commented Jan 7, 2025 via email
