DeepSpeed Stage 3 in Lightning leads to NaN and Inf values in the model parameters #20534
Labels
bug
ver: 2.5.x
waiting on author
Bug description
I am using Lightning with DeepSpeed stage 3 to train a model with precision "16-mixed". However, the model parameters contain NaN and Inf values at the very first step. When I switch the strategy to DDP, the issue does not occur.
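For reference, a check for NaN and Inf values in parameters can be sketched roughly as follows. This is a minimal illustration, not the code from the original report: the helper name `count_non_finite` is hypothetical, and it assumes the parameter values have already been converted to plain Python floats.

```python
import math
from typing import Dict, Iterable, Tuple


def count_non_finite(
    named_values: Iterable[Tuple[str, Iterable[float]]],
) -> Dict[str, Tuple[int, int]]:
    """Return {name: (nan_count, inf_count)} for every entry that
    contains at least one NaN or +/-Inf value.

    Hypothetical helper for illustration; `named_values` is assumed to
    yield (parameter_name, flat_iterable_of_floats) pairs.
    """
    report: Dict[str, Tuple[int, int]] = {}
    for name, values in named_values:
        nan_count = 0
        inf_count = 0
        for v in values:
            if math.isnan(v):
                nan_count += 1
            elif math.isinf(v):
                inf_count += 1
        # Only record parameters that actually contain bad values.
        if nan_count or inf_count:
            report[name] = (nan_count, inf_count)
    return report


if __name__ == "__main__":
    demo = [
        ("layer.weight", [0.1, float("nan"), 0.3]),
        ("layer.bias", [0.0, 1.0]),
        ("head.weight", [float("inf"), float("-inf")]),
    ]
    for name, (nans, infs) in count_non_finite(demo).items():
        print(f"{name}: {nans} NaN, {infs} Inf")
```

With PyTorch, the pairs could be built from `model.named_parameters()` by calling `p.detach().float().flatten().tolist()` on each parameter. Note that under DeepSpeed stage 3 the parameters are partitioned across ranks, so a check run on one rank may only see that rank's shard unless the parameters are gathered first.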
I initialize my trainer as:
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response