Training crashes after 50 epochs #290
My training runs always crash after exactly 50 epochs. Looking at the log, there are many repetitions of the same error before it finally exits with a fatal error. Any idea what could be causing it?
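Since the crash reproduces at a fixed epoch, it can help to capture Python tracebacks from every rank at the moment of failure to see which epoch-boundary operation is involved. This is a generic diagnostic sketch, not something suggested in the thread; it uses only the standard library:

```python
# Generic diagnostic sketch (not from this thread): print tracebacks from all
# threads on fatal signals, and periodically dump them so a hang or crash near
# the failure point leaves evidence in each rank's log. Add this at the top of
# the training script.
import faulthandler
import sys

faulthandler.enable()  # tracebacks on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL
faulthandler.dump_traceback_later(1800, repeat=True, file=sys.stderr)
```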
Comments

Some users started seeing similar behavior to this, so I added a workaround to the README. Could it be the root of your issue too? I am assuming this is a multi-GPU training. I do not remember the error being as consistent as you say (always exactly 50 epochs), so it might be unrelated.

torchmd-net/torchmdnet/data.py, lines 132 to 139 at 166b7db

It would be great if you could try it.
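Neither the README workaround nor the referenced data.py lines are quoted in this thread. As a loose illustration of the kind of environment-level workaround often tried for multi-GPU NCCL hangs, one might disable peer-to-peer GPU communication; treating this as the suggested workaround is an assumption:

```python
# Illustration under an assumption: NCCL_P2P_DISABLE is a real NCCL environment
# variable that disables direct peer-to-peer GPU communication, but whether it
# matches the README workaround referenced above is not confirmed by the thread.
# It must be set before the NCCL process group is initialized.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"
# ...then launch training as usual.
```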
Thanks! Yes, this is with multiple GPUs. I just started a run with the workaround applied.
Crossing my fingers, but I think that did the trick.