There have been reports that running a long job on Derecho with a large domain will cause the first rank to use up all the memory and crash.
The image below was made using ARM Linaro Forge to show an example of the memory pattern across a node.
Expected Behavior
Large-domain jobs run successfully for extended periods of time.
Current Behavior
The job crashes after memory usage on the first rank rises continually.
Possible Solution
I've used Valgrind to track down some memory warnings/errors and made some possible fixes (see the debug/valgrind-errors branch). More testing needs to be done to see if they fix the issue.
Steps to Reproduce (for bugs)
Use a large domain
Run for a long time
Track memory usage on the first rank: does it level off, or does it continue to grow until the job crashes?
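For the memory-tracking step, a small sampling helper may be enough when a profiler is not available. The sketch below is only an illustration and is not part of the code base: it assumes a Linux compute node where `/proc/<pid>/status` exposes a `VmRSS` line, and the function names (`parse_vm_rss_kib`, `read_rss_kib`, `grows_monotonically`) are hypothetical.

```python
import re
from typing import List, Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status.

    Returns None if no VmRSS line is present.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: str = "self") -> Optional[int]:
    """Read the resident set size of a process (assumes Linux /proc)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def grows_monotonically(samples: List[int], min_increase_kib: int = 0) -> bool:
    """True if each sample exceeds the previous one by more than min_increase_kib.

    Needs at least two samples; a single sample cannot show growth.
    """
    if len(samples) < 2:
        return False
    return all(b - a > min_increase_kib for a, b in zip(samples, samples[1:]))
```

Calling `read_rss_kib(pid)` at a fixed interval with the process ID of the first rank, and passing the collected values to `grows_monotonically`, gives a rough yes/no answer to the question in the last step. A steadily rising series is consistent with the reported behavior but does not by itself identify the source of the growth.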
Your Environment
Version of the code used: main branch
Operating System and version: Derecho, running with one node
Compiler and version: Cray 16.0.1