I am trying to run a simulation using the mp0b small foundation model in LAMMPS. The system size is about 3,000 atoms. I have successfully run smaller simulations (consisting of the same chemical elements) with the same model in LAMMPS. However, this new, larger system runs into a "CUDA out of memory" error:
```
RuntimeError: CUDA out of memory. Tried to allocate 9.71 GiB. GPU 2 has a total capacity of 39.38 GiB of which 7.51 GiB is free. Including non-PyTorch memory, this process has 31.87 GiB memory in use. Of the allocated memory 28.80 GiB is allocated by PyTorch, and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
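One mitigation I can try is the allocator setting suggested in the error message, although with only 2.48 GiB reserved-but-unallocated it probably will not cover a 9.71 GiB allocation by itself. For a Python-driven run the variable has to be set before PyTorch initializes its CUDA allocator (for the LAMMPS binary I would export it in the shell before launching instead); a minimal sketch:

```python
import os

# Must be set before PyTorch initializes its CUDA caching allocator,
# i.e. before the first CUDA tensor is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var

x = torch.zeros(1, device="cuda")  # allocator now uses expandable segments
print(torch.cuda.memory_allocated())
```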
My initial hunch was that the memory requirement grows with the system size. I am not 100% certain, though, so an explanation of this assumption would be great. Based on that assumption, I tried to use more GPUs so that more memory would be available for the calculation. These are 40 GB A100 GPUs, and I tried using 8 of them. However, I still run into the same error.
This leads to several questions. Is MACE unable to distribute memory across several GPUs properly? Or is the total memory requirement already higher than what 8 GPUs provide (8 × 40 GiB = 320 GiB)? And how can one estimate the memory requirement for a given system size (i.e., number of atoms)?
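To get at the last question empirically, I could run a few single-point evaluations outside LAMMPS and record the peak allocation as a function of atom count. A rough sketch, assuming the mace-torch and ASE Python packages are installed; the Cu supercell is just an illustrative stand-in for my actual composition, and `mace_mp(model="small")` is my assumption for loading the same small foundation model:

```python
import torch
from ase.build import bulk
from mace.calculators import mace_mp

# Load the small MACE-MP foundation model on the GPU.
# default_dtype="float64" matches the calculator default; float32
# would roughly halve the memory footprint.
calc = mace_mp(model="small", device="cuda", default_dtype="float64")

for n in (4, 6, 8, 10):
    # Conventional fcc Cu cell (4 atoms), replicated n times per axis.
    atoms = bulk("Cu", cubic=True) * (n, n, n)  # 4 * n**3 atoms
    atoms.calc = calc
    torch.cuda.reset_peak_memory_stats()
    atoms.get_potential_energy()  # one energy/force evaluation
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{len(atoms):6d} atoms -> peak {peak:.2f} GiB")
```

Extrapolating the printed peak values to ~3,000 atoms should show whether a single 40 GB card can ever hold this system with this model.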
Or am I doing something wrong? The MACE-OFF23 paper ("MACE-OFF23: Transferable machine learning force fields for organic molecules") mentions a few simulations with 18,000 atoms, or 5,184 atoms/GPU (Figure 10), but no details are provided for those runs.