Multi-GPU run on LAMMPS facing "CUDA out of memory" error #777

Open
fbhuiyan2 opened this issue Jan 12, 2025 · 0 comments
fbhuiyan2 commented Jan 12, 2025

I am trying to run a simulation in LAMMPS using the mp0b small foundation model. The system contains about 3k atoms. I have successfully run simulations of smaller systems (with the same chemical elements) using the same model in LAMMPS, but this new, larger system runs into a "CUDA out of memory" error:

RuntimeError: CUDA out of memory. Tried to allocate 9.71 GiB. GPU 2 has a total capacity of 39.38 GiB of which 7.51 GiB is free. Including non-PyTorch memory, this process has 31.87 GiB memory in use. Of the allocated memory 28.80 GiB is allocated by PyTorch, and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
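
One mitigation the error message itself suggests is switching the CUDA caching allocator to expandable segments. If the model were driven from Python (e.g. via the ASE calculator) this would have to be set before torch initializes; for the LAMMPS binary I assume the same variable can simply be exported in the job script. A minimal sketch (not yet tested on my side):

```python
import os

# Suggested by the OOM message: reduce allocator fragmentation.
# Must be set before torch first initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator config is in place
```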

My initial hunch was that the memory requirement increases with system size. I am not 100% certain about this, so an explanation of whether that assumption holds would be great. Based on that assumption, I tried using more GPUs so that more memory would be available for the calculation. Note: these are A100 GPUs, and I tried using 8 of them. However, I still run into the same error.
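
To rule out the possibility that all the work is landing on a single device, here is a quick per-GPU free-memory probe that could be run alongside the job (a sketch using the standard torch.cuda API; nothing MACE-specific):

```python
import torch

# Print free/total memory for every visible GPU. If only one device shows
# heavy usage, the run is not actually spreading the system across all 8 GPUs.
for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")
```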

This leads to several questions. Is MACE unable to distribute memory across several GPUs properly? Or is the total memory requirement already higher than what 8 GPUs provide (8 × 40 GiB = 320 GiB)? How can we estimate the memory requirement for a given system size (i.e., number of atoms)?
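
For context, my rough mental model (an assumption on my part, not something from the MACE docs) is that memory scales with the number of neighbor pairs (graph edges) rather than with atoms alone, which at fixed density is still roughly linear in atom count. A back-of-the-envelope sketch of that reasoning; the estimate_edges helper, density, and cutoff values are purely illustrative:

```python
import math

def estimate_edges(n_atoms: int, density_per_A3: float, r_cut_A: float = 6.0) -> int:
    """Rough neighbor-pair (edge) count: atoms x average neighbors within r_cut.

    The density and cutoff here are placeholders, not parameters read from mp0b.
    """
    neighbors_per_atom = density_per_A3 * (4.0 / 3.0) * math.pi * r_cut_A**3
    return int(n_atoms * neighbors_per_atom)

# Example: ~0.05 atoms/A^3 and a 6 A cutoff give ~45 neighbors per atom,
# so a 3000-atom system has on the order of 1.4e5 edges.
print(estimate_edges(3000, 0.05))
```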

Or am I doing something wrong? The MACE-OFF23 paper ("MACE-OFF23: Transferable Machine Learning Force Fields for Organic Molecules") mentions a few simulations with 18,000 atoms, or 5184 atoms/GPU (Figure 10), but no details are provided for those runs.
