"Non-numeric pressure" error (pressure = -nan) when attempting NPT in LAMMPS #53
Hi @samueldyoung29ctr, hm, odd. If you start a simulation in LAMMPS from exactly a frame that shows up in your training data, or the one that shows a stress tensor in …
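(A minimal sketch of setting up such a test, assuming ASE is installed and a hypothetical train_val.traj training file: pull one training frame and write it out as a LAMMPS data file.)

from ase.io import read, write

# Hypothetical paths: take one frame from the training set and write it in
# LAMMPS data format so the simulation can start from a known training geometry.
frame = read("train_val.traj", index=0)
write("training-frame.data", frame, format="lammps-data", specorder=["H", "O"])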
All of the starting geometries I launch from LAMMPS appear to show …

Diagnostic printing of virial tensor elements

pair_allegro.cpp:

if (vflag) {
torch::Tensor v_tensor = output.at("virial").toTensor().cpu();
auto v = v_tensor.accessor<outputtype, 3>();
// Convert from 3x3 symmetric tensor format, which NequIP outputs, to the flattened form LAMMPS expects
// First [0] index on v is batch
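// LAMMPS stores the virial in Voigt order: xx, yy, zz, xy, xz, yz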
virial[0] = v[0][0][0];
virial[1] = v[0][1][1];
virial[2] = v[0][2][2];
virial[3] = v[0][0][1];
virial[4] = v[0][0][2];
virial[5] = v[0][1][2];
+ std::cout << "\tVirial Voigt vector: " << std::to_string(virial[0]) << ", " << std::to_string(virial[1]) << ", " << std::to_string(virial[2]) << ", " << std::to_string(virial[3]) << ", " << std::to_string(virial[4]) << ", " << std::to_string(virial[5]) << ".\n";
}

pair_allegro_kokkos.cpp:

if(vflag){
torch::Tensor v_tensor = output.at("virial").toTensor().cpu();
auto v = v_tensor.accessor<outputtype, 3>();
// Convert from 3x3 symmetric tensor format, which NequIP outputs, to the flattened form LAMMPS expects
// First [0] index on v is batch
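// LAMMPS stores the virial in Voigt order: xx, yy, zz, xy, xz, yz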
this->virial[0] = v[0][0][0];
this->virial[1] = v[0][1][1];
this->virial[2] = v[0][2][2];
this->virial[3] = v[0][0][1];
this->virial[4] = v[0][0][2];
this->virial[5] = v[0][1][2];
+ std::cout << "\tVirial Voigt vector: " << std::to_string(this->virial[0]) << ", " << std::to_string(this->virial[1]) << ", " << std::to_string(this->virial[2]) << ", " << std::to_string(this->virial[3]) << ", " << std::to_string(this->virial[4]) << ", " << std::to_string(this->virial[5]) << ".\n";
}

All virial components are …

Outputs from running 10 steps of NVE using …
… and I can do NPT. Output when running …

… and the same error message when attempting NPT:
Looks like the workaround is to use the non-Kokkos pair_allegro.

Lines 451 to 455 in 20538c9
Happy to do some more testing if you'd like.
Aha. I don't think we've ever actually tested Kokkos pair_allegro on CPU, nor am I sure we'd expect it to have any benefits over the plain OpenMP pair_allegro on CPU. @anjohan thoughts? Still, I guess we would have expected it to work...
Also, just to clarify, we should be training Allegro models on stresses in units of energy / length^3, right? E.g., for LAMMPS metal units, we should train Allegro on stresses in units of eV/ang^3, not units of bar?
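(For context, a minimal sketch of the unit relationship, assuming ASE: atoms.get_stress() returns stress in eV/Angstrom^3 in Voigt order (xx, yy, zz, yz, xz, xy), while LAMMPS metal units report pressure in bar.)

from ase import units

# 1 eV/Angstrom^3 expressed in bar (LAMMPS metal units): about 1.6e6 bar.
print(1.0 / units.bar)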
Update: I got around to compiling LAMMPS with CUDA support, but am still seeing this issue when using Kokkos to utilize the GPUs (NVIDIA A100-SXM4-40GB GPUs, CUDA 12.4 drivers installed).
My CMake config looks like this:

CMake config

cmake ../cmake \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/gcc.cmake \
-C ../cmake/presets/kokkos-cuda.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D Kokkos_ARCH_PASCAL60=no \
-D Kokkos_ARCH_AMPERE80=yes \
-D Kokkos_ENABLE_CUDA=yes \
-D Kokkos_ENABLE_OPENMP=yes \
-D CUFFT_LIBRARY=$CUDA_HOME/lib64/libcufft.so \
-D CUDA_INCLUDE_DIRS=$CUDA_HOME/include \
-D CUDA_CUDART_LIBRARY=$CUDA_HOME/lib64/libcudart.so \
-D CAFFE2_USE_CUDNN=ON \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=CUFFT \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D CMAKE_VERBOSE_MAKEFILE=TRUE

I am building a non-debug version of LAMMPS because the build fails when I try to enable debugging symbols, no matter whether I use GCC or NVHPC compilers. The Allegro model I am using was trained using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (py3.10_cuda11.3_cudnn8.2.0_0). It was also trained on an NVIDIA A100-SXM4-40GB GPU.

Allegro training config

BesselBasis_trainable: true
PolynomialCutoff_p: 48
append: true
ase_args:
format: traj
avg_num_neighbors: auto
batch_size: 1
chemical_symbols:
- H
- O
dataset: ase
dataset_file_name: <path to train+val dataset as .traj file>
dataset_seed: 123456
default_dtype: float64
early_stopping_lower_bounds:
LR: 1.0e-05
early_stopping_patiences:
validation_loss: 100
early_stopping_upper_bounds:
cumulative_wall: 604800.0
edge_eng_mlp_initialization: uniform
edge_eng_mlp_latent_dimensions:
- 32
edge_eng_mlp_nonlinearity: null
ema_decay: 0.999
ema_use_num_updates: true
embed_initial_edge: true
env_embed_mlp_initialization: uniform
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_multiplicity: 64
l_max: 2
latent_mlp_initialization: uniform
latent_mlp_latent_dimensions:
- 64
- 64
- 64
- 64
latent_mlp_nonlinearity: silu
latent_resnet: true
learning_rate: 0.001
loss_coeffs:
forces:
- 1
- PerSpeciesL1Loss
stress: 5000
total_energy:
- 20.0
- PerAtomL1Loss
lr_scheduler_kwargs:
cooldown: 0
eps: 1.0e-08
factor: 0.9
min_lr: 0
mode: min
patience: 400
threshold: 0.0001
threshold_mode: rel
verbose: false
lr_scheduler_name: ReduceLROnPlateau
max_epochs: 50000
metrics_components:
- - forces
- mae
- PerSpecies: true
report_per_component: false
- - forces
- rmse
- PerSpecies: true
report_per_component: false
- - forces
- rmse
- - forces
- mae
- - total_energy
- mae
- PerAtom: true
- - total_energy
- mae
- PerAtom: true
- - total_energy
- rmse
- PerAtom: true
- - stress
- mae
- - stress
- rmse
metrics_key: validation_loss
model_builders:
- allegro.model.Allegro
- PerSpeciesRescale
- StressForceOutput
- RescaleEnergyEtc
n_train: 2250
n_val: 250
num_layers: 2
optimizer_kwargs:
amsgrad: false
betas: !!python/tuple
- 0.9
- 0.999
eps: 1.0e-08
weight_decay: 0.0
optimizer_name: Adam
parity: o3_full
r_max: 4.0
root: results/wateronly
run_name: hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc
seed: 123456
shuffle: true
train_val_split: random
two_body_latent_mlp_initialization: uniform
two_body_latent_mlp_latent_dimensions:
- 32
- 64
two_body_latent_mlp_nonlinearity: silu
use_ema: true
verbose: debug
wandb: true
wandb_project: <project name>

After deploying the best model to TorchScript format, I attempted to use it in a LAMMPS NPT simulation. The input geometry is a water-only system:

geom-thermalized-298.15K.data
There are no atom overlaps in this geometry, and the LAMMPS input script attempts to do NPT.

input.lammps

# LAMMPS script for our MD systems to validate Allegro potentials
# System-wide settings
units metal
dimension 3
atom_style atomic
boundary p p p
# System geometry
# initial_frame.data will be written into the working directory where this
# script is located.
read_data ./geom-thermalized-298.15K.data
# Simulation settings
mass 1 1.008
mass 2 15.999
pair_style allegro
pair_coeff * * ./hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc.pth H O
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 1
thermo_style custom step time temp press pe ke etotal epair ebond econserve fmax
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes binsize 10.0
# Also try to specify larger cutoff for ghost atoms to avoid losing atoms.
comm_modify mode single cutoff 10.0 vel yes
# Try specifying initial velocities for all atoms
velocity all create 298.15 3127835 dist gaussian
# Run MD in the NPT ensemble, with a Nosé-Hoover thermostat starting at 298.15 K and a barostat starting at 1.01325 bar.
fix mynose all npt &
temp 298.15 298.15 0.011 &
tchain 3 &
iso 1.01325 1.01325 0.03
# Be sure to dump the MD trajectory
dump mdtraj all atom 40 mdtraj.lammpstrj
dump mdforces all custom 40 mdforces.lammpstrj id x y z vx vy vz fx fy fz
timestep 0.0005
# Set up binary restart dumps every 1000 steps in case something goes wrong.
restart 1000 step-*.restart
# Normal run, with a single balance first
balance 1.0 shift xyz 100 1.0
run 20000
undump mdtraj
undump mdforces
# Finally, write out the final geometry of the system
write_data geom-equilibrated-1atm.data

I invoke LAMMPS with Kokkos:

srun --cpu-bind=cores --gpu-bind=none lmp -k on g 4 -sf kk -pk kokkos neigh full newton on -in input.lammps

This results in the following error:

LAMMPS stdout
If I instead change to NVT like this:

fix mynose all nvt &
temp 298.15 298.15 0.011 &
tchain 3 &
# iso 1.01325 1.01325 0.03

and again run with Kokkos, the run starts, but with …

LAMMPS stdout with NVT, using Kokkos
If I take this same job, still using the GPU-enabled LAMMPS binary, and run it as a CPU-only job without Kokkos:

srun --cpu-bind=cores --gpu-bind=none lmp -in input.lammps

then pressures are calculated (although they are quite high):

LAMMPS stdout with NVT, no Kokkos
Any advice on what to try? I have been using LAMMPS 02Aug23 since the folks at NERSC have used that version for their LAMMPS + pair_allegro installation, but is there a different LAMMPS release you recommend using? The admins on my cluster are also going to fix CUDA 12.4, so I should be able to build against more recent CUDA and related libraries in the next few weeks. Thanks!

Update 13 Sep 2024: The problem and nan pressures persist even when compiling with the latest LAMMPS development branch (i.e., Git commit 2995cb7 from doing …).
Update: I also tried running calculations using the LAMMPS + Kokkos + pair_allegro installation available on NERSC Perlmutter (i.e., the nersc/lammps_allegro:23.08 image). NERSC told me that this image was built using the following Dockerfile:

Dockerfile for NERSC `nersc/lammps_allegro:23.08` image

FROM nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04
WORKDIR /opt
ENV DEBIAN_FRONTEND noninteractive
RUN \
apt-get update && \
apt-get install --yes \
build-essential \
autoconf \
cmake \
flex \
bison \
zlib1g-dev \
fftw-dev \
fftw3 \
apbs \
libicu-dev \
libbz2-dev \
libboost-all-dev \
libgmp-dev \
bc \
libblas-dev \
liblapack-dev \
libfftw3-dev \
automake \
lsb-core \
libxc-dev \
git \
unzip \
clang \
llvm \
gcc \
g++ \
libgsl-dev \
libhdf5-serial-dev \
cmake \
intel-mkl-full \
vim \
python3 \
python3-pip \
mlocate \
wget && \
apt-get clean all
ARG mpich=4.1.1
ARG mpich_prefix=mpich-$mpich
RUN \
wget https://www.mpich.org/static/downloads/$mpich/$mpich_prefix.tar.gz && \
tar xvzf $mpich_prefix.tar.gz && \
cd $mpich_prefix && \
./configure FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch && \
make -j 16 && \
make install && \
make clean && \
cd .. && \
rm -rf $mpich_prefix
RUN /sbin/ldconfig
ENV MPI_PATH=/opt/mpich/install
ENV PATH=$PATH:/opt/mpich/install/bin
ENV PATH=$PATH:/opt/mpich/install/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mpich/install/lib
#RUN which mpicc
#RUN env MPICC=/opt/mpich/install/bin/mpicc python3 -m pip install mpi4py
# Install miniconda
ENV installer=Miniconda3-py39_4.12.0-Linux-x86_64.sh
RUN wget https://repo.anaconda.com/miniconda/$installer && \
/bin/bash $installer -b -p /opt/miniconda3 && \
rm -rf $installer
ENV PATH=/opt/miniconda3/bin:$PATH
RUN pip install numpy scipy matplotlib setuptools
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip install mkl mkl-devel mkl-static mkl-include
RUN pip install ninja
RUN pip install wandb
#Installing lammps
WORKDIR /opt
RUN cd /opt
RUN git clone -b stable_2Aug2023_update2 --depth 1 https://github.com/lammps/lammps.git lammps
RUN git clone -b multicut https://github.com/mir-group/pair_allegro.git pair_allegro
RUN cd /opt/pair_allegro && \
./patch_lammps.sh /opt/lammps
RUN apt-get install --yes clang-format xxd
RUN wget https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.0%2Bcu118.zip
RUN unzip libtorch-cxx11-abi-shared-with-deps-2.0.0+cu118.zip
RUN rm -rf libtorch-cxx11-abi-shared-with-deps-2.0.0+cu118.zip
RUN mv libtorch libtorch-gpu
ENV PATH=$PATH:/opt/libtorch-gpu/bin
ENV PATH=$PATH:/opt/libtorch-gpu/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/libtorch-gpu/lib
ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0"
ENV CONDA_PREFIX="/opt/miniconda3"
ENV PATH=$PATH:/opt/lammps/build/plumed_build-prefix/bin
ENV PATH=$PATH:/opt/lammps/build/plumed_build-prefix/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/lammps/build/plumed_build-prefix/lib
ENV PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/lammps/build/plumed_build-prefix/lib/pkgconfig
ENV PLUMED_KERNEL=/opt/lammps/build/plumed_build-prefix/lib/libplumedKernel.so
WORKDIR /opt/lammps
RUN mkdir build
WORKDIR /opt/lammps/build
RUN cmake -DMKL_INCLUDE_DIR=$CONDA_PREFIX/include -DMKL_LIBRARY=$CONDA_PREFIX/lib -D CMAKE_BUILD_TYPE=Release \
-D CMAKE_PREFIX_PATH=/opt/libtorch-gpu \
-D CMAKE_INSTALL_PREFIX=/opt/lammps/install -D CMAKE_CXX_STANDARD=17 -D CMAKE_CXX_STANDARD_REQUIRED=ON \
-D BUILD_MPI=ON -D CMAKE_CXX_COMPILER=/opt/lammps/lib/kokkos/bin/nvcc_wrapper -D BUILD_SHARED_LIBS=ON \
-D PKG_MANYBODY=ON -D PKG_MOLECULE=ON -D PKG_KSPACE=ON -D PKG_REPLICA=ON -D PKG_REAXFF=ON -D PKG_QEQ=ON \
-D PKG_PHONON=ON -D PKG_ELECTRODE=yes -D PKG_PLUMED=yes -D DOWNLOAD_PLUMED=yes -D PLUMED_MODE=shared \
-D BUILD_SHARED_LIBS=ON -D PKG_KOKKOS=yes -D Kokkos_ARCH_AMPERE80=ON -D Kokkos_ENABLE_CUDA=yes \
-D CMAKE_PREFIX_PATH=/opt/libtorch-gpu ../cmake
RUN make -j 4
RUN make install
ENV PATH=/opt/lammps/install/bin:$PATH
ENV PATH=/opt/lammps/install/lib:$PATH
ENV PATH=/opt/lammps/install/include:$PATH
ENV LD_LIBRARY_PATH=/opt/lammps/install/lib:$LD_LIBRARY_PATH

I am again finding that pressures are not correctly calculated when LAMMPS + pair_allegro is invoked with Kokkos flags.

Slurm script to use containerized LAMMPS on NERSC with Kokkos

#!/bin/bash
#SBATCH --image docker:nersc/lammps_allegro:23.08
#SBATCH --job-name=pressure-test
#SBATCH --account=mXXXX
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=gpu
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH --time=5:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run a MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
# Run LAMMPS
exe="lmp"
input='-k on g 4 -sf kk -pk kokkos newton on neigh full -in input.lammps'
srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter $exe $input
Under NVT conditions when running with Kokkos, the thermo output shows …
and trying to do NPT with Kokkos results in the same error about non-numeric pressure:
However, when running LAMMPS without Kokkos, system pressures are calculated and the NPT simulation completes without issue. It appears that NPT calculations using multiple GPUs are not possible on Perlmutter.

Slurm script to use containerized LAMMPS on NERSC without Kokkos

#!/bin/bash
#SBATCH --image docker:nersc/lammps_allegro:23.08
#SBATCH --job-name=pressure-test
#SBATCH --account=mXXXX
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run a MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
# Run LAMMPS
exe="lmp"
input='-in input.lammps'
srun --cpu-bind=cores --gpu-bind=none --module mpich shifter $exe $input

It appears that you used Perlmutter GPU nodes for your scaling experiments in the SC'23 conference paper and A100 GPUs in the scaling testing of Allegro in the Nat. Comm. paper. Were the LAMMPS + pair_allegro builds for these experiments done before mir-allegro and pair_allegro got support for computing the virial tensor? Or was there a specific build configuration you used on Perlmutter to get virial tensor computation and thus support for NPT? Thanks!
I'm trying to do NPT calculations in LAMMPS using pair_allegro, but the Allegro model I trained is predicting -nan for the system pressure, so the NPT run fails. If I run under NVE or NVT, including the press property in the LAMMPS thermo logging, I see output like this: …

If I attempt a NPT calculation, like this: …

I immediately get this error: …
I am training on an ASE dataset with stress information stored in units of energy / length^3, and I can confirm that when using nequip-evaluate for inference on this same deployed model, I do get predicted stresses in the output file.

This problem happens no matter whether I use default_dtype: float32 in the config (and pair_style allegro3232 in the LAMMPS script) or default_dtype: float64 in the config (and pair_style allegro in the LAMMPS script). I am using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (CUDA 11.3, cuDNN 8.2.0). I compiled LAMMPS 02Aug23 using pair_allegro commit 20538c9, which is the commit introducing support for stress. Details of my compilation of LAMMPS, an example of my training config (except the default_dtype setting), and an example NVT LAMMPS input script are here.

I am forcing deletion of any overlapping atoms prior to the NPT run, and do not see any indication when running under NVT or NVE that atoms are too close together, have very high forces, or are otherwise causing the simulation to go unstable. If I switch my LAMMPS input to use pair_style lj/cut, I am able to observe pressures in the thermo output.

Is there something obvious I'm missing about how to get pair_allegro to pass the stress predictions from my Allegro models to LAMMPS?
Thanks!
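(A minimal sketch of one more sanity check, assuming NequIP's ASE calculator exposes stress for models deployed with StressForceOutput and assuming the hypothetical train_val.traj file name: compare the model's predicted stress on a training frame against the reference stress stored in the dataset.)

from ase.io import read
from nequip.ase import NequIPCalculator

frame = read("train_val.traj", index=0)   # hypothetical training frame
ref_stress = frame.get_stress()           # reference stress stored in the dataset (eV/Angstrom^3)

frame.calc = NequIPCalculator.from_deployed_model(
    "hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc.pth",
    device="cpu",
    species_to_type_name={"H": "H", "O": "O"},
)
pred_stress = frame.get_stress()          # prediction from the deployed model via ASE

# Finite, similar values here would suggest the model itself is fine and the
# -nan pressure comes from the LAMMPS/Kokkos side.
print("reference:", ref_stress)
print("predicted:", pred_stress)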