Home
This wiki documents the steps required to run code for the Short Baseline Neutrino (SBN) program on the Polaris supercomputer at Argonne National Laboratory.
SBN-related code is built on the LArSoft framework. The libraries required to build and run LArSoft are copied from manifests available at https://scisoft.fnal.gov/ and installed in a Scientific Linux 7 container.
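As an illustration, installing a bundle from a SciSoft manifest is typically done with the `pullProducts` script; the sketch below is only indicative (the bundle and version match the `sbndcode` release used elsewhere on this page, but the exact invocation and install location should be checked against the SciSoft instructions):

```sh
# Sketch of installing an SBND LArSoft bundle from a SciSoft manifest,
# run inside the Scientific Linux 7 container. Versions are illustrative.
wget https://scisoft.fnal.gov/scisoft/bundles/tools/pullProducts
chmod +x pullProducts
./pullProducts ${LARSOFT_ROOT} slf7 sbnd-v09_75_03_02 e20 prof
```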
Once LArSoft is installed, experiment-specific software can be loaded via `ups` in the same way as on Fermilab virtual machines, e.g.,

```sh
source ${LARSOFT_ROOT}/setup
setup sbndcode v09_75_03_02 -q e20:prof
```
Disk resources required to run the code are divided between two filesystems available on Polaris: `eagle` and `grand`. The `grand` filesystem contains compiled code and input files, while `eagle` is used for outputs and transfers.
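For reference, the corresponding project directories used throughout this page are:

```sh
ls /lus/grand/projects/neutrinoGPU   # compiled code, input files, shared software
ls /lus/eagle/projects/neutrinoGPU   # job outputs and transfers
```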
- Request a user account on Polaris with access to the `neutrinoGPU` project.
- Once logged in, create a local Conda environment and install `parsl`:

  ```sh
  module load conda/2023-10-04
  conda activate
  conda create -n sbn python=3.10
  conda activate sbn
  pip install parsl
  ```

- Clone the sbnd_parsl repository to your home directory (a sketch of the clone step follows this list). Modify the `entry_point.py` program to adjust the list of `.fcl` files, change submission configuration options, etc.
- Submit jobs by running the `entry_point.py` program, e.g.

  ```sh
  python sbnd_parsl/entry_point.py -o /lus/eagle/projects/neutrinoGPU/my-production
  ```
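A sketch of the clone step referenced above, assuming a hypothetical repository URL (substitute the actual location of `sbnd_parsl`):

```sh
cd ~
# Hypothetical URL — replace with the real sbnd_parsl repository location.
git clone https://github.com/SBNSoftware/sbnd_parsl.git
```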
You can test software within an interactive job. To begin an interactive job, create a script called `interactive_job.sh` with the following contents and run it:

```sh
#!/bin/sh
# Start an interactive job
ALLOCATION="neutrinoGPU"
FILESYSTEM="home:grand:eagle"
qsub -I -l select=1 -l walltime=0:45:00 -q debug \
    -A "${ALLOCATION}" -l filesystems="${FILESYSTEM}"
```

Once a slot on the `debug` queue becomes available, you will be automatically connected to a prompt within the interactive job.
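For example, once the prompt appears you can set up the software and run a short test. This is only a sketch: it assumes you have already entered the SL7 container holding the LArSoft install (container launch not shown), and the fcl name is illustrative:

```sh
# Inside the interactive job, inside the SL7 container.
source ${LARSOFT_ROOT}/setup
setup sbndcode v09_75_03_02 -q e20:prof
lar -c prodgenie_nu_spill_tpc_sbnd.fcl -n 1   # single-event smoke test
```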
- The program `pbsq` is installed at `/lus/grand/projects/neutrinoGPU/software/pbsq`. It produces more readable output about job status and can be invoked with `pbsq -f neutrinoGPU`.
- Once your job is running you can `ssh` into the worker node. Get the node name with `qstat -u $(whoami)` or via `pbsq`; it should start with "x". Once connected, you can check memory usage and other metrics with, e.g., `cat /proc/meminfo`.
- Individual job history can be checked with `qstat -xf <jobid>`.
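  Putting these commands together, a typical check looks like this (the node name and job ID below are illustrative):

  ```sh
  qstat -u $(whoami)     # list your jobs; note the job ID and state
  pbsq -f neutrinoGPU    # or get a friendlier summary, including the node
  ssh x3006c0s13b0n0     # connect to the worker node (names start with "x")
  cat /proc/meminfo      # check memory usage on the node
  exit
  qstat -xf 1234567      # full record for a finished job
  ```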
- You can log in to Polaris once via an ssh tunnel and allow subsequent `ssh` connections to proceed without re-authenticating. Place the following function in your computer's local `.bashrc` or `.zshrc` file:

  ```sh
  connect_polaris () {
      # macOS (BSD-based ps)
      # s=$(ps -Ao user,pid,%cpu,%mem,vsz,rss,tt,stat,start,time,command \
      #     | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris)
      # Linux
      s=$(ps -aux | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris)
      if [ -z "$s" ]; then
          echo "Opening background connection to Polaris"
          ssh -fNTY "$@" ${USER}@polaris.alcf.anl.gov
      else
          ssh -Y "$@" ${USER}@polaris.alcf.anl.gov
      fi
  }
  ```
- If `parsl` ends immediately with exit status 0 or crashes, it is usually a job queue issue. The first scenario usually means `parsl` has put jobs into the queue and exited, while the second could mean there are outstanding held jobs that should be removed manually with `jobsub_rm`.
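  A quick way to check for held jobs with standard PBS commands (a sketch; held jobs show `H` in the state column):

  ```sh
  qstat -u $(whoami)    # look for jobs with state "H" before resubmitting
  ```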
- Part of running CORSIKA requires copying database files. The default copy method is the IFDH tool provided by Fermilab, but this has issues on Polaris. Adding the line `physics.producers.corsika.ShowerCopyType: "DIRECT"` to the fcl file responsible for running CORSIKA suppresses the IFDH copy and uses the system default instead.
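  One way to apply the override without editing files in the release is a small local fcl that includes the standard configuration; the base fcl name below is a hypothetical placeholder:

  ```sh
  cat > corsika_direct_copy.fcl << 'EOF'
  #include "prodcorsika_proton_sbnd.fcl"
  physics.producers.corsika.ShowerCopyType: "DIRECT"
  EOF
  lar -c corsika_direct_copy.fcl -n 1
  ```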
- The `sbndata` package is not listed in the sbnd distribution manifest provided by SciSoft, but it is needed to produce CAF files with flux weights.
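  Until it appears in the manifest, `sbndata` generally has to be pulled from SciSoft by hand and unpacked alongside the other `ups` products. The version and tarball name below are assumptions; check the SciSoft packages page for the real ones:

  ```sh
  cd ${LARSOFT_ROOT}
  # Hypothetical version/tarball name — confirm on https://scisoft.fnal.gov/
  curl -O https://scisoft.fnal.gov/scisoft/packages/sbndata/v01_05/sbndata-01.05-noarch.tar.bz2
  tar -xjf sbndata-01.05-noarch.tar.bz2
  setup sbndata v01_05
  ```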