
This wiki documents the steps required to run code for the Short Baseline Neutrino (SBN) program on the Polaris supercomputer at Argonne National Laboratory.

Overview

SBN-related code is built on the LArSoft framework. The system libraries required to build and run LArSoft and related packages are provided using a Scientific Linux 7 container. Pre-compiled versions of LArSoft and experiment-specific software are downloaded from manifests available at https://scisoft.fnal.gov/.

Once LArSoft is installed, experiment-specific software can be loaded via UPS in the same way as on Fermilab virtual machines, e.g.,

source ${LARSOFT_ROOT}/setup
setup sbndcode v09_75_03_02 -q e20:prof

Disk resources required to run the code are divided into two filesystems available on Polaris: eagle and grand. The grand filesystem contains compiled code and input files, while eagle is used for outputs and transfers.

Getting Started

  1. Request a user account on Polaris with access to the neutrinoGPU project.
  2. Once logged in, create a local Conda environment and install parsl:
    module load conda/2023-10-04
    conda activate
    conda create -n sbn python=3.10
    conda activate sbn
    pip install parsl
    
  3. Clone the sbnd_parsl repository to your home directory. Modify the entry_point.py program to adjust the list of .fcl files, change submission configuration options, etc.
  4. Submit jobs by running the entry_point.py program, e.g. python sbnd_parsl/entry_point.py -o /lus/eagle/projects/neutrinoGPU/my-production
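
Once the environment exists, subsequent logins only require reloading the conda module and re-activating it (module load conda/2023-10-04 followed by conda activate sbn) before submitting more workflows.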

Developing and Testing on Polaris

Getting UPS Products

The pullProducts script handles downloading and extracting tarballs of pre-compiled Fermilab software. These software distributions can then be loaded via UPS. As an example, we can download the SBND software distribution and its dependencies into our project's larsoft directory via:

./pullProducts /lus/grand/projects/neutrinoGPU/software/larsoft/ slf7 sbnd-v09_78_00 e20 prof

The argument sbnd-v09_78_00 is a software bundle provided by SciSoft at Fermilab. A list of available bundles can be found at https://scisoft.fnal.gov/scisoft/bundles/.

Interactive Jobs

You can test software within an interactive job. To begin an interactive job, create a script called interactive_job.sh with the following contents and run it:

#!/bin/sh
# Start an interactive job

ALLOCATION="neutrinoGPU"
FILESYSTEM="home:grand:eagle"

qsub -I -l select=1 -l walltime=0:45:00 -q debug \
        -A "${ALLOCATION}" -l filesystems="${FILESYSTEM}"

Once a slot on the debug queue becomes available, you will be automatically connected to a prompt within the interactive job.

The following script executes a single .fcl file by setting up LArSoft in a singularity container:

#!/bin/bash
# Start singularity and run a single .fcl file; intended to be run from inside an interactive job

LARSOFT_DIR="/lus/grand/projects/neutrinoGPU/software/larsoft"
SOFTWARE="sbndcode"
VERSION="v09_78_00"
# SOFTWARE="icaruscode"
# VERSION="v09_78_04"
QUAL="e20:prof"

CONTAINER="/lus/grand/projects/neutrinoGPU/software/slf7.sif"

module load singularity
singularity run -B /lus/eagle/ -B /lus/grand/ ${CONTAINER} << EOF
source ${LARSOFT_DIR}/setup
setup ${SOFTWARE} ${VERSION} -q ${QUAL}
lar -c ${@}
EOF
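
For example, if the script above is saved as run_fcl.sh (the file name here is just an illustration), it can be invoked from inside the interactive job as ./run_fcl.sh <your_config>.fcl; the arguments are passed straight through to lar -c.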

Running Jobs

With a properly configured conda environment, you can submit your jobs from the login nodes by running the parsl workflows as regular Python programs, e.g.,

~: python workflow.py

The specific options of your job submission can be defined within your workflow program. The sbnd_parsl code provides some functions for configuring Parsl to run on Polaris.
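
As a minimal sketch (not the actual sbnd_parsl configuration), a Parsl Config for Polaris can be assembled from Parsl's HighThroughputExecutor and PBSProProvider. The account, queue, walltime, filesystems, and worker_init values below are placeholders taken from the examples on this page and should be adjusted to your own allocation and conda environment:

# Illustrative Parsl configuration sketch for Polaris
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import MpiExecLauncher
from parsl.providers import PBSProProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="polaris_htex",
            provider=PBSProProvider(
                account="neutrinoGPU",      # project allocation
                queue="debug",              # debug, debug-scaling, or prod
                walltime="0:45:00",
                nodes_per_block=1,
                # extra PBS directives, e.g. the filesystems request
                scheduler_options="#PBS -l filesystems=home:grand:eagle",
                # must set up the same parsl version used on the login node
                worker_init="module load conda/2023-10-04; conda activate sbn",
                launcher=MpiExecLauncher(),
            ),
        )
    ],
)

Passing an object like this to parsl.load() at the top of a workflow script is what lets Parsl request nodes through PBS and run its apps on them.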

The main resource for job-related information is https://docs.alcf.anl.gov/polaris/running-jobs/. Ideally, you will be able to test your code using the debug queue, which accepts short jobs on one or two nodes. Once your code works on the debug queue, the debug-scaling and prod queues may be used for larger-scale productions.

Jobs with misconfigured resource requests, e.g., a debug queue job requesting a walltime longer than 1 hour or more than 2 nodes, will not run. Consult the link above for the appropriate limits for each queue. Note that prod is a routing queue: your job will automatically be assigned to a specific small, medium, or large execution queue depending on the resources requested.
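
When jobs are submitted through Parsl rather than with qsub directly, these limits correspond to the queue, walltime, and nodes_per_block arguments of the provider (see the sketch above), so moving a tested workflow from debug to a production queue is mostly a matter of changing those values.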

Tips and Tricks

  • The program pbsq is installed at /lus/grand/projects/neutrinoGPU/software/pbsq. It produces more readable output about job status and can be invoked with pbsq -f neutrinoGPU.

  • Once your job is running you can ssh into the worker node. Find the node name with qstat -u $(whoami) or via pbsq; it should start with "x". Once connected, you can check memory usage and other metrics with, e.g., cat /proc/meminfo.

  • Individual job history can be checked with qstat -xf <jobid>.

  • You can open a background ssh connection to Polaris once and let later ssh sessions reuse it without re-authenticating (this relies on ssh connection multiplexing, e.g. ControlMaster/ControlPath, being configured in your ~/.ssh/config). Place the following function in the .bashrc or .zshrc file on your local machine:

    connect_polaris () {
        # macOS (BSD-based ps)
        # s=$(ps -Ao user,pid,%cpu,%mem,vsz,rss,tt,stat,start,time,command \
        #     | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris)
        # Unix
        s=$(ps -aux | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris) 
        if [ -z "$s" ]; then
            echo "Opening background connection to Polaris"
            ssh -fNTY "$@" ${USER}@polaris.alcf.anl.gov
        else
            ssh -Y "$@" ${USER}@polaris.alcf.anl.gov
        fi
    }
  • If parsl exits immediately with status 0 or crashes, it is usually a job-queue issue. The first case usually means parsl has placed jobs in the queue and exited, while the second may mean there are outstanding held jobs that should be removed manually (e.g., with qdel).

  • To get additional UPS products that are not listed in a distribution's manifest, you can use a local manifest with pullProducts. Start by downloading the manifest for a software distribution by passing the -M flag:

    ./pullProducts -M /lus/grand/projects/neutrinoGPU/software/larsoft/ slf7 icarus-v09_78_04 e20 prof
    

    This will create a file called icarus-09.78.04-Linux64bit+3.10-2.17-e20-prof_MANIFEST.txt. You can modify this file to include additional products. Below, we add the specific versions of larbatch and icarus_data required by icaruscode, which are not listed in the manifest provided by SciSoft:

    icarus_signal_processing v09_78_04       icarus_signal_processing-09.78.04-slf7-x86_64-e20-prof.tar.bz2
    icarusalg            v09_78_04       icarusalg-09.78.04-slf7-x86_64-e20-prof.tar.bz2
    icaruscode           v09_78_04       icaruscode-09.78.04-slf7-x86_64-e20-prof.tar.bz2
    icarusutil           v09_75_00       icarusutil-09.75.00-slf7-x86_64-e20-prof.tar.bz2
    larbatch             v01_58_00       larbatch-01.58.00-noarch.tar.bz2                             -f NULL
    icarus_data          v09_79_02       icarus_data-09.79.02-noarch.tar.bz2                          -f NULL
    

    You can now re-run the pullProducts command with the -l flag to have the script use the local manifest instead. Note that the manifest file name is deduced automatically from the bundle, OS, and qualifier arguments, so do not rename the downloaded manifest file.

    ./pullProducts -l /lus/grand/projects/neutrinoGPU/software/larsoft/ slf7 icarus-v09_78_04 e20 prof
    

Things to watch out for

  • Part of running CORSIKA requires copying database files. The default method for copying is to use the IFDH tool provided by Fermilab, but this has issues on Polaris. Adding the line

    physics.producers.corsika.ShowerCopyType: "DIRECT"
    

    to the fcl file responsible for running CORSIKA suppresses the IFDH copy and uses the system default instead.

  • The sbndata package is not listed in the sbnd distribution manifest provided by SciSoft, but it is needed to produce CAF files with flux weights. One way to obtain it is to add it to a local manifest, as described above.

  • Worker nodes can't access your home directory, so make sure your job outputs are written to the eagle or grand filesystems.

  • Both the login nodes and the worker nodes must use the same version of parsl. The parsl version on the worker nodes is determined by the worker_init line in the setup of the provider class (sbnd_parsl/utils.py). Specific versions of parsl can be installed in your Python environment on the login nodes via, e.g., `pip install --force-reinstall -v "parsl==2023.10.04"`.

Further Reading

https://docs.alcf.anl.gov/polaris/getting-started/

https://parsl.readthedocs.io/en/stable/
