This repository contains bash scripts for launching, orchestrating, managing, and monitoring jobs on NRL's RCAC clusters. RCAC uses the Simple Linux Utility for Resource Management (SLURM), a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.
This README provides an overview of the prerequisites for using the cluster, a description of all provided scripts, and a few examples and common utilities.
Please verify you have access to the cluster before attempting to log in!
Set up ssh keys on your machine using the following command:
ssh-keygen -t rsa
To copy this key to an RCAC cluster, use the following command:
ssh-copy-id $USER@$CLUSTER_NAME.rcac.purdue.edu
Verify your credentials using BoilerKey, and you're good to go! Logging in to the cluster now requires just your ssh key instead of BoilerKey+Duo!
NOTE: Windows users may use the following command in PowerShell to emulate the function of ssh-copy-id:
type $PATH_TO_.SSH\id_rsa.pub | ssh $USER@$CLUSTER_NAME.rcac.purdue.edu "cat >> .ssh/authorized_keys"
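As an optional sanity check, the following command should print the login node's hostname without an interactive prompt if the key was copied correctly (this is just an illustration, not part of the provided scripts):

```bash
# Should succeed using only the ssh key, with no BoilerKey+Duo prompt
ssh $USER@$CLUSTER_NAME.rcac.purdue.edu hostname
```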
Clone this repository using
git clone $REPO_URL
Navigate to the repo and perform initial setup using
cd $PATH_TO_REPO
bash setup.bash
NOTE: For path invariance, the setup script will automatically move the cloned repo to your home directory (/home/$USER).
RCAC clusters require use of the IT-managed conda module loadable using Lmod. While installing conda locally in your own directory (/home/$USER/) is possible, environments installed using your own conda installation will not be importable in code, i.e., they will not work.
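For illustration, a typical session might look like the sketch below. The module name (anaconda) is an assumption; check `module avail` for the exact name and version on your cluster:

```bash
# Load the IT-managed conda module (module name is an assumption; check `module avail`)
module load anaconda

# Sanity check: conda should resolve to the managed installation,
# not to a personal install under /home/$USER
which conda
conda env list
```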
To transfer an existing environment from another machine onto an RCAC cluster, first export the environment as a yml file on the source machine using
conda env export -n $ENVNAME > $FILENAME.yml
and then re-install on the cluster using the provided bash script:
bash conda_env_installer.bash -f $YML_FILENAME
A detailed list of command line args accepted by this script may be found using
bash conda_env_installer.bash -h
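As a hedged end-to-end example, assuming an environment named myenv on the source machine (the name and filenames below are purely illustrative):

```bash
# On the source machine: export the environment to a yml file
conda env export -n myenv > myenv.yml

# Copy the yml file to the cluster (hostname placeholders follow the ones used above)
scp myenv.yml $USER@$CLUSTER_NAME.rcac.purdue.edu:~/

# On the cluster: recreate the environment using the provided installer script
bash conda_env_installer.bash -f myenv.yml
```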
A template job submission script is provided in the file jobsubmissionscript.sub. A detailed list of command line args accepted by this script may be found using
bash jobsubmissionscript.sub -h
NOTE: It is not necessary for a job submission script to accept command line arguments. The template file provides this functionality just for convenience.
A Job Submission Script is supposed to do three main things (a minimal sketch is given after this list):
- Load all necessary Lmod modules. A list of available modules may be found using:
module avail
- Activate the necessary conda environment
conda activate $ENV_NAME
- Call the job script
python helloWorld.py
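Putting these three steps together, a minimal job submission script might look like the sketch below. The module names and environment name are assumptions and should be adapted to your job:

```bash
#!/bin/bash
# Minimal sketch of a job submission script (module/environment names are assumptions)

# 1. Load all necessary Lmod modules (check `module avail` for exact names)
module load anaconda
module load cuda

# 2. Activate the necessary conda environment
conda activate $ENV_NAME

# 3. Call the job script
python helloWorld.py
```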
NOTE: SLURM provides functionality to send an OS signal to a job $n$ seconds before termination ($n \in [0, 65535]$). This functionality is enabled by default (see the minimum working example given in the script helloWorld.py).
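The script helloWorld.py is the authoritative example for this repository; purely as an illustration of the mechanism, the bash sketch below shows how a job could request such a signal from SLURM and trap it. The signal name and the 97-second lead time are assumptions, not the repository defaults:

```bash
#!/bin/bash
#SBATCH --signal=B:USR1@97   # ask SLURM to send SIGUSR1 to the batch shell 97 s before the time limit

# Illustrative handler: save whatever state is needed before the job is killed
save_checkpoint() {
    echo "Received SIGUSR1: saving checkpoint and metadata before termination"
}
trap save_checkpoint USR1

# Run the workload in the background and wait, so the trap can fire while it runs
python helloWorld.py &
wait
```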
Jobs may be submitted for execution using either the srun or the sbatch command. Both these commands accept the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
If you use srun in the background with the & sign, then you remove the 'blocking' feature of srun, which becomes interactive but non-blocking. It is still interactive though, meaning that the output will clutter your terminal, and the srun processes are linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending on whether they use stdout or not, basically). They will also be killed if the machine to which you connect to submit jobs is rebooted.
If you use sbatch, you submit your job and it is handled by SLURM; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.
Use of the sbatch command is recommended on RCAC clusters. For convenience, we wrap the sbatch command in a wrapper file, a template of which is provided here as joblauncher.bash. Again, it is not necessary for the job launcher script to accept command line arguments. The template file provides this functionality just for convenience.
More info on the srun and sbatch commands may be found here: srun and sbatch
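For a side-by-side feel of the two commands, consider the sketch below; the flag values are illustrative only:

```bash
# Interactive and blocking: output streams to your terminal
srun --partition=ai --gres=gpu:1 --time=00:30:00 python helloWorld.py

# Batch and non-blocking: output goes to a file, the job survives you disconnecting
sbatch --partition=ai --gres=gpu:1 --time=00:30:00 jobsubmissionscript.sub
```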
Jobs may be monitored using the provided monitor.bash file. A detailed list of command line args accepted by this script may be found using
bash monitor.bash -h
For more info about the commands used by this script, visit squeue.
The recommended file organisation on RCAC clusters is as follows:
- Code files and other important results: Your home directory (/home/$USER)
- Datasets and other large files: Your scratch directory (/scratch/$CLUSTER_NAME/$USER/)
- Temporary/code-generated files: The temporary directory (/tmp/). For the dos and don'ts of /tmp/, read this
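As a hedged illustration of this layout (the dataset name is hypothetical):

```bash
# Keep code and results under $HOME; stage large datasets in your scratch directory
cp -r ~/myDataset /scratch/$CLUSTER_NAME/$USER/
```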
By default, backups in FORTRESS are saved in your FORTRESS home directory (/home/$USER/). For convenience, we impose the following path organisation:
- All tar archives are to be saved in /home/$USER/archives/
- All other files (note that these should only be large files such as datasets, model weights, etc.) are to be saved in /home/$USER/largeFiles/
To backup to FORTRESS, use
bash backup.bash $FILES_TO_BACKUP
NOTE: The backup script accepts wildcards (the * character), i.e., if you want to back up multiple files named file1, file2, ..., filen, then all of them can be backed up in a single call to backup.bash using
bash backup.bash file*
A detailed list of command line args accepted by this script may be found using:
bash backup.bash -h
If you have never used tar/server keytabs/sftp/tape archives before, the provided script is designed to guide you through the steps needed to backup data to FORTRESS. Simply follow the instructions on screen!
To retrieve backed-up file(s), use the script retrieve_backup.bash. A detailed list of command line args accepted by this script may be found using:
bash retrieve_backup.bash -h
NOTE: This script provides auto-untarring functionality when retrieving tar archives.
Consider the following case:
The file setup.bash has been executed and all paths have been set up correctly (note that at this point, this repository will be located in /home/$USER). The script file file.py, located in the directory /home/$USER/test, is to be executed in a conda environment named env.
Let us assume that the script requires 2 GPU cards and as many CPUs as possible (14 * N_GPU = 28; for info on why 28, read the help message of joblauncher.bash).
Let us also assume that the script is to be run on the "ai" partition and that the user estimates a maximum runtime of 2.5 days. Additionally, the user determines that, in case the job runs longer than the requested 2.5 days, the necessary checkpoint and metadata saving will take ~97 seconds.
Given these considerations, the job should be launched using the following command:
bash joblauncher.bash -j jobsubmissionscript.sub -t python -d ~/test/ -f file.py -e env -g 2 -c 28 -p ai -T 2-12:00:00 -s 97
Keeping all other conditions the same, if the script were to change from file.py to file.bash, then the job should be launched using
bash joblauncher.bash -j jobsubmissionscript.sub -t bash -d ~/test/ -f file.bash -e env -g 2 -c 28 -p ai -T 2-12:00:00 -s 97
NOTE: In most cases, a majority of the supported command line arguments will be left at their default values. All the arguments supported by the file joblauncher.bash are written out explicitly in the above example, just for convenience.
To display a list of your active jobs (running/enqueued), use
squeue -u $USER
If your job is waiting in the SLURM queue, you can get an estimate of its start time using
scontrol show job $JOB_ID | grep StartTime
To cancel a running/pending job, use
scancel $JOB_ID
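scancel also accepts a user filter, which cancels all of your own jobs at once:

```bash
# Cancel every running/pending job belonging to the current user
scancel -u $USER
```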