Utility Bash Scripts for use on NRL's RCAC Clusters

This repository contains bash scripts for launching, orchestrating, managing, and monitoring jobs on NRL's RCAC clusters. RCAC uses the Simple Linux Utility for Resource Management (SLURM), a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.

This README provides an overview of the prerequisites for using the cluster, a description of all provided scripts, and a few examples and common utilities.


Prerequisites

Access to the Cluster

Please verify you have access to the cluster before attempting to log in!

SSH Keys

Set up SSH keys on your machine using the following command:
ssh-keygen -t rsa

To copy this key to an RCAC cluster, use the following command:

ssh-copy-id $USER@$CLUSTER_NAME.rcac.purdue.edu

Verify your credentials using BoilerKey when prompted, and you're good to go! Logging in to the cluster now requires just your SSH key instead of BoilerKey+Duo.

NOTE: Windows users may use the following command in PowerShell to emulate the function of ssh-copy-id:

type $PATH_TO_.SSH\id_rsa.pub | ssh $USER@$CLUSTER_NAME.rcac.purdue.edu "cat >> .ssh/authorized_keys"
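
Optionally, you can add an entry to your SSH config so that logging in only needs a short alias. This is a minimal sketch; the alias, key path, and cluster name below are placeholders for you to fill in.

# ~/.ssh/config (the alias "mycluster" is just an example)
Host mycluster
    HostName $CLUSTER_NAME.rcac.purdue.edu
    User $USER
    IdentityFile ~/.ssh/id_rsa

After adding this entry, you can log in with just ssh mycluster.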

Cloning and Setup

Clone this repo into your user space on the cluster by copying the SSH URL and then running
git clone $REPO_URL

Navigate to the repo and perform initial setup using

cd $PATH_TO_REPO
bash setup.bash

NOTE: For path invariance, the setup script will automatically move the cloned repo to your home directory (/home/$USER)


Scripts

Setup Conda

RCAC clusters require the IT-managed conda module, which is loadable using Lmod. While installing conda locally in your own directory (/home/$USER/) is possible, environments installed using your own conda installation will not be importable in code, i.e., they will not work.
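
For reference, loading the IT-managed module typically looks like the following; the exact module name varies by cluster, so treat "anaconda" below as an assumption and check module avail first.

# list conda-related modules available on this cluster
module avail conda

# load the IT-managed conda module (name is cluster-dependent)
module load anaconda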

To transfer existing environments from other machines onto RCAC clusters, export the environment as a YAML file on the other machine using

conda env export -n $ENV_NAME > $FILENAME.yml

and then re-install on the cluster using the provided bash script:

bash conda_env_installer.bash -f $YML_FILENAME

A detailed list of command line args accepted by this script may be found using

bash conda_env_installer.bash -h
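
Putting the export and install steps together, a typical end-to-end transfer might look like this (the scp destination path is only an illustrative assumption):

# on the other machine: export the environment and copy the YAML file to the cluster
conda env export -n $ENV_NAME > $FILENAME.yml
scp $FILENAME.yml $USER@$CLUSTER_NAME.rcac.purdue.edu:~/

# on the cluster: recreate the environment from the YAML file
bash conda_env_installer.bash -f $FILENAME.yml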

Job Submission Script

Launching a job requires the creation of a "Job Submission Script", a template of which is provided here as jobsubmissionscript.sub. A detailed list of command line args accepted by this script may be found using
bash jobsubmissionscript.sub -h

NOTE: It is not necessary for a job submission script to accept command line arguments. The template file provides this functionality just for convenience.

A Job Submission Script should do three main things (a minimal sketch is given after this list):

  1. Load all necessary Lmod modules. A list of available modules may be found using:
    module avail
    
  2. Activate the necessary conda environment
    conda activate $ENV_NAME
    
  3. Call the job script
    python helloWorld.py
    

    NOTE: SLURM provides functionality to send an OS signal to a job n seconds before termination (n $\in [0, 65535]$). This functionality is enabled by default (see the minimum working example given in the script helloWorld.py).
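
For orientation, a stripped-down job submission script might look like the sketch below. The resource requests, module name, and environment name are placeholders, and the provided template jobsubmissionscript.sub is more elaborate.

#!/bin/bash
#SBATCH --job-name=hello            # job name shown by squeue
#SBATCH --gres=gpu:1                # example resource request (placeholder)
#SBATCH --time=01:00:00             # walltime limit (placeholder)
#SBATCH --signal=B:SIGUSR1@97       # send SIGUSR1 97 seconds before termination

# 1. Load all necessary Lmod modules (module name is a cluster-dependent placeholder)
module load anaconda

# 2. Activate the necessary conda environment
conda activate $ENV_NAME

# 3. Call the job script
python helloWorld.py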

More information on job submission scripts, specifically for RCAC clusters, may be found here.

Launching a Job

Jobs are launched using either the srun or the sbatch command. Both commands accept the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and cannot run other commands until it finishes), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).

If you run srun in the background with the & sign, you remove the blocking behaviour, but srun remains interactive: its output will still clutter your terminal, and the srun processes stay linked to your terminal session. If you disconnect, you will lose control over them, or they may be killed (depending largely on whether they use stdout), and they will be killed if the machine from which you submitted the jobs is rebooted.

If you use sbatch, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc., with no consequence, because your job is no longer linked to a running process.

Use of the sbatch command is recommended on RCAC clusters. For convenience, we wrap sbatch in a launcher script, a template of which is provided here as joblauncher.bash. Again, it is not necessary for the job launcher script to accept command line arguments; the template file provides this functionality just for convenience.
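
As a quick illustration of the two submission styles (the srun options below are placeholders; your cluster may also require an account and partition to be specified):

# batch (recommended): submit and return immediately; output goes to a file such as slurm-<jobid>.out
sbatch jobsubmissionscript.sub

# interactive: blocks your terminal and streams output until the job finishes
srun --ntasks=1 --time=00:10:00 --pty bash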

More info on the srun and sbatch commands may be found here: srun and sbatch

Monitoring Jobs

Launched jobs can be monitored using the monitor.bash file. A detailed list of command line args accepted by this script may be found using
bash monitor.bash -h

For more info about the commands used by this script, visit squeue.
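
If you prefer to query SLURM directly instead of going through the wrapper, a typical squeue call looks like this (the format string is just an illustration, not necessarily what monitor.bash uses):

# show your jobs with job ID, partition, name, state, elapsed time, and node list/reason
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.10M %R"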

Backing Up Files

The recommended file organisation on RCAC clusters is as follows:
  1. Code files and other important results: your home directory (/home/$USER)
  2. Datasets and other large files: your scratch directory (/scratch/$CLUSTER_NAME/$USER/)
  3. Temporary/code-generated files: the temporary directory (/tmp/). For the dos and don'ts of /tmp/, read this

It is advisable to back up all code, result, and render files. Files can be backed up to FORTRESS, a tape-based backup system managed by Purdue IT. Since FORTRESS relies on tape, it is recommended to group large files into a tar archive before pushing them to FORTRESS. The provided script can do this automatically.
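
For context, the grouping step that backup.bash performs is conceptually similar to the following (the archive name is a placeholder, and the script also handles the transfer to FORTRESS for you):

# bundle many files into one archive so FORTRESS sees a single large object
tar -cvf my_backup_$(date +%Y%m%d).tar $FILES_TO_BACKUP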

By default, backups in FORTRESS are saved in your FORTRESS home directory (/home/$USER/). For convenience, we impose the following path organisation:

  1. All tar archives are to be saved in /home/$USER/archives/
  2. All other files (note that these should only be large files such as datasets, model weights, etc.) are to be saved in /home/$USER/largeFiles/

You are free to use whatever directory structure you want inside these two directories.

To backup to FORTRESS, use

 bash backup.bash $FILES_TO_BACKUP

NOTE: The backup script accepts wildcards (the * character), i.e., if you want to back up multiple files named file1, file2, ..., fileN, then all of them can be backed up in a single call to backup.bash using

 bash backup.bash file*

A detailed list of command line args accepted by this script may be found using:

bash backup.bash -h

If you have never used tar, server keytabs, sftp, or tape archives before, the provided script is designed to guide you through the steps needed to back up data to FORTRESS. Simply follow the instructions on screen!

Retrieving Backups

To retrieve backed-up file(s), use the script retrieve_backup.bash. A detailed list of command line args accepted by this script may be found using:
bash retrieve_backup.bash -h

NOTE: This script provides auto-untarring functionality when retrieving tar archives.


Examples

Consider the following case:

The file setup.bash has been executed and all paths have been set up correctly (note that at this point, this repository will be located in /home/$USER). The script file file.py, located in the directory /home/$USER/test, is to be executed in a conda environment named env.

Let us assume that the script requires 2 GPU cards and as many CPUs as possible (14 * N_GPU = 28; for info on why 28, read the help message of joblauncher.bash).

Let us also assume that the script is to be run on the "ai" partition and that the user estimates a maximum runtime of 2.5 days. Additionally, the user determines that, in case the job runs longer than 2.5 days, the necessary checkpoint and metadata saving will take ~97 seconds.

Given these considerations, the job should be launched using the following command:

bash joblauncher.bash -j jobsubmissionscript.sub -t python -d ~/test/ -f file.py -e env -g 2 -c 28 -p ai -T 2-12:00:00 -s 97

Keeping all other conditions the same, if the script were to change from file.py to file.bash, then the job should be launched using

bash joblauncher.bash -j jobsubmissionscript.sub -t bash -d ~/test/ -f file.bash -e env -g 2 -c 28 -p ai -T 2-12:00:00 -s 97

NOTE: In most cases, a majority of the supported command line arguments can be left at their default values. All the arguments supported by joblauncher.bash are spelled out explicitly in the examples above just for illustration.


Common Utilities

To display a list of your active jobs (running/queued), use
squeue -u $USER

If your job is waiting in the SLURM queue, you can get an estimate of its start time using

scontrol show job $JOB_ID | grep StartTime

To cancel a running/pending job, use

scancel $JOB_ID
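
To cancel all of your running and pending jobs at once, scancel also accepts a user filter:

scancel -u $USER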
