# Lesson 3: SLURM Options
- Introduction
- Overview of Options
- Overview of Environment Variables
- Machine Options
- Misc. Options
- Resources
## Introduction

Whether you are launching an interactive job or a batch job, SLURM lets you specify options that help both you and the scheduler, and hence the other users. Asking for enough memory, CPUs, and time ensures that your jobs will run optimally. Asking for too much increases the time your jobs wait in the queue, and it ties up resources you don't need that other users do.
SLURM lets you fine-tune your jobs with a number of options. Many can be used with one-off, batch, and interactive jobs; some are only available for batch jobs. The options included in this lesson are not exhaustive. For more detail, consult the documentation and other links provided in the resources section.
SLURM allows you to specify these options either in a batch file or at the command line with a call to `sbatch` or `srun`. We have not yet gotten to the `salloc` command, but it can also take command-line options.
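For a taste of `salloc` (covered properly in a later lesson), a minimal interactive session might look like this; the resource values are placeholders:

```bash
# Request an interactive allocation of 4 tasks for 30 minutes,
# then launch job steps inside it with srun:
salloc --ntasks=4 --time=30:00
srun hostname
# Release the allocation when finished:
exit
```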
In addition to flag options, SLURM creates and uses a number of special environment variables that help you configure information passed to the terminal and to output files. You can also set some of these explicitly, either in batch files or with the `export` command at the command line.
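For instance, `sbatch` reads certain input environment variables as defaults for the corresponding flags. A minimal sketch using the input variable `SBATCH_TIMELIMIT` (equivalent to `--time`) and the `mybatch_2.sub` script shown below:

```bash
# SBATCH_TIMELIMIT is an input environment variable that sbatch treats as a
# default for --time; exporting it applies a time limit to submissions that
# don't set one on the command line or in the script.
export SBATCH_TIMELIMIT=1:00:00
sbatch mybatch_2.sub
```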
One specific set of options and environment variables, which we will cover in a different lesson, are those for array jobs, used for launching groups of similar tasks to run simultaneously. Another set we will cover later are those used for creating job dependencies, in which some jobs wait for other jobs to complete before they run.
## Overview of Options

Below is a very brief list of some of the most common options you can supply to SLURM. You can find more in the man pages and in the online documentation linked in the resources section. Most options are fairly straightforward. The machine-level specifications can be a bit subtle; below the table we will discuss the differences and use cases of:
- Tasks vs. Jobs vs. Threads
- Cores vs. Sockets vs. Nodes
| Command | Description |
|---|---|
| `srun --time 1:00:00 --job-name=myjob --output=myjob.out --error=myjob.err myjob arg1 arg2` | Run `myjob arg1 arg2`, asking for at most one hour of run time; write stdout to `myjob.out` and stderr to `myjob.err`. |
| `srun --cpus-per-task=8 --time-min=30:00 my_threaded_job arg1 arg2` | Run `my_threaded_job arg1 arg2`, asking for at least 30 minutes of run time and 8 CPUs. |
Equivalent `sbatch` examples:

`mybatch_1.sub`:

```bash
#!/bin/bash
#SBATCH --time 1:00:00
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
myjob arg1 arg2
```

`mybatch_2.sub`:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --time-min=30:00
my_threaded_job arg1 arg2
```
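Either script is then submitted with `sbatch`, which queues the job and returns immediately:

```bash
sbatch mybatch_1.sub
```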
| Flag | Description |
|---|---|
| `-B, --extra-node-info=<sockets[:cores[:threads]]>` | Restrict node selection to nodes with at least the specified number of sockets, cores per socket, and/or threads per core. Each value specified is considered a minimum. Values can also be specified as min-max. The individual levels can also be specified in separate options if desired (see below): `--sockets-per-node=<sockets>`, `--cores-per-socket=<cores>`, `--threads-per-core=<threads>`. |
| `--cores-per-socket=<cores>` | Restrict node selection to nodes with at least the specified number of cores per socket. |
| `-c, --cpus-per-task=<ncpus>` | Request that ncpus be allocated per process. This may be useful if the job is multithreaded and requires more than one CPU per task for optimal performance. The default is one CPU per process. If `-c` is specified without `-n`, as many tasks will be allocated per node as possible while satisfying the `-c` restriction. For instance, on a cluster with 8 CPUs per node, a job requesting 4 nodes and 3 CPUs per task may be allocated 3 or 6 CPUs per node (1 or 2 tasks per node) depending upon resource consumption by other jobs; such a job may be unable to execute more than a total of 4 tasks. This option may also be useful to spawn tasks without allocating resources to the job step from the job's allocation when running multiple job steps with the `--exclusive` option. |
| `-d, --dependency=<dependency_list>` | Defer the start of this job until the specified dependencies have been satisfied. This option applies only to job allocations, not to job steps (executions of `srun` within an existing `salloc` or `sbatch` allocation). `<dependency_list>` is of the form `<type:job_id[:job_id][,type:job_id[:job_id]]>` or `<type:job_id[:job_id][?type:job_id[:job_id]]>`. |
| `-D, --chdir=<path>` | Have the remote processes `chdir` to path before beginning execution. The default is to `chdir` to the current working directory of the `srun` process. The path can be specified as a full path or relative to the directory where the command is executed. This option applies to job allocations. |
| `-e, --error=<filename pattern>` | Specify how stderr is to be redirected. By default in interactive mode, `srun` redirects stderr to the same file as stdout, if one is specified. |
| `-E, --preserve-env` | Pass the current values of the environment variables `SLURM_NNODES` and `SLURM_NTASKS` through to the executable, rather than computing them from command-line parameters. This option applies to job allocations. |
| `-J, --job-name=<jobname>` | Specify a name for the job. The specified name will appear along with the job ID number when querying running jobs on the system. The default is the supplied executable program's name. |
| `--mem=<size[units]>` | Specify the real memory required per node. Default units are megabytes unless the `SchedulerParameters` configuration parameter includes the `default_gbytes` option for gigabytes. Different units can be specified with the suffix K, M, G, or T. |
| `--mem-per-cpu=<size[units]>` | Minimum memory required per allocated CPU. Default units are megabytes unless the `SchedulerParameters` configuration parameter includes the `default_gbytes` option for gigabytes. Different units can be specified with the suffix K, M, G, or T. |
| `--mincpus=<n>` | Specify a minimum number of logical CPUs/processors per node. |
| `-N, --nodes=<minnodes[-maxnodes]>` | Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, it is used as both the minimum and maximum node count. The partition's node limits supersede those of the job; if a job's node limits are outside of the range permitted for its associated partition, the job will be left in a PENDING state. This permits possible execution at a later time, when the partition limit is changed. |
| `-n, --ntasks=<number>` | Specify the number of tasks to run. The default is one task per node, but note that the `--cpus-per-task` option will change this default. |
| `--ntasks-per-core=<ntasks>` | Request the maximum ntasks be invoked on each core. Meant to be used with the `--ntasks` option. Related to `--ntasks-per-node` except at the core level instead of the node level. |
| `--ntasks-per-node=<ntasks>` | Request that ntasks be invoked on each node. If used with the `--ntasks` option, `--ntasks` takes precedence and `--ntasks-per-node` is treated as a maximum count of tasks per node. Meant to be used with the `--nodes` option. This is related to `--cpus-per-task=ncpus`, but does not require knowledge of the actual number of CPUs on each node. |
| `--ntasks-per-socket=<ntasks>` | Request the maximum ntasks be invoked on each socket. Meant to be used with the `--ntasks` option. Related to `--ntasks-per-node` except at the socket level instead of the node level. |
| `-o, --output=<filename pattern>` | Specify the filename pattern for stdout redirection. By default in interactive mode, `srun` collects stdout from all tasks and sends this output via TCP/IP to the attached terminal. With `--output`, stdout may be redirected to a file, to one file per task, or to /dev/null. If `--error` is not also specified on the command line, both stdout and stderr will be directed to the file specified by `--output`. |
| `-p, --partition=<partition_names>` | Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the SLURM controller to select the default partition as designated by the system administrator. If the job can use more than one partition, specify their names in a comma-separated list; the one offering the earliest initiation will be used, with no regard given to the partition name ordering (although higher-priority partitions will be considered first). |
| `--sockets-per-node=<sockets>` | Restrict node selection to nodes with at least the specified number of sockets. |
| `--threads-per-core=<threads>` | Restrict node selection to nodes with at least the specified number of threads per core. |
| `--time-min=<time>` | Set a minimum time limit on the job allocation. If specified, the job may have its `--time` limit lowered to a value no lower than `--time-min` if doing so permits the job to begin execution earlier than otherwise possible. The job's time limit will not be changed after the job is allocated resources. This is performed by a backfill scheduling algorithm to allocate resources otherwise reserved for higher-priority jobs. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds". |
| `-t, --time=<time>` | Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds". |
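To make several of these flags concrete, here is a sketch of a batch script for a multithreaded program; `my_threaded_app` and the resource values are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=threads-demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=4G
#SBATCH --time=2:00:00

# SLURM sets SLURM_CPUS_PER_TASK to match --cpus-per-task, so the thread
# count follows the allocation if the #SBATCH line above changes.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
my_threaded_app arg1 arg2
```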
## Overview of Environment Variables

When you invoke a SLURM job, SLURM sets a number of environment variables that you can use to customize output file names and control how jobs appear in the queue, which helps you manage your jobs. Some important uses are:
- Suspending/killing/altering jobs by name
- Listing jobs by name (for example, `squeue | grep jobname`)
- Viewing job information to make sure resources were properly allocated (e.g. number of CPUs, tasks, or nodes)
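For example, both of the following list jobs by name (`myjob` is a placeholder):

```bash
# Filter the full queue listing, or ask squeue to filter for you:
squeue | grep myjob
squeue --name=myjob
```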
In Lesson 1 you saw how the output of `squeue` listed jobs by name. In Lesson 4 you will learn how to suspend and cancel jobs by name (and job ID). Below we describe how to use SLURM environment variables to name log files.
The quickest way to get a list of the environment variables is to run:
```
balter@clusthead1:~/slurm_tutorial$ srun env | grep SLURM
SLURM_PRIO_PROCESS=0
SLURM_UMASK=0002
SLURM_CLUSTER_NAME=exacloud-c7
SLURM_SUBMIT_DIR=/home/users/balter/slurm_tutorial
SLURM_SUBMIT_HOST=clusthead1
SLURM_JOB_NAME=env
SLURM_JOB_CPUS_PER_NODE=1
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_DISTRIBUTION=cyclic
SLURM_JOB_ID=87
SLURM_JOBID=87
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=clustnode-4-44
SLURM_JOB_PARTITION=exacloud
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=44477
SLURM_JOB_ACCOUNT=exacloud
SLURM_JOB_QOS=normal
SLURM_STEP_NODELIST=clustnode-4-44
SLURM_JOB_NODELIST=clustnode-4-44
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=44477
SLURM_SRUN_COMM_HOST=172.20.12.33
SLURM_TOPOLOGY_ADDR=clustnode-4-44
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_TASK_PID=103600
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=172.20.12.33
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_JOB_UID=4296
SLURM_JOB_USER=balter
SLURM_JOB_GID=3010
SLURMD_NODENAME=clustnode-4-44
```
| Variable | Meaning |
|---|---|
| `SLURM_JOB_NAME` | Custom job name (the default is the executable's name) |
| `SLURM_JOBID` | Job ID |
| `SLURM_SUBMIT_DIR` | Job submission directory |
| `SLURM_SUBMIT_HOST` | Name of the host from which the job was submitted |
| `SLURM_JOB_NODELIST` | Names of the nodes allocated to the job |
| `SLURM_ARRAY_TASK_ID` | Task ID within a job array |
| `SLURM_JOB_CPUS_PER_NODE` | CPU cores per node allocated to the job |
| `SLURM_NNODES` | Number of nodes allocated to the job |
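As a sketch of the log-file naming described above, a batch script can build file names from these variables at run time; `my_app` is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=analysis
# In #SBATCH filename patterns, %j expands to the job ID:
#SBATCH --output=analysis-%j.out

# Inside the job, the same information is available as environment variables,
# so additional files can be named after the job:
LOG="${SLURM_JOB_NAME}-${SLURM_JOBID}.log"
echo "Running on ${SLURM_JOB_NODELIST}" > "$LOG"
my_app arg1 arg2 >> "$LOG" 2>&1
```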
See also:

- The srun documentation: https://slurm.schedmd.com/srun.html#lbAF
- The BYU job script generator: https://github.com/BYUHPC/BYUJobScriptGenerator
## Misc. Options

Some further `srun` examples. Note that pdebug and pbatch in these examples are site-specific partition names.

| Command | Description |
|---|---|
| `srun -n64 -ppdebug my_app` | 64-process job run interactively in the pdebug partition |
| `srun -N64 -n512 my_threaded_app` | 512-process job using 64 nodes. Assumes the pbatch partition. |
| `srun -N4 -n16 -c4 my_threaded_app` | 4-node, 16-process job with 4 cores (threads) per process. Assumes the pbatch partition. |
| `srun -N8 my_app` | 8-node job with a default value of one task per node (8 tasks). Assumes the pbatch partition. |
| `srun -n128 -o my_app.out my_app` | 128-process job that redirects stdout to the file my_app.out. Assumes the pbatch partition. |
| `srun -n32 -ppdebug -i my.inp my_app` | 32-process interactive job; each process accepts input from a file called my.inp instead of stdin |
## Resources

- Tasks, jobs, CPUs, sockets, and nodes, with tables: https://slurm.schedmd.com/cpu_management.html
- Advanced configuration generator (just for reference): https://slurm.schedmd.com/configurator.html
- Nodes, tasks, and threads: https://computing.llnl.gov/tutorials/bgq/index.html#NodesTasksThreads