lesson 4: managing jobs
The main job management task you will be concerned with is canceling jobs when
- You realize you started them by mistake (you forgot to change the parameter to what you wanted)
- You actually started the wrong job
- The job has been running for waaayyyy too long and is probably hung
- Any other good reason
However, SLURM also allows you to adjust job attributes and options on the fly with scontrol.
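For example, scontrol update can modify a queued or running job's settings. This is just a sketch; the job ID and new time limit below are placeholders, not values from this tutorial.

# extend the time limit of job 2186 to two hours
scontrol update JobId=2186 TimeLimit=2:00:00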
In the last chapter we introduced squeue. Let's start some jobs of our own and look at them in the queue. First, edit sleepytime.sub to sleep for 60 seconds:
./sleepytime.sh 60
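If you don't still have sleepytime.sub from the previous lesson, a minimal version might look something like the following. The job name, partition, time limit, and output pattern here are assumptions; adjust them for your cluster.

#!/bin/bash
#SBATCH --job-name=sleepytime
#SBATCH --partition=exacloud
#SBATCH --time=1:00:00
#SBATCH --output=sleepytime_%j.out

# call the helper script with the number of seconds to sleep
./sleepytime.sh 60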
Then start it twice with sbatch
balter@clusthead1:~/slurm_tutorial$ sbatch sleepytime.sub
Submitted batch job 2186
balter@clusthead1:~/slurm_tutorial$ sbatch sleepytime.sub
Submitted batch job 2187
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:03:52 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2187 exacloud sleepyti balter RUNNING 0:06 1:00:00 1 clustnode-3-56
2186 exacloud sleepyti balter RUNNING 0:10 1:00:00 1 clustnode-3-56
A little later...
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:04:17 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2187 exacloud sleepyti balter RUNNING 0:31 1:00:00 1 clustnode-3-56
2186 exacloud sleepyti balter RUNNING 0:35 1:00:00 1 clustnode-3-56
And finally...
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:04:46 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
Edit sleepytime.sub to sleep for 10 minutes: ./sleepytime.sh 600. Then start the job four times and check the queue.
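One quick way to submit the same script several times is a short bash loop; this is just a convenience, and running sbatch four times by hand works equally well.

# submit sleepytime.sub four times
for i in {1..4}; do
    sbatch sleepytime.sub
done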
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:09:14 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2191 exacloud sleepyti balter PENDING 0:00 1:00:00 1 (None)
2189 exacloud sleepyti balter RUNNING 0:33 1:00:00 1 clustnode-3-56
2190 exacloud sleepyti balter RUNNING 0:33 1:00:00 1 clustnode-3-56
2188 exacloud sleepyti balter RUNNING 0:36 1:00:00 1 clustnode-3-56
To kill one of the jobs, we use scancel and supply the job ID.
balter@clusthead1:~/slurm_tutorial$ scancel 2188
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:09:47 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2191 exacloud sleepyti balter RUNNING 0:32 1:00:00 1 clustnode-3-56
2189 exacloud sleepyti balter RUNNING 1:06 1:00:00 1 clustnode-3-56
2190 exacloud sleepyti balter RUNNING 1:06 1:00:00 1 clustnode-3-56
You can also cancel a range of jobs at once by using bash brace expansion to generate the job IDs:
balter@clusthead1:~/slurm_tutorial$ scancel 219{0..1}
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun 2 13:13:44 2017
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2189 exacloud sleepyti balter RUNNING 5:03 1:00:00 1 clustnode-3-56
There are many other options for scancel that you can read about in the documentation.
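For example, scancel can select jobs by name, state, or user instead of by job ID. These commands are only illustrations; the job name below matches the examples above.

# cancel every job named sleepytime that belongs to you
scancel --name=sleepytime --user=$USER

# cancel only your pending jobs
scancel --state=PENDING --user=$USER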
scontrol
http://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
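A few scontrol subcommands you are likely to reach for are sketched below; the job IDs are placeholders taken from the examples above.

# show everything SLURM knows about a job
scontrol show job 2189

# prevent a pending job from starting, then let it run again
scontrol hold 2191
scontrol release 2191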
Interactive srun jobs launched from the command line (an example is shown below) should normally be terminated with a SIGINT (CTRL-C):
- The first CTRL-C will report the state of the tasks
- A second CTRL-C within one second will terminate the tasks
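For reference, an interactive job of this kind can be launched with something like the following; the partition and time limit are assumptions, so substitute values appropriate for your cluster.

# request an interactive shell on a compute node for 10 minutes
srun --partition=exacloud --time=10:00 --pty bash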