
Lesson 4: Managing Jobs

Unknown edited this page Aug 23, 2017 · 2 revisions

Managing Jobs

The main job management task you will be concerned with is canceling jobs when

  • You realize you started them by mistake (you forgot to change the parameter to what you wanted)
  • You actually started the wrong job
  • The job has been running for waaayyyy too long and is probably hung
  • Any other good reason

However, SLURM will also allow you to adjust job attributes and options on the fly.

Job Status

In the last chapter we introduced squeue. Let's start some jobs of our own and look at them in the queue. First edit sleepytime.sub to sleep for 60 seconds.

./sleepytime.sh 60
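For reference, sleepytime.sub might look something like the sketch below. Only the 1:00:00 time limit is confirmed by the squeue output later in this lesson; the other directives, and the assumption that sleepytime.sh is a simple wrapper around sleep, are illustrative.

```shell
#!/bin/bash
#SBATCH --job-name=sleepytime
#SBATCH --time=1:00:00     # matches the 1:00:00 TIME_LIMIT shown by squeue -l
#SBATCH --nodes=1

# sleepytime.sh is assumed to take a sleep duration in seconds
./sleepytime.sh 60
```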

Then start it twice with sbatch

balter@clusthead1:~/slurm_tutorial$ sbatch sleepytime.sub
Submitted batch job 2186
balter@clusthead1:~/slurm_tutorial$ sbatch sleepytime.sub
Submitted batch job 2187
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:03:52 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2187  exacloud sleepyti   balter  RUNNING       0:06   1:00:00      1 clustnode-3-56
              2186  exacloud sleepyti   balter  RUNNING       0:10   1:00:00      1 clustnode-3-56

A little later...

balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:04:17 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2187  exacloud sleepyti   balter  RUNNING       0:31   1:00:00      1 clustnode-3-56
              2186  exacloud sleepyti   balter  RUNNING       0:35   1:00:00      1 clustnode-3-56

And finally...

balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:04:46 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)

Canceling Jobs

Edit sleepytime.sub to sleep for 10 minutes: ./sleepytime.sh 600. Start the job four times and check the queue.

balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:09:14 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2191  exacloud sleepyti   balter  PENDING       0:00   1:00:00      1 (None)
              2189  exacloud sleepyti   balter  RUNNING       0:33   1:00:00      1 clustnode-3-56
              2190  exacloud sleepyti   balter  RUNNING       0:33   1:00:00      1 clustnode-3-56
              2188  exacloud sleepyti   balter  RUNNING       0:36   1:00:00      1 clustnode-3-56

To kill one of the jobs, we use scancel and supply the job ID.

balter@clusthead1:~/slurm_tutorial$  scancel 2188
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:09:47 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2191  exacloud sleepyti   balter  RUNNING       0:32   1:00:00      1 clustnode-3-56
              2189  exacloud sleepyti   balter  RUNNING       1:06   1:00:00      1 clustnode-3-56
              2190  exacloud sleepyti   balter  RUNNING       1:06   1:00:00      1 clustnode-3-56

You can also cancel a range of jobs at once if you are using the bash shell, since brace expansion turns 219{0..1} into 2190 2191 before scancel runs:

balter@clusthead1:~/slurm_tutorial$ scancel 219{0..1}
balter@clusthead1:~/slurm_tutorial$ squeue -l
Fri Jun  2 13:13:44 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2189  exacloud sleepyti   balter  RUNNING       5:03   1:00:00      1 clustnode-3-56

scancel has many other options that you can read about in its documentation.
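A few of the more useful selectors are shown below. These flags come from the standard scancel man page; the job name matches the examples above.

```shell
# Cancel all of your own jobs
scancel -u $USER

# Cancel every job with a given name
scancel --name=sleepytime

# Cancel only your jobs that are still pending
scancel -u $USER --state=PENDING
```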

Controlling Jobs

scontrol

The scontrol command lets you view detailed job information and modify job attributes (such as the time limit) while a job is pending or running. For more on monitoring and controlling jobs, see:

http://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
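A few common scontrol operations are sketched below. The job ID 2189 is borrowed from the earlier examples, and the new time limit is an arbitrary illustration.

```shell
# Show the full record for a job
scontrol show job 2189

# Prevent a pending job from starting, then release it again
scontrol hold 2189
scontrol release 2189

# Adjust a job attribute on the fly, e.g. its time limit
scontrol update JobId=2189 TimeLimit=2:00:00
```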

Interactive srun jobs launched from the command line should normally be terminated with a SIGINT (CTRL-C):

  • The first CTRL-C reports the state of the tasks
  • A second CTRL-C within one second terminates the tasks
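For example, you could launch an interactive shell on a compute node with srun (the partition name here is taken from the earlier examples) and later stop it with CTRL-C as described:

```shell
# Request an interactive shell; --pty attaches your terminal to the task
srun -p exacloud --pty bash
```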