Explain how to read records of jobs from the database #80

Open
gkaf89 opened this issue Jul 27, 2024 · 1 comment
gkaf89 commented Jul 27, 2024

The command

scontrol show job [<job id>]

can be used to print details of active and past jobs from the database. The relevant section should be extended to explain the meaning of the reported fields.
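
For reference, an illustrative invocation and a truncated excerpt of the kind of fields it reports; the job id and values here are made up, and the exact set of fields depends on the Slurm version and site configuration:

```
$ scontrol show job 123456
JobId=123456 JobName=my_simulation
   UserId=pam(5000) GroupId=clusterusers(600)
   Priority=8742 Account=beta QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=01:12:03 TimeLimit=02:00:00
   Partition=batch NumNodes=2 NumCPUs=256
   TRES=cpu=256,mem=448G,node=2,billing=256
```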

@gkaf89 gkaf89 self-assigned this Jul 27, 2024

gkaf89 commented Dec 27, 2024

Architecture

The job scheduling daemons and the user-interface applications of Slurm all synchronize the system state through a database. The Slurm Database Daemon (slurmdbd) is an application interface between the database and the services and applications that access the database data. The reasons behind the use of slurmdbd are the following.

  • Authenticate communication between users and slurmdbd (using Munge).
  • Only slurmdbd requires permissions to access the database (avoid giving direct access to the users).
  • Implement various functions at the slurmdbd application level to simplify access to the data.

Consider for instance the slurmctld scheduler daemon that runs in every cluster to schedule jobs, the sacct application that extracts information about jobs, and the sacctmgr application that manages accounts. The following diagram demonstrates the access pattern to the cluster database.

[Figure: Slurm database architecture]

There is a single database for all clusters on a site (UL HPC follows the same architecture). This configuration allows for cross-cluster access, for instance in applications that require submission of jobs and job dependencies across clusters. The back end is a MySQL database (the default and most tested option); PostgreSQL is also supported, but a few front-end features are missing.
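
A minimal sketch of what cross-cluster access looks like from the command line, assuming two clusters named iris and aion registered in the same database:

```bash
# Submit to a specific cluster that shares the same slurmdbd/database
sbatch --clusters=iris job.sh

# List your jobs on several clusters at once
squeue --clusters=iris,aion --user="$USER"

# Query accounting data of a given cluster
sacct --clusters=aion --starttime=2024-12-01
```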

Data access patterns

All slurmctld daemons use information stored in the Slurm database to manage the jobs of their cluster. Each slurmctld reads data from slurm.conf on startup and updates the database.

  • slurmdbd will push information to the slurmctld daemons when there is an update in the cluster setup data.
  • slurmctld daemons will cache data if slurmdbd is not responding.

Always update the Slurm database daemon (slurmdbd) first! All components synchronize with the database daemon, and the database daemon supports communication with components a few versions older, but newer components will not communicate with older database daemons.
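
A quick sanity check before upgrading (a sketch; where each command is run depends on the site layout) is to compare the versions reported by the daemons and the client tools:

```bash
# slurmdbd must be the newest component; upgrade it first
slurmdbd -V       # on the database host
slurmctld -V      # on each cluster's controller node
sinfo --version   # client commands on the login nodes
```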

Data schema

The basic unit in the Slurm database is an association, a combination of cluster, account, user name, and (optionally) partition name. Data are maintained by association. For instance, each association can have different limits assigned to it.

User=Pam Account=Beta FairShare=20% MaxTime=2hours MaxJobs=2
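
The associations stored in the database, together with their limits, can be listed with sacctmgr; a minimal sketch (the format= fields are one possible selection):

```bash
# List associations (cluster/account/user/partition) and selected limits
sacctmgr show associations format=Cluster,Account,User,Partition,Fairshare,MaxJobs,MaxWall
```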

Association and account management

Accounts are organized hierarchically, and no account name can be repeated (i.e. the hierarchy forms a tree with no cycles). Coordinators can manage users lower in the hierarchy. Resources are inherited from parent accounts.

[Figure: Slurm resource inheritance]

The command to manage accounts and associations is sacctmgr (a usage sketch follows the list below), which can modify

  • clusters,
  • accounts,
  • users, and
  • limits.
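
A minimal sketch of typical sacctmgr operations, assuming a hypothetical account beta and user pam (these operations require administrator or coordinator privileges):

```bash
# Create an account below an existing parent account in the hierarchy
sacctmgr add account beta Description="Beta project" Organization=university Parent=root

# Create an association for a user under that account
sacctmgr add user pam Account=beta

# Adjust the limits attached to the association
sacctmgr modify user pam set MaxJobs=2 MaxWall=02:00:00
```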

Fair Share Scheduling

Apart from hard absolute limits per association, the fair-share scheduling plugin allows for more flexible management of access to resources. With soft limits, for instance setting the resources available to an association to a fraction of the total, jobs of the association are scheduled normally up to the limit; once the limit is exceeded, the priority of all further scheduled jobs decreases.
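
A sketch of how shares are typically assigned and inspected (the account name and share value are illustrative):

```bash
# Assign a relative share to an account; it feeds the fair-share priority factor
sacctmgr modify account beta set FairShare=20

# Show the priority factors (including fair-share) of pending jobs
sprio -l

# Show fair-share usage and effective shares per association
sshare -a
```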

Resources:

  1. Slurm Database Usage, Part 1
  2. Slurm Database Usage, Part 2

Slurm tools for monitoring jobs

Slurm provides a set of tools to monitor and analyze jobs during their execution and after their completion.

| Command | Description |
| --- | --- |
| sacct | Display detailed accounting data for all jobs and job steps. |
| sreport | Generate aggregate reports from the Slurm accounting data. |
| sstat | Display status information of a running job/step (more detailed than sacct). |
| scontrol show job <job id> | Display detailed information about a job. |
| seff | Take a job id and report the efficiency of that job's CPU and memory utilization. |
| smap (not available in UL HPC) | Show jobs, partitions, and nodes in a graphical network topology. |
| sprio | View the factors that comprise a job's scheduling priority. |
| squeue | List your active jobs and their status. |
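
Typical invocations for a single job (the job id is a placeholder):

```bash
# Accounting record of a job and its steps (works for finished jobs)
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,AllocCPUS

# Live resource usage of a running job step
sstat -j 123456 --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite

# CPU and memory efficiency summary after the job has completed
seff 123456

# Priority factors of pending jobs
sprio -l
```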

Derived scripts

The UL HPC facility provides some convenient shortcuts for a few common commands.

  • slist <job id> calls sacct with some preselected output options, followed by a call to seff (a rough sketch follows).
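
The exact implementation is site specific, but conceptually the wrapper amounts to something like the following sketch (the sacct field selection here is an assumption, not the actual one used by slist):

```bash
# Rough, illustrative equivalent of slist <job id>
slist() {
    local jobid="$1"
    sacct -j "$jobid" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,AllocCPUS
    seff "$jobid"
}
```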

[Figure: Monitoring tools]
