Explain how to read records of jobs from the database #80

Open
gkaf89 opened this issue Jul 27, 2024 · 1 comment
gkaf89 commented Jul 27, 2024

The command

scontrol show job [<job id>]

can be used to print details of active and past jobs from the database. The relevant section should be extended to explain the meaning of the reported fields.
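
For reference, an illustrative invocation and a truncated excerpt of the kind of fields it reports; the job id and values here are made up, and the exact set of fields depends on the Slurm version and site configuration:

```
$ scontrol show job 123456
JobId=123456 JobName=my_simulation
   UserId=pam(5000) GroupId=clusterusers(600)
   Priority=8742 Account=beta QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=01:12:03 TimeLimit=02:00:00
   Partition=batch NumNodes=2 NumCPUs=256
   TRES=cpu=256,mem=448G,node=2,billing=256
```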

@gkaf89 gkaf89 self-assigned this Jul 27, 2024

gkaf89 commented Dec 27, 2024

Architecture

The job scheduling daemons and the user-interface applications of Slurm all synchronize the system state through a database. The Slurm Database Daemon (slurmdbd) is an application interface between the database and the services and applications that access the database data. The reasons behind the use of slurmdbd are the following.

  • Authenticate communication between users and slurmdbd (using Munge).
  • Only slurmdbd requires permissions to access the database (avoid giving direct access to the users).
  • Implement various functions at the slurmdbd application level to simplify access to the data.

Consider for instance the slurmctld scheduler daemon that runs in every cluster to schedule jobs, the sacct application that extracts information about jobs, and the sacctmgr application that manages accounts. The following diagram demonstrates the access pattern to the cluster database.

[Figure: Slurm database architecture]

There is a single database for all clusters on a site (UL HPC follows the same architecture). This configuration allows for cross-cluster access, for instance in applications that require submission of jobs and job dependencies across clusters. The back end is a MySQL database (the default and most tested option); PostgreSQL is also supported, but a few front-end features are missing.
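
A minimal sketch of what cross-cluster access looks like from the command line, assuming two clusters named iris and aion registered in the same database:

```bash
# Submit to a specific cluster that shares the same slurmdbd/database
sbatch --clusters=iris job.sh

# List your jobs on several clusters at once
squeue --clusters=iris,aion --user="$USER"

# Query accounting data of a given cluster
sacct --clusters=aion --starttime=2024-12-01
```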

Data access patterns

All slurmctld daemons use information stored in the Slurm database to manage the jobs of their cluster. Each slurmctld reads data from slurm.conf on startup and updates the database.

  • slurmdbd will push information to the slurmctld daemons when there is an update in the cluster setup data.
  • slurmctld daemons will cache data if slurmdbd is not responding.

Always update the Slurm database daemon (slurmdbd) first! All components synchronize with the database daemon, and the database daemon supports communication with components a few versions older, but newer components will not communicate with older database daemons.
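
A quick sanity check before upgrading (a sketch; where each command is run depends on the site layout) is to compare the versions reported by the daemons and the client tools:

```bash
# slurmdbd must be the newest component; upgrade it first
slurmdbd -V       # on the database host
slurmctld -V      # on each cluster's controller node
sinfo --version   # client commands on the login nodes
```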

Data schema

The basic unit in the Slurm database is an association, a combination of cluster, account, user name, and (optionally) partition name. Data are maintained by association. For instance, each association can have different limits assigned to it.

User=Pam Account=Beta FairShare=20% MaxTime=2hours MaxJobs=2
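
The associations stored in the database, together with their limits, can be listed with sacctmgr; a minimal sketch (the format= fields are one possible selection):

```bash
# List associations (cluster/account/user/partition) and selected limits
sacctmgr show associations format=Cluster,Account,User,Partition,Fairshare,MaxJobs,MaxWall
```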

Association and account management

Accounts are organized hierarchically, and no account name can be repeated (i.e. the hierarchy forms a tree with no cycles). Coordinators can manage users lower in the hierarchy. Resources are inherited from parent accounts.

[Figure: Slurm resource inheritance]

The command to manage accounts and associations is sacctmgr (a usage sketch follows the list below), which can modify

  • clusters,
  • accounts,
  • users, and
  • limits.
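
A minimal sketch of typical sacctmgr operations, assuming a hypothetical account beta and user pam (these operations require administrator or coordinator privileges):

```bash
# Create an account below an existing parent account in the hierarchy
sacctmgr add account beta Description="Beta project" Organization=university Parent=root

# Create an association for a user under that account
sacctmgr add user pam Account=beta

# Adjust the limits attached to the association
sacctmgr modify user pam set MaxJobs=2 MaxWall=02:00:00
```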

Fair Share Scheduling

Apart from hard absolute limits per association, the fair-share scheduling plugin allows for more flexible management of access to resources. With soft limits, for instance setting the resources available to an association to a fraction of the total, jobs of the association are scheduled normally up to the limit; once the limit is exceeded, the priority of all further scheduled jobs decreases.
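
A sketch of how shares are typically assigned and inspected (the account name and share value are illustrative):

```bash
# Assign a relative share to an account; it feeds the fair-share priority factor
sacctmgr modify account beta set FairShare=20

# Show the priority factors (including fair-share) of pending jobs
sprio -l

# Show fair-share usage and effective shares per association
sshare -a
```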

Resources:

  1. Slurm Database Usage, Part 1
  2. Slurm Database Usage, Part 2

Slurm tools for monitoring jobs

Slurm provides a set of tools to monitor and analyze jobs during their execution and after their completion.

| Command | Description |
| --- | --- |
| sacct | Display detailed accounting data for all jobs and job steps. |
| sreport | Generate aggregate reports from the Slurm accounting data. |
| sstat | Display status information of a running job/step (more detailed than sacct). |
| scontrol show job <job id> | Display detailed information about a job. |
| seff | Take a job id and report the efficiency of that job's CPU and memory utilization. |
| smap (not available in UL HPC) | Show jobs, partitions, and nodes in a graphical network topology. |
| sprio | View the factors that comprise a job's scheduling priority. |
| squeue | List your active jobs and their status. |
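
Typical invocations for a single job (the job id is a placeholder):

```bash
# Accounting record of a job and its steps (works for finished jobs)
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,AllocCPUS

# Live resource usage of a running job step
sstat -j 123456 --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite

# CPU and memory efficiency summary after the job has completed
seff 123456

# Priority factors of pending jobs
sprio -l
```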

Derived scripts

The UL HPC facility provides some convenient shortcuts for a few common commands.

  • slist <job id> calls sacct with some preselected output options, followed by a call to seff (a rough sketch follows).
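
The exact implementation is site specific, but conceptually the wrapper amounts to something like the following sketch (the sacct field selection here is an assumption, not the actual one used by slist):

```bash
# Rough, illustrative equivalent of slist <job id>
slist() {
    local jobid="$1"
    sacct -j "$jobid" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,AllocCPUS
    seff "$jobid"
}
```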

[Figure: Monitoring tools]
