Enable use of custom Slurm builds #163

Draft: wants to merge 49 commits into base: master
Changes from 42 commits

49 commits
582e15d
remove drain and resume functionality
sjpb Apr 21, 2023
b5af186
allow install and runtime taskbooks to be used directly
sjpb Apr 25, 2023
a53ba13
Merge branch 'master' into installonly
sjpb Apr 25, 2023
8d3bac8
Merge branch 'master' into installonly
sjpb May 12, 2023
47b2fd1
fix linter complaints
sjpb May 12, 2023
fe139b2
fix slurmctld state
sjpb May 12, 2023
28baf23
Merge branch 'master' into installonly
sjpb Sep 13, 2023
080cf97
move common tasks to pre.yml
sjpb Sep 19, 2023
f83e334
remove unused openhpc_slurm_service
sjpb Sep 19, 2023
77a628f
fix ini_file use for some community.general versions
sjpb Sep 19, 2023
5d88ca5
fix var precedence in molecule test13
sjpb Sep 19, 2023
33ad0e2
fix var precedence in all molecule tests
sjpb Sep 19, 2023
9683401
fix slurmd always starting on control node
sjpb Sep 19, 2023
d4163bc
move install to install-ohpc.yml
sjpb Sep 19, 2023
d4c5621
remove unused ohpc_slurm_services var
sjpb Sep 19, 2023
09cb57a
Merge branch 'installonly' into feat/no-ohpc
sjpb Sep 19, 2023
5090860
add install-generic for binary-only install
sjpb Sep 19, 2023
253f2b1
distinguish between system and user slurm binaries for generic install
sjpb Sep 19, 2023
1b92b5e
remove support for CentOS7 / OpenHPC
sjpb Sep 19, 2023
985dd3d
remove post-configure, not needed as of slurm v20.02
sjpb Sep 19, 2023
bb0ad77
add openmpi/IMB-MPI1 by default for generic install
sjpb Sep 19, 2023
caebc4f
allow removal of slurm.conf options
sjpb Sep 19, 2023
7e71087
update README
sjpb Sep 20, 2023
f658f4b
Merge branch 'installonly' into feat/no-ohpc
sjpb Sep 20, 2023
336ba63
enable openhpc_extra_repos for both generic and ohpc installs
sjpb Sep 20, 2023
050e449
README tweak
sjpb Sep 20, 2023
b096101
add openhpc_config_files parameter
sjpb Sep 20, 2023
d0d7dbf
change library_dir to lib_dir
sjpb Sep 20, 2023
10cb71a
fix perms
sjpb Sep 20, 2023
6168d45
Merge branch 'master' into feat/no-ohpc
sjpb Sep 20, 2023
cb6edfc
fix/silence linter warnings
sjpb Sep 20, 2023
0871414
remove packages only required for hpctests
sjpb Sep 20, 2023
58526d5
document openhpc_config_files restart behaviour
sjpb Sep 22, 2023
0fcaf69
bugfix missing newline in slurm.conf
sjpb Sep 26, 2023
5b9b106
make path for slurm.conf configurable
sjpb Sep 26, 2023
95c4df8
make slurm.conf template src configurable
sjpb Sep 26, 2023
2b8b8c5
symlink slurm user tools so monitoring works
sjpb Sep 27, 2023
edcfb00
fix slurm directories
sjpb Oct 6, 2023
1f14dbd
fix slurmdbd path for non-default slurm.conf paths
sjpb Oct 10, 2023
295f943
Merge branch 'master' into feat/no-ohpc
sjpb Jan 24, 2024
a5d106f
default gres.conf to correct directory
sjpb Feb 16, 2024
5b73b8a
document <absent> for openhpc_config
sjpb Feb 20, 2024
8412606
Merge branch 'master' into feat/no-ohpc
sjpb Feb 27, 2024
69e25ac
minor merge diff fixes
sjpb Feb 27, 2024
23ddc82
Fix EPEL not getting installed
sjpb Feb 27, 2024
59ee7cc
build RL9.3 container images with systemd
sjpb Mar 19, 2024
2aaa605
Merge branch 'master' into feat/no-ohpc
sjpb Mar 20, 2024
513516c
allow use on image containing slurm binaries
sjpb Jul 23, 2024
a34dace
prepend slurm binaries to PATH instead of symlinking
sjpb Jul 24, 2024
50 changes: 34 additions & 16 deletions README.md
@@ -2,35 +2,35 @@

# stackhpc.openhpc

This Ansible role installs packages and performs configuration to provide an OpenHPC v2.x Slurm cluster.
This Ansible role installs packages and performs configuration to provide a Slurm cluster. By default this uses packages from [OpenHPC v2.x](https://openhpc.community/) but it is also possible to use alternative Slurm binaries and packages.

As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.

The minimal image for nodes is a RockyLinux 8 GenericCloud image.

## Task files
This role provides four task files which can be selected by using the `tasks_from` parameter of Ansible's `import_role` or `include_role` modules:
- `main.yml`: Runs `install-ohpc.yml` and `runtime.yml`. Default if no `tasks_from` parameter is used.
- `install-ohpc.yml`: Installs repos and packages for OpenHPC.
- `install-generic.yml`: Installs systemd units etc. for user-provided binaries.
- `runtime.yml`: Slurm/service configuration.
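
As a minimal sketch of selecting task files with `tasks_from`, a playbook for user-provided Slurm binaries might run the generic install followed by the runtime configuration. The group name `cluster`, the use of `become`, and the role being installed under the name `stackhpc.openhpc` are assumptions, and the role variables described below are assumed to be set in the inventory:

```yaml
- hosts: cluster  # hypothetical inventory group
  become: true    # assumption: privilege escalation is needed for installs
  tasks:
    - name: Install systemd units etc. for user-provided Slurm binaries
      ansible.builtin.import_role:
        name: stackhpc.openhpc
        tasks_from: install-generic.yml

    - name: Configure Slurm and its services
      ansible.builtin.import_role:
        name: stackhpc.openhpc
        tasks_from: runtime.yml
```

Omitting `tasks_from` runs `main.yml`, i.e. the OpenHPC install plus the runtime configuration.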

## Role Variables

Variables only relevant for `install-ohpc.yml` or `install-generic.yml` task files are marked as such below.

`openhpc_extra_repos`: Optional list. Extra Yum repository definitions to configure, following the format of the Ansible
[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module. Respected keys for
each list element:
* `name`: Required
* `description`: Optional
* `file`: Required
* `baseurl`: Optional
* `metalink`: Optional
* `mirrorlist`: Optional
* `gpgcheck`: Optional
* `gpgkey`: Optional

`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).
[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module.
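
For illustration, an EPEL 8 repository could be defined as below. The values mirror the EPEL definition in this role's `defaults/main.yml`; because each list element is passed to the `yum_repository` module as-is in this change, any parameter that module accepts should be usable, and `name` is needed because the tasks use it as the loop label:

```yaml
openhpc_extra_repos:
  - name: epel
    file: epel
    description: "Extra Packages for Enterprise Linux 8 - $basearch"
    metalink: "https://mirrors.fedoraproject.org/metalink?repo=epel-8&arch=$basearch&infra=$infra&content=$contentdir"
    gpgcheck: true
    gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8"
```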

`openhpc_slurm_service_enabled`: Optional boolean, whether to enable the appropriate slurm service (slurmd/slurmctld). Default `true`.

`openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.

`openhpc_slurm_control_host`: Required string. Ansible inventory hostname (and short hostname) of the controller e.g. `"{{ groups['cluster_control'] | first }}"`.

`openhpc_slurm_control_host_address`: Optional string. IP address or name to use for the `openhpc_slurm_control_host`, e.g. to use a different interface than is resolved from `openhpc_slurm_control_host`.

`openhpc_packages`: additional OpenHPC packages to install.
`openhpc_packages`: additional OpenHPC packages to install (`install-ohpc.yml` only).

`openhpc_enable`:
* `control`: whether to enable control host
@@ -46,7 +46,19 @@ each list element:

`openhpc_login_only_nodes`: Optional. If using "configless" mode specify the name of an ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.

`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, You can either supply your own module system or go without one.
`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one (`install-ohpc.yml` only).

`openhpc_generic_packages`: Optional. List of system packages to install, see `defaults/main.yml` for details (`install-generic.yml` only).

`openhpc_sbin_dir`: Optional. Path to slurm daemon binaries such as `slurmctld`, default `/usr/sbin` (`install-generic.yml` only).

`openhpc_bin_dir`: Optional. Path to Slurm user binaries such as `sinfo`, default `/usr/bin` (`install-generic.yml` only).

`openhpc_lib_dir`: Optional. Path to Slurm libraries, default `/usr/lib64/slurm` (`install-generic.yml` only).
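
As a sketch of how these variables might be set for a custom Slurm build, assuming it was installed under a hypothetical `/opt/slurm` prefix (the paths are illustrative and must match the actual build; the package list is simply the role default):

```yaml
# All paths below are hypothetical and depend on how Slurm was built/installed
openhpc_sbin_dir: /opt/slurm/sbin           # contains slurmctld, slurmd, slurmdbd
openhpc_bin_dir: /opt/slurm/bin             # contains sinfo, squeue, sbatch, ...
openhpc_lib_dir: /opt/slurm/lib64/slurm     # Slurm libraries
openhpc_generic_packages:                   # role default, repeated for clarity
  - munge
  - mariadb-connector-c  # only required on slurmdbd
  - hwloc-libs           # only required on slurmd
```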

`openhpc_config_files`: Optional. List of additional Slurm configuration files to template. A change to any templated file will restart `slurmctld` and all `slurmd`s. By default only `gres.conf` is templated, on the control node. List elements are dicts which must contain:
- `template`: A dict with parameters for Ansible's [template](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html) module.
- `enable`: String `control`, `batch`, `database` or `runtime` specifying nodes to template this file on (i.e. matches keys from `openhpc_enable`). Any other string results in no templating.
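
A sketch of adding a second file follows. Because setting `openhpc_config_files` replaces the default list, the default `gres.conf` entry is repeated here to keep it. The `cgroup.conf.j2` template, its ownership and mode are hypothetical; such a template would need to be supplied by the user:

```yaml
openhpc_config_files:
  # Role default entry, repeated so gres.conf is still templated
  - template:
      dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf"
      src: "{{ openhpc_gres_template }}"
      mode: "0600"
      owner: slurm
      group: slurm
    enable: control
  # Hypothetical extra file, templated on nodes where openhpc_enable.batch is true
  - template:
      dest: "{{ openhpc_slurm_conf_path | dirname }}/cgroup.conf"
      src: cgroup.conf.j2  # hypothetical user-supplied template
      mode: "0644"
      owner: root
      group: root
    enable: batch
```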

### slurm.conf

@@ -79,12 +91,18 @@ For each group (if used) or partition any nodes in an ansible inventory group `<

`openhpc_cluster_name`: name of the cluster.

`openhpc_config`: Optional. Mapping of additional parameters and values for `slurm.conf`. Note these will override any included in `templates/slurm.conf.j2`.
`openhpc_config`: Optional. Mapping of additional parameters and values for `slurm.conf`. Note these will override any included in `templates/slurm.conf.j2`. Setting a parameter's value to the string `<absent>` will omit a parameter which is included in the template.
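
As an illustrative sketch, the mapping below sets one list-valued and one scalar parameter and removes a third. List values are joined with commas when written to `slurm.conf`. The parameter names are real `slurm.conf` options, but whether `ReturnToService` is actually set in `templates/slurm.conf.j2` is an assumption that should be checked against the template:

```yaml
openhpc_config:
  SlurmctldParameters:         # written as SlurmctldParameters=enable_configless
    - enable_configless
  SlurmctldDebug: debug        # illustrative scalar value
  ReturnToService: "<absent>"  # assumed to be in the template; omitted from slurm.conf
```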

`openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for the partition in slurm.conf. Can be overridden on a per-partition basis using `openhpc_slurm_partitions.ram_multiplier`. Has no effect if `openhpc_slurm_partitions.ram_mb` is set.

`openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))

`openhpc_slurmd_spool_dir`: Optional. Absolute path for slurmd state (`slurm.conf` parameter [SlurmdSpoolDir](https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir))

`openhpc_slurm_conf_template`: Optional. Path of Jinja template for slurm.conf configuration file. Default is `slurm.conf.j2` template in role. **NB:** The required templating is complex; if just setting specific parameters, use `openhpc_config` instead.

`openhpc_slurm_conf_path`: Optional. Path which the `slurm.conf` configuration file is templated to. Default `/etc/slurm/slurm.conf`.
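
A sketch of relocating configuration and state, using hypothetical paths. In this change the default `gres.conf` destination and the `slurmdbd.conf` destination are both derived from the directory of `openhpc_slurm_conf_path`, so they would follow it:

```yaml
# Hypothetical locations; the role creates these directories on the relevant nodes
openhpc_slurm_conf_path: /opt/slurm/etc/slurm.conf
openhpc_state_save_location: /var/spool/slurmctld  # StateSaveLocation (control node)
openhpc_slurmd_spool_dir: /var/spool/slurmd        # SlurmdSpoolDir (batch nodes)
```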

#### Accounting

By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting)). To enable accounting:
29 changes: 21 additions & 8 deletions defaults/main.yml
@@ -14,8 +14,18 @@ openhpc_job_maxtime: '60-0' # quote this to avoid ansible converting some format
openhpc_config: "{{ openhpc_extra_config | default({}) }}"
openhpc_gres_template: gres.conf.j2
openhpc_slurm_configless: "{{ 'enable_configless' in openhpc_config.get('SlurmctldParameters', []) }}"

openhpc_state_save_location: /var/spool/slurm
openhpc_slurmd_spool_dir: /var/spool/slurm
openhpc_slurm_conf_path: /etc/slurm/slurm.conf
openhpc_slurm_conf_template: slurm.conf.j2
openhpc_config_files:
  - template:
      dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf"
      src: "{{ openhpc_gres_template }}"
      mode: "0600"
      owner: slurm
      group: slurm
    enable: control

# Accounting
openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}"
@@ -45,6 +55,15 @@ openhpc_enable:
database: false
runtime: false

# Only used for install-generic.yml:
openhpc_generic_packages:
- munge
- mariadb-connector-c # only required on slurmdbd
- hwloc-libs # only required on slurmd
openhpc_sbin_dir: /usr/sbin # path to slurm daemon binaries (e.g. slurmctld)
openhpc_bin_dir: /usr/bin # path to slurm user binaries (e.g. sinfo)
openhpc_lib_dir: /usr/lib64/slurm # path to slurm libraries

# Repository configuration
openhpc_extra_repos: []

@@ -62,22 +81,16 @@ ohpc_openhpc_repos:
baseurl: "http://repos.openhpc.community/OpenHPC/2/updates/CentOS_8"
gpgcheck: true
gpgkey: https://raw.githubusercontent.com/openhpc/ohpc/v2.6.1.GA/components/admin/ohpc-release/SOURCES/RPM-GPG-KEY-OpenHPC-2

ohpc_default_extra_repos:
"8":
- name: epel
file: epel
description: "Extra Packages for Enterprise Linux 8 - $basearch"
metalink: "https://mirrors.fedoraproject.org/metalink?repo=epel-8&arch=$basearch&infra=$infra&content=$contentdir"
gpgcheck: true
gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8"

# Concatenate all repo definitions here
ohpc_repos: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] + ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"

openhpc_munge_key:
openhpc_login_only_nodes: ''
openhpc_module_system_install: true
openhpc_module_system_install: true # only works for install-ohpc.yml/main.yml

# Auto detection
openhpc_ram_multiplier: 0.95
76 changes: 76 additions & 0 deletions tasks/install-generic.yml
@@ -0,0 +1,76 @@
- include_tasks: pre.yml

- name: Create a list of slurm daemons
  set_fact:
    _ohpc_daemons: "{{ _ohpc_daemon_map | dict2items | selectattr('value') | items2dict | list }}"
  vars:
    _ohpc_daemon_map:
      slurmctld: "{{ openhpc_enable.control }}"
      slurmd: "{{ openhpc_enable.batch }}"
      slurmdbd: "{{ openhpc_enable.database }}"

- name: Ensure extra repos
  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
  loop: "{{ openhpc_extra_repos }}"
  loop_control:
    label: "{{ item.name }}"

- name: Install system packages
  dnf:
    name: "{{ openhpc_generic_packages }}"

- name: Create Slurm user
  user:
    name: slurm
    comment: SLURM resource manager
    home: /etc/slurm
    shell: /sbin/nologin

- name: Create Slurm unit files
  template:
    src: "{{ item }}.service.j2"
    dest: /lib/systemd/system/{{ item }}.service
    owner: root
    group: root
    mode: ug=rw,o=r
  loop: "{{ _ohpc_daemons }}"
  register: _slurm_systemd_units

- name: Get current library locations
  shell:
    cmd: "ldconfig -v | grep -v ^$'\t'" # noqa: no-tabs risky-shell-pipe
  register: _slurm_ldconfig
  changed_when: false

- name: Add library locations to ldd search path
  copy:
    dest: /etc/ld.so.conf.d/slurm.conf
    content: "{{ openhpc_lib_dir }}"
    owner: root
    group: root
    mode: ug=rw,o=r
  when: openhpc_lib_dir not in _ldd_paths
  vars:
    _ldd_paths: "{{ _slurm_ldconfig.stdout_lines | map('split', ':') | map('first') }}"

- name: Reload Slurm unit files
  # Can't do just this from systemd module
  command: systemctl daemon-reload # noqa: command-instead-of-module no-changed-when no-handler
  when: _slurm_systemd_units.changed

- name: Find user binaries
  find:
    paths: "{{ openhpc_bin_dir }}"
  register: _ohpc_binaries

- name: Symlink slurm user binaries into $PATH
  file:
    src: "{{ item.path }}"
    state: link
    dest: "{{ ('/usr/bin', item.path | basename) | path_join }}"
    owner: root
    group: root
    mode: u=rwx,go=rx
  loop: "{{ _ohpc_binaries.files }}"
  loop_control:
    label: "{{ item.path }}"
18 changes: 8 additions & 10 deletions tasks/install.yml → tasks/install-ohpc.yml
@@ -3,16 +3,14 @@
- include_tasks: pre.yml

- name: Ensure OpenHPC repos
ansible.builtin.yum_repository:
name: "{{ item.name }}"
description: "{{ item.description | default(omit) }}"
file: "{{ item.file }}"
baseurl: "{{ item.baseurl | default(omit) }}"
metalink: "{{ item.metalink | default(omit) }}"
mirrorlist: "{{ item.mirrorlist | default(omit) }}"
gpgcheck: "{{ item.gpgcheck | default(omit) }}"
gpgkey: "{{ item.gpgkey | default(omit) }}"
loop: "{{ ohpc_repos }}"
ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
loop: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] }}"
loop_control:
label: "{{ item.name }}"

- name: Ensure extra repos
ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
loop: "{{ openhpc_extra_repos }}"
loop_control:
label: "{{ item.name }}"

2 changes: 1 addition & 1 deletion tasks/main.yml
@@ -2,7 +2,7 @@

- name: Install packages
block:
- include_tasks: install.yml
- include_tasks: install-ohpc.yml
when: openhpc_enable.runtime | default(false) | bool
tags: install

42 changes: 22 additions & 20 deletions tasks/runtime.yml
@@ -20,12 +20,19 @@

- name: Ensure Slurm directories exists
file:
path: "{{ openhpc_state_save_location }}"
path: "{{ item.path }}"
owner: slurm
group: slurm
mode: 0755
mode: '0755'
state: directory
when: inventory_hostname == openhpc_slurm_control_host
loop:
- path: "{{ openhpc_state_save_location }}" # StateSaveLocation
enable: control
- path: "{{ openhpc_slurm_conf_path | dirname }}"
enable: runtime
- path: "{{ openhpc_slurmd_spool_dir }}" # SlurmdSpoolDir
enable: batch
when: "openhpc_enable[item.enable] | default(false) | bool"

- name: Generate a Munge key on control host
# NB this is usually a no-op as the package install actually generates a (node-unique) one, so won't usually trigger handler
@@ -65,7 +72,7 @@
- name: Template slurmdbd.conf
template:
src: slurmdbd.conf.j2
dest: /etc/slurm/slurmdbd.conf
dest: "{{ openhpc_slurm_conf_path | dirname }}/slurmdbd.conf"
mode: "0600"
owner: slurm
group: slurm
@@ -82,7 +89,7 @@

- name: Template basic slurm.conf
template:
src: slurm.conf.j2
src: "{{ openhpc_slurm_conf_template }}"
dest: "{{ _slurm_conf_tmpfile.path }}"
lstrip_blocks: true
mode: 0644
@@ -98,6 +105,7 @@
section: ''
value: "{{ (item.value | join(',')) if (item.value is sequence and item.value is not string) else item.value }}"
no_extra_spaces: true
state: "{{ 'absent' if item.value == '<absent>' else 'present' }}"
create: no
mode: 0644
loop: "{{ openhpc_config | dict2items }}"
@@ -109,27 +117,21 @@
- name: Create slurm.conf
copy:
src: "{{ _slurm_conf_tmpfile.path }}"
dest: /etc/slurm/slurm.conf
dest: "{{ openhpc_slurm_conf_path }}"
owner: root
group: root
mode: 0644
when: openhpc_enable.control | default(false) or not openhpc_slurm_configless
notify:
- Restart slurmctld service
notify: Restart slurmctld service
register: ohpc_slurm_conf
# NB uses restart rather than reload as number of nodes might have changed

- name: Create gres.conf
template:
src: "{{ openhpc_gres_template }}"
dest: /etc/slurm/gres.conf
mode: "0600"
owner: slurm
group: slurm
when: openhpc_enable.control | default(false) or not openhpc_slurm_configless
notify:
- Restart slurmctld service
register: ohpc_gres_conf
- name: Template other Slurm configuration files
template: "{{ item.template }}" # noqa: risky-file-permissions
loop: "{{ openhpc_config_files }}"
when: "openhpc_enable[item.enable] | default(false) | bool"
notify: Restart slurmctld service
register: ohpc_other_conf
# NB uses restart rather than reload as this is needed in some cases

- name: Remove local tempfile for slurm.conf templating
@@ -147,7 +149,7 @@
changed_when: true
when:
- openhpc_slurm_control_host in ansible_play_hosts
- hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
- hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_other_conf.changed # noqa no-handler
notify:
- Restart slurmd service

1 change: 1 addition & 0 deletions templates/slurm.conf.j2
@@ -10,6 +10,7 @@
#
ClusterName={{ openhpc_cluster_name }}
SlurmctldHost={{ openhpc_slurm_control_host }}{% if openhpc_slurm_control_host_address is defined %}({{ openhpc_slurm_control_host_address }}){% endif %}

#SlurmctldHost=
#
#DisableRootJobs=NO
22 changes: 22 additions & 0 deletions templates/slurmctld.service.j2
@@ -0,0 +1,22 @@
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service
Wants=network-online.target
ConditionPathExists={{ openhpc_slurm_conf_path }}

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
ExecStart={{ openhpc_sbin_dir }}/slurmctld -D -s -f {{ openhpc_slurm_conf_path }} $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target