Development notes
What information is required as input to the cluster/nodes.
Groups:
- login
- compute
- control
Group/host vars:
Odd things
- For smslabs, control node needs to know login private IP because `openondemand_servername` is defined using it in group_vars/all/openondemand.yml as we use SOCKS proxy to access. But generally, `grafana` (default: control) will need to know `openondemand` (default: login) external address.
Full list for `everything` cluster is shown below. Note that `api_address` and `internal_address` for hosts both default to `inventory_hostname`.
- `openhpc_cluster_name`: Cluster name. No default, must be set.
- `openhpc_slurm_control_host`: Slurmctld address. Default in common:all:openhpc = `{{ groups['control'] | first }}`.
  - NB: maybe should use `.internal_address`?
  - Required for all `openhpc` hosts. Is needed as `delegate_to` so must be an inventory_hostname. Is also used as address of slurm controller, which is overloading it really.
  - Note Slurm assumes slurmdbd and slurm.conf are in same directory, how does this work configless?
  - For `slurmd` nodes, we could rewrite /etc/sysconfig/slurmd using cloud-config's `write_files`.
- `openhpc_slurm_partitions`: Partition definitions. Default in common:all:openhpc is single 'compute' partition. NB: requires group `"{{ openhpc_cluster_name }}_compute"` in environment inventory. Could check groups during validation??
  - Host requirements & comments as above (but for control only)
- `nfs_server`: Default in common:all:nfs is `nfs_server_default` -> `"{{ hostvars[groups['control'] | first ].internal_address }}"`. Required for all clients.
  - For client nodes, could rewrite `fstab` (done by https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/nfs-clients.yml) using cloud-config's mount module.
- `elasticsearch_address`: Default in common:all:defaults is `{{ hostvars[groups['opendistro'].0].api_address }}`. Required for `filebeat` and `grafana` hosts.
  - Usage: usage search
  - ansible/roles/filebeat/tasks/config.yml templates out from `filebeat_config_path` which is [environments/common/files/filebeat/filebeat.yml](https://github.com/stackhpc/ansible-slurm-appliance/blob/main/environments/common/files/filebeat/filebeat.yml). This contains:
        output.elasticsearch:
          hosts: ["{{ elasticsearch_address }}:9200"]
          protocol: "https"
          ssl.verification_mode: none
          username: "admin"
          password: "{{ vault_elasticsearch_admin_password }}"
    (docs). It looks like these settings support environment variables, so this could potentially be set from a systemd unit file fragment.
- `prometheus_address`: Default in common:all:defaults is `{{ hostvars[groups['prometheus'].0].api_address }}`. Required for `prometheus` and `grafana` hosts - link.
- `openondemand_address`: Default in common:all:defaults is `{{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}`. Required for prometheus host - NB this should probably be in prometheus group vars.
- `grafana_address`: Default in common:all:grafana is `{{ hostvars[groups['grafana'].0].api_address }}`. Required for grafana host link.
  - This should probably be moved to common:all:defaults in line with other service endpoints.
- `openondemand_servername`: Non-functional default `''`, must be set. Required for `openondemand` host and `grafana` host link when both grafana and openondemand exist (which they do for `everything`). NB this probably requires either a) a FIP or b) a fixed IP when using SOCKS proxy. For latter case this means the control host needs to have the login node's fixed IP available.
- All the secrets in environment:all:secrets - see secret role's defaults:
  - grafana, elasticsearch, mysql (x2) passwords (all potentially depending on group placement)
  - `vault_openhpc_mungekey` -> `openhpc_munge_key` (for all openhpc nodes):
    - could rewrite /etc/munge/munge.key using cloud-init `write_files`.
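The cloud-init ideas noted in the list above (rewriting /etc/sysconfig/slurmd and /etc/munge/munge.key with `write_files`, and replacing the fstab edits with the mounts module) could be combined into a single user-data file. The following is a minimal, untested sketch. The address 10.0.0.10, the export path, the `SLURMD_OPTIONS` line and the key content are all placeholder assumptions, not values taken from the appliance.

```yaml
#cloud-config
# Sketch only: addresses, paths and the key value below are placeholders.
write_files:
  # Point slurmd at the controller for configless operation.
  # Assumes the slurmd unit reads SLURMD_OPTIONS from this file.
  - path: /etc/sysconfig/slurmd
    owner: root:root
    permissions: "0644"
    content: |
      SLURMD_OPTIONS='--conf-server 10.0.0.10'
  # Munge key shared by all openhpc nodes, supplied base64-encoded.
  # Assumes the munge user already exists in the image.
  - path: /etc/munge/munge.key
    owner: munge:munge
    permissions: "0400"
    encoding: b64
    content: UkVQTEFDRV9XSVRIX1JFQUxfS0VZ  # placeholder, not a real key
mounts:
  # Each entry: [device, mount point, fs type, options, dump, pass]
  - ["10.0.0.10:/exports/home", "/home", "nfs", "defaults,_netdev", "0", "0"]
```

Whether this is practical depends on the image already containing the slurm, munge and NFS client packages, since cloud-init would only be supplying the per-cluster values here.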
Which roles can we ONLY run the install tasks from, to build a cluster-independent(*)/no-config image?
In-appliance roles:
- basic_users: n/a
- block_devices: n/a
- filebeat: n/a but downloads Docker container at service start.
- grafana-dashboards: Downloads grafana dashboards
- grafana-datasources: n/a
- hpctests: n/a but reqd. packages are installed as part of `openhpc_default_packages`.
- opendistro: n/a but downloads Docker container at service start.
- openondemand:
  - `main.yml` unnamed task does rpm installs using osc.ood:install-rpm.yml
  - `main.yml` unnamed task does rpm installs using pam_auth.yml.
  - `main.yml` unnamed task does git downloads using osc.ood:install-apps.yml
  - `jupyter_compute.yml`: Does package installs
  - `vnc_compute.yml`: Does package installs
- passwords: n/a
- podman: `prereqs.yml` does package installs
Out of appliance roles:
- stackhpc.nfs: [main.yml](https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/main.yml) installs packages.
- stackhpc.openhpc: Required packages and `openhpc_packages` (see above) are installed in install.yml, but this requires the `openhpc_slurm_service` fact set from `main.yml`.
- cloudalchemy.node_exporter:
  - install.yml does binary download from github but also propagation. Could pre-download it and use `node_exporter_binary_local_dir` but install.yml still needs running as it does user creation too.
  - selinux.yml also does package installations
- cloudalchemy.blackbox-exporter: Currently unused.
- cloudalchemy.prometheus: install.yml. Same comments as for `cloudalchemy.node_exporter` above.
- cloudalchemy.alertmanager: Currently unused.
- cloudalchemy.grafana: install.yml does package updates.
- geerlingguy.mysql: setup-RedHat.yml does package updates BUT needs variables.yml running to load appropriate variables.
- jriguera.configdrive: Unused, should be deleted.
- osc.ood: See `openondemand` above.
(*) It's not really cluster-independent as which features are turned on where may vary.
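For roles where the install steps live in their own task file, one way to run only those steps during an image build is `include_role` with `tasks_from`. The play below is an untested sketch using two roles from the lists above. The `builder` host group is hypothetical, and it assumes that loading `variables.yml` first is enough for `setup-RedHat.yml` in geerlingguy.mysql to find the variables it needs.

```yaml
# Hypothetical image-build play; `builder` is an assumed inventory group.
- hosts: builder
  become: true
  tasks:
    - name: Install podman prerequisites only
      ansible.builtin.include_role:
        name: podman
        tasks_from: prereqs.yml

    # Load OS-specific variables before running the package install tasks.
    - name: Load geerlingguy.mysql variables
      ansible.builtin.include_role:
        name: geerlingguy.mysql
        tasks_from: variables.yml

    - name: Install MySQL packages only
      ansible.builtin.include_role:
        name: geerlingguy.mysql
        tasks_from: setup-RedHat.yml
```

Role defaults are still loaded when `tasks_from` is used, but facts set in a role's `main.yml` are not. That is why roles like stackhpc.openhpc, whose install.yml depends on such a fact, would need extra handling.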