
Development notes

Steve Brasier edited this page Mar 30, 2022 · 28 revisions

Interfaces

What information is required as input to the cluster/nodes?

Groups:

  • login
  • compute
  • control

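As a sketch, a minimal environment inventory defining these groups might look like the following (hostnames are illustrative, not from any real environment):

```ini
# Hypothetical inventory sketch -- hostnames are illustrative only
[login]
mycluster-login-0

[compute]
mycluster-compute-[0:1]

[control]
mycluster-control-0
```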
Group/host vars:

Key things:

  • All slurmd nodes (default: login, compute) need to know the control node's address.
  • All NFS client nodes (default: login, compute) need to know the server's (default: control) address.
  • For smslabs, the control node needs to know the login node's private IP, because openondemand_servername is defined using it in group_vars/all/openondemand.yml (we use a SOCKS proxy for access). More generally, grafana (default: control) will need to know the openondemand (default: login) external address.

Full list:

  • Cluster name. Var openhpc_cluster_name. REQUIRED in environment inventory

  • Slurmctld address. Var openhpc_slurm_control_host. Default in common:all:openhpc = {{ groups['control'] | first }}. NB: maybe should use .internal_address?

  • Partition definitions. Var openhpc_slurm_partitions. Default in common:all:openhpc is single 'compute' partition. NB: requires group "{{ openhpc_cluster_name }}_compute" in environment inventory. Could check groups during validation??

  • If using nfs: Var nfs_server. Default in common:all:nfs is nfs_server_default -> "{{ hostvars[groups['control'] | first ].internal_address }}".

  • All the "service endpoints" in common:all:defaults:

      elasticsearch_address: "{{ hostvars[groups['opendistro'].0].api_address }}"
      prometheus_address: "{{ hostvars[groups['prometheus'].0].api_address }}"
      openondemand_address: "{{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}"
    
  • All the secrets in environment:all:secrets - see the secret role's defaults:

    • grafana, elasticsearch, mysql (x2) passwords (all potentially depending on group placement)
    • munge key (for all openhpc nodes)
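Pulling the items above together, an environment's group_vars might look roughly like this (a sketch only; values are illustrative and the secrets are left to the secret role's defaults):

```yaml
# Hypothetical environment group_vars/all sketch -- values are illustrative
openhpc_cluster_name: mycluster

# Requires an inventory group "mycluster_compute" (see above)
openhpc_slurm_partitions:
  - name: compute

# Matches the nfs_server_default expression quoted above
nfs_server: "{{ hostvars[groups['control'] | first].internal_address }}"
```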

Running install tasks only

From which roles can we run ONLY the install tasks, to build a cluster-independent(*)/no-config image?

In-appliance roles:

  • basic_users: n/a
  • block_devices: n/a
  • filebeat: n/a, but downloads a Docker container at service start.
  • grafana-dashboards: downloads Grafana dashboards.
  • grafana-datasources: n/a
  • hpctests: n/a, but required packages are installed as part of openhpc_default_packages.
  • opendistro: n/a, but downloads a Docker container at service start.
  • openondemand:
  • passwords: n/a
  • podman: prereqs.yml does package installs.

Out of appliance roles:

  • stackhpc.nfs: [main.yml](https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/main.yml) installs packages.
  • stackhpc.openhpc: required packages and openhpc_packages (see above) are installed in install.yml, but this requires the openhpc_slurm_service fact set from main.yml.
  • cloudalchemy.node_exporter:
    • install.yml downloads the binary from GitHub but also propagates it to hosts. Could pre-download it and use node_exporter_binary_local_dir, but install.yml still needs running as it also does user creation.
    • selinux.yml also does package installations
  • cloudalchemy.blackbox-exporter: Currently unused.
  • cloudalchemy.prometheus: install.yml. Same comments as for cloudalchemy.node_exporter above.
  • cloudalchemy.alertmanager: Currently unused.
  • cloudalchemy.grafana: install.yml does package updates.
  • geerlingguy.mysql: setup-RedHat.yml does package updates, BUT needs variables.yml run first to load the appropriate variables.
  • jriguera.configdrive: Unused, should be deleted.
  • osc.ood: See openondemand above.
(*) It's not really cluster-independent, as which features are turned on where may vary.
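As a sketch of how an image-build play might pull in only a role's install tasks (assuming the role exposes an install.yml entry point; as noted above, the task file names and prerequisites vary per role):

```yaml
# Hypothetical image-build play -- runs only install-type tasks
- hosts: builder
  become: true
  tasks:
    - name: Run node_exporter install tasks only
      import_role:
        name: cloudalchemy.node_exporter
        tasks_from: install.yml
```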