Fix slurm configuration in prod2313 #17

Closed. sjpb wants to merge 2 commits.

sjpb (Collaborator) commented Jan 31, 2024

  • Using ansible-role-openhpc#163 ("Enable use of custom Slurm builds") means openhpc_slurmd_spool_dir has to be specified instead of openhpc_config_extra.SlurmdSpoolDir, so that the role can actually create the spool dir.
  • The openhpc_config_extra dict is defined both in environments/nrel/inventory/group_vars/openhpc/overrides and in environments/{prod,vtest}/inventory/group_vars/all/openhpc-generic-slurm.yml. Ansible replaces dicts by default rather than merging them, so only one definition takes effect, and in both environments the desired nrel config is not actually getting applied.
  • environments/nrel/inventory/group_vars/openhpc/overrides.yml defines openhpc_config_extra.StateSaveLocation: /var/spool/slurm/slurmctld. Neither the prod nor the vtest terraform defines volumes, and there's no state share in environments/nrel/inventory/group_vars/os_manila/overrides.yml. Is state on a persistent disk at all in the cluster? If not, we should fix this: appliances_state_dir should be set, then the defaults will do the right thing (once that override is removed). NB: retrieve the slurmctld state from the current directory BEFORE reimaging the cluster!
  • openhpc_packages_extra_nrel -> openhpc_packages_extra, which won't be applied when using generic Slurm. The list also contains a lot of OpenHPC-specific packages, and it needs to be combined with the example openhpc_generic_packages provided.
  • openhpc_*_dir is defined differently in nrel and vtest. Is this required?
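A rough sketch of how the first three points might look in group_vars, not the actual change in this PR. The paths `/var/spool/slurm/slurmd` and `/var/lib/state`, the split variables `openhpc_config_extra_generic` and `openhpc_config_extra_nrel`, and the `SlurmctldDebug` entry are all assumptions for illustration; only `openhpc_slurmd_spool_dir`, `openhpc_config_extra` and `appliances_state_dir` come from the description above.

```yaml
# Sketch only: values and the *_generic / *_nrel variable names are hypothetical.

# Set the spool dir through the role variable so the role creates it,
# instead of openhpc_config_extra.SlurmdSpoolDir.
openhpc_slurmd_spool_dir: /var/spool/slurm/slurmd  # hypothetical path

# Keep the generic and site-specific slurm.conf parameters in separate dicts...
openhpc_config_extra_generic: {}  # placeholder for the generic-slurm settings
openhpc_config_extra_nrel:
  SlurmctldDebug: info  # illustrative key only; real nrel parameters go here
  # No SlurmdSpoolDir or StateSaveLocation entries here.

# ...and merge them explicitly, so neither definition silently replaces the other.
openhpc_config_extra: "{{ openhpc_config_extra_generic | combine(openhpc_config_extra_nrel) }}"

# Point state at persistent storage; with the StateSaveLocation override removed,
# the defaults are expected to derive the slurmctld state location from this.
appliances_state_dir: /var/lib/state  # hypothetical mount point of a persistent volume
```

With `combine`, keys in the right-hand dict take precedence, so nrel-specific values would override generic ones where both define the same parameter.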

sjpb (Collaborator, Author) commented Feb 1, 2024

Replaced by PR on correctly-named branch: #18

sjpb closed this on Feb 1, 2024